Closed by LourensVeen 2 years ago
Hi Lourens. I actually welcome this move, as TCP is indeed more basic (i.e. easier to set up on supercomputers) and also a bit easier to optimise for performance, should we ever need to do that :).
Well, gRPC runs over TCP, and it is very well optimised because much of Google's infrastructure runs on it, and at that scale every cycle counts. Also, it is used only for communication between the instances and the manager, which is not performance-sensitive (communication between instances is done peer-to-peer, outside of the manager).
The problem is that if you're outside of Google, it's quite a hassle to use: it's not very well documented, the generated code is completely incomprehensible, and the build system is unreliable. I've had to pin it to one particular version which mostly works, and the MUSCLE3 build system then applies some patches to protobuf and gRPC to get it to work in other places as well. But I keep running into problems with it. We already have a non-gRPC communication protocol, which is used for peer-to-peer communication between instances and for the TCP barrier used when receiving messages in MPI instances, so using that instead will probably actually simplify the system. And yes, that system can be optimised (and extended to support MPI, for instance) in the future as well.
Released with 0.5.0.
While gRPC and protobuf are a very nice solution for communicating between the instances and the manager, it's just way too much of a hassle to get them installed if you're not Google and building the latest version with Blaze internally. Whether any given release will actually build seems to be a complete toss-up, and even after carefully testing many of them and finding one that, with an added patch, seemed to work everywhere, we're now having build problems again.
Since we already have TCP servers and clients, and MessagePack for encoding things, it shouldn't be that much work to implement the MUSCLE manager protocol using that instead, and get rid of the dependency altogether.
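To give a rough idea of what that could look like, here is a minimal sketch of a request/reply exchange over a TCP-style socket with length-prefixed frames. This is not the actual MUSCLE3 wire format: the 8-byte little-endian length prefix, the `register_instance` request type, the field names, and all function names are assumptions made up for illustration. It uses `msgpack` if that package is installed and otherwise falls back to JSON purely so that the sketch runs with the standard library alone.

```python
import json
import socket
import struct
import threading

try:
    import msgpack  # third-party MessagePack package, if available

    def encode(obj):
        return msgpack.packb(obj, use_bin_type=True)

    def decode(data):
        return msgpack.unpackb(data, raw=False)
except ImportError:
    # Stand-in encoding so the sketch still runs without msgpack.
    def encode(obj):
        return json.dumps(obj).encode("utf-8")

    def decode(data):
        return json.loads(data.decode("utf-8"))


def recv_exactly(sock, n):
    """Read exactly n bytes, since recv() may return fewer."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection")
        buf.extend(chunk)
    return bytes(buf)


def send_frame(sock, obj):
    """Send one message: assumed 8-byte length prefix, then the payload."""
    payload = encode(obj)
    sock.sendall(struct.pack("<Q", len(payload)) + payload)


def recv_frame(sock):
    """Receive one length-prefixed message and decode it."""
    (length,) = struct.unpack("<Q", recv_exactly(sock, 8))
    return decode(recv_exactly(sock, length))


def serve_one(sock):
    """Hypothetical manager side: answer a single request."""
    request = recv_frame(sock)
    if request["type"] == "register_instance":
        send_frame(sock, {"status": "ok", "instance": request["name"]})
    else:
        send_frame(sock, {"status": "error", "message": "unknown request"})


def demo():
    # A connected socket pair stands in for a real TCP connection.
    manager_sock, instance_sock = socket.socketpair()
    server = threading.Thread(target=serve_one, args=(manager_sock,))
    server.start()
    send_frame(instance_sock, {"type": "register_instance", "name": "macro"})
    reply = recv_frame(instance_sock)
    server.join()
    manager_sock.close()
    instance_sock.close()
    return reply
```

Calling `demo()` sends one hypothetical registration request and returns the manager's reply as a dictionary. A real implementation would of course reuse the existing TCP server and client classes rather than raw sockets.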