irmen / Pyro5

Pyro 5 - Python remote objects
https://pyro5.readthedocs.io
MIT License
305 stars 36 forks

Does this (or Pyro4) support batching of large data jobs? #6

Closed johann-petrak closed 5 years ago

johann-petrak commented 5 years ago

Like many other libraries, Pyro makes it easy to "call" a function running in a different process or on a different machine, which is great. But all the examples I have seen so far send only a small amount of data to the worker process, and the worker is expected to finish quickly.

But can Pyro (4 or 5) be used to distribute long-running processing of huge amounts of data among several workers? In this scenario a batch would be much bigger than what fits into memory, and ideally k processes, possibly spread over l different machines, would each pick up the next piece of work from that data as soon as they finish the previous one.

In this scenario we do not want to consume memory by queuing up too much data for the workers, but we also do not want any worker to sit idle waiting for work while there is still data left to process.

Does Pyro5/4 provide anything to implement this?

irmen commented 5 years ago

Hello and thank you for this question. You're correct that Pyro's principal use is to facilitate remote method invocation between (usually Python) objects on different machines. Of course this also means you could very well pass a huge set of data as a parameter or return value, but Pyro is not optimized for that use case (and it won't work if the data doesn't fit in memory).

I've written a bit about this in Pyro's manual, see https://pyro4.readthedocs.io/en/stable/tipstricks.html#binary-data-transfer-file-transfer

It mostly discusses alternatives for sending large amounts of data back to a client as a return value, for instance Pyro's remote iterator feature, which allows chunked streaming of that data. It also refers to the filetransfer example (among the Pyro4 examples), which opens a temporary raw socket to transfer the file outside of Pyro itself.
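The chunked-streaming idea can be sketched as a generator on the server side; the `file_chunks` name and the chunk size are illustrative, and the Pyro wiring (shown in comments) assumes Pyro5 is installed and a daemon is already set up.

```python
import io

CHUNK_SIZE = 64 * 1024  # illustrative chunk size

# On the server, a generator like this would be a method on an object
# registered with a Pyro daemon and decorated with @Pyro5.api.expose.
# Pyro's remote-iterator support then streams each chunk to the client
# as the client iterates over the result of the proxy call.
def file_chunks(stream, chunk_size=CHUNK_SIZE):
    """Yield successive binary chunks of a stream until it is exhausted."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            return
        yield chunk

# Client side, conceptually:
#   proxy = Pyro5.api.Proxy("PYRONAME:example.filetransfer")
#   with open("out.bin", "wb") as f:
#       for chunk in proxy.download("some-file"):
#           f.write(chunk)

# Local demonstration of the chunking itself, without a network:
data = b"x" * 200_000
assert b"".join(file_chunks(io.BytesIO(data))) == data
```

This keeps only one chunk in memory at a time on both ends, which is the point: the full payload never has to fit in a single message.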

(Edit: regarding passing big data as a method argument to the server, this is something you have to deal with yourself. That should be fairly easy because your own client code has full control; it's trivial to pass the data to the server in chunks. It is not possible to stream it in a single call, though. But sequential Pyro calls on the same proxy reuse the proxy's single socket connection to the server, so at a lower level you could consider this streaming chunks to the server.)
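That chunk-wise upload could look like the following sketch. The `begin_upload` / `upload_chunk` / `end_upload` methods are hypothetical names for server methods you would implement yourself; they are not part of Pyro's API.

```python
# Client-side sketch: push a large payload to the server in pieces over
# one proxy. Sequential calls on the same proxy reuse a single socket
# connection, so this is effectively chunk-wise streaming to the server.
# The proxy's methods (begin_upload / upload_chunk / end_upload) are
# hypothetical; you would write the matching server object yourself.

def upload_in_chunks(proxy, stream, chunk_size=64 * 1024):
    """Send a binary stream to the server one chunk at a time."""
    upload_id = proxy.begin_upload()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        proxy.upload_chunk(upload_id, chunk)
    return proxy.end_upload(upload_id)
```

Here `proxy` would be a `Pyro5.api.Proxy` in practice, but the function only assumes the three methods exist, so it works with any object implementing them.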

I hope this answers your question.

I know of several projects that transmit large volumes of (satellite) image data over Pyro without problems, so perhaps there isn't a problem at all where you suspect one?

johann-petrak commented 5 years ago

Thank you, but this is not so much about the mechanism of transfer as about the logic of transfer in a parallel/distributed processing setting: when I have 100 workers and a million pieces of data, and each piece can keep a worker busy for anywhere between a few milliseconds and several minutes, the process that produces the data cannot simply invoke the remote workers synchronously.

The only way I can see is for the producer to put all the data into a queue, with each worker retrieving the next item from the queue as soon as it finishes the previous one. But that queue must be bounded in size; otherwise, if the producer is much faster than the workers, the queue would fill up with almost all of the million data pieces, eating up all memory or other resources. So how is this normally done in the Pyro context? Does Pyro provide a size-limited queue for this? What other method is used to distribute large amounts of work among the remote workers?

irmen commented 5 years ago

Aha, I see. No, Pyro itself does not provide such a mechanism; you'll have to design and build it yourself (which does let you tailor it to your precise needs). There are some producer/consumer-style examples among the Pyro4 examples you could look at, but it really sounds like you are looking for a distributed data-flow framework, which Pyro is not. Apache Spark, Apache Kafka, or Dask might be a better choice.
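As a sketch of how one might build it on top of Pyro: wrap a bounded `queue.Queue` in an object served by a Pyro daemon. `put()` blocks the producer when the queue is full, so memory stays bounded, while idle workers can always pull the next item. The class and method names here are illustrative, not a Pyro feature; the Pyro wiring is indicated in comments.

```python
import queue

class WorkDispatcher:
    """Bounded work queue shared by one producer and many workers.

    To serve this over the network, decorate the class with
    @Pyro5.api.expose and register an instance with a Pyro daemon;
    the daemon's worker threads all see the same thread-safe queue.
    """

    def __init__(self, maxsize=100):
        # A bounded queue: put() blocks once maxsize items are pending,
        # which throttles a fast producer instead of exhausting memory.
        self._queue = queue.Queue(maxsize=maxsize)

    def put_work(self, item):
        """Called by the producer; blocks until a slot is free."""
        self._queue.put(item)

    def get_work(self, timeout=5.0):
        """Called by workers in a loop; returns None if no work arrives
        within the timeout (the caller decides whether to retry)."""
        try:
            return self._queue.get(timeout=timeout)
        except queue.Empty:
            return None
```

When served remotely, the blocking timeouts should be kept shorter than the proxies' network timeouts, so a blocked `put_work` or empty `get_work` returns to the caller instead of triggering a communication error.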

irmen commented 5 years ago

I'll close this question for now; don't hesitate to comment again if the need arises.