Open joernhees opened 7 years ago
You are right, SCOOP generates futures from every element of the iterable first. The reason is to be able to distribute those tasks to remote workers. If map_as_completed
returned an iterable, tasks would be emitted one after another, making the computation serial and not parallel, unless we batch the inputs, as you said.
Batching the input explicitly would require the user to manually emit batches when he desires. I haven't found a way to provide a clean and simple API to perform this.
Batching implicitly (on SCOOP's end) would delay the scheduling of later tasks, which could hinder load balancing. I am not formally against this, as long as the default value does not cause surprise to the power user and performs suboptimally for the beginner. The emission of tasks should be done in the communication thread, though, it should not wait for a call to SCOOP API from the user's code. I'm open to pull requests in this direction.
Perhaps a solution would be for the user to provide an argument. 'take=1000' Then the map could take 1000, balance and dispatch then take another 1000 until stop iteration.
My system runs out of RAM when the iterator is consumed all at once 🤗🤓
passing huge lists / iterables into
map
ormap_as_completed
will first "register" them all for computation and only after it exhausted them all, compute them in parallel.try running the following with
python -m scoop example.py
and notice how nothing is printed for a long time:I think the reason is in https://github.com/soravux/scoop/blob/master/scoop/futures.py#L94 . In python 3
map
returns an iterable. Even for python 2 it would be cool if the internal function would batch the input.