soravux / scoop

SCOOP (Scalable COncurrent Operations in Python)
https://github.com/soravux/scoop
GNU Lesser General Public License v3.0
634 stars 87 forks source link

map should iterate #51

Open joernhees opened 7 years ago

joernhees commented 7 years ago

passing huge lists / iterables into map or map_as_completed will first "register" them all for computation and only after it exhausted them all, compute them in parallel.

try running the following with python -m scoop example.py and notice how nothing is printed for a long time:

#from scoop.futures import map as parallel_map
from scoop.futures import map_as_completed as parallel_map

def square(x):
    return x * x

if __name__ == '__main__':
    squares = parallel_map(square, range(1000000))
    for sq in squares:
        print(sq)

I think the reason is in https://github.com/soravux/scoop/blob/master/scoop/futures.py#L94 . In python 3 map returns an iterable. Even for python 2 it would be cool if the internal function would batch the input.

soravux commented 7 years ago

You are right, SCOOP generates futures from every element of the iterable first. The reason is to be able to distribute those tasks to remote workers. If map_as_completed returned an iterable, tasks would be emitted one after another, making the computation serial and not parallel, unless we batch the inputs, as you said.

Batching the input explicitly would require the user to manually emit batches when he desires. I haven't found a way to provide a clean and simple API to perform this.

Batching implicitly (on SCOOP's end) would delay the scheduling of later tasks, which could hinder load balancing. I am not formally against this, as long as the default value does not cause surprise to the power user and performs suboptimally for the beginner. The emission of tasks should be done in the communication thread, though, it should not wait for a call to SCOOP API from the user's code. I'm open to pull requests in this direction.

mjmdavis commented 7 years ago

Perhaps a solution would be for the user to provide an argument. 'take=1000' Then the map could take 1000, balance and dispatch then take another 1000 until stop iteration.

mjmdavis commented 7 years ago

My system runs out of RAM when the iterator is consumed all at once 🤗🤓