lithops-cloud / lithops

A multi-cloud framework for big data analytics and embarrassingly parallel jobs that provides a universal API for building parallel applications in the cloud ☁️🚀
http://lithops.cloud
Apache License 2.0

chunking of large input array #511

Closed · gilv closed this issue 3 years ago

gilv commented 3 years ago

When the input array is large, it might be valuable to support chunking it. For example, if iterdata = [array of length 10000] and the chunk size is 1000, there will be 10 invocations, each processing 1000 elements.

More info can be found here https://stackoverflow.com/questions/3822512/chunksize-parameter-in-pythons-multiprocessing-pool-map
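
Roughly, the splitting I have in mind (a plain-Python sketch of the semantics, not Lithops code):

```python
def chunks(iterdata, chunk_size):
    """Yield successive chunks of chunk_size elements from iterdata."""
    for i in range(0, len(iterdata), chunk_size):
        yield iterdata[i:i + chunk_size]

iterdata = list(range(10000))
batches = list(chunks(iterdata, 1000))
assert len(batches) == 10  # 10 invocations, each processing 1000 elements
```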

JosepSampe commented 3 years ago

@gilv Good idea! This is something we should add. Just note that in Lambda, Azure, and other serverless platforms this is called batching ;) We could then add a new parameter to the map function called batch_size or something like this.
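
Something like this (just a sketch; batch_size is the name being proposed here, not an existing parameter):

```python
import lithops

def double(x):
    return x * 2

fexec = lithops.FunctionExecutor()
# batch_size is hypothetical: 10000 elements with batch_size=1000
# would mean 10 invocations, each applying double() to 1000 elements
futures = fexec.map(double, list(range(10000)), batch_size=1000)
print(fexec.get_result(futures))
```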

gilv commented 3 years ago

@JosepSampe does the multiprocessing Pool support chunking in Lithops? https://docs.python.org/release/2.6.6/library/multiprocessing.html#multiprocessing.pool.multiprocessing.Pool.map
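
For reference, this is how the chunksize parameter works in the standard library's Pool:

```python
from multiprocessing import Pool

def square(x):
    return x * x

if __name__ == '__main__':
    with Pool(processes=4) as pool:
        # chunksize groups the 10000 tasks into batches of 100,
        # cutting down the per-task dispatch overhead
        results = pool.map(square, range(10000), chunksize=100)
```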

gilv commented 3 years ago

@JosepSampe it should be chunksize based on their documentation

aitorarjona commented 3 years ago

@gilv The current lithops.multiprocessing.Pool implementation ignores the chunksize parameter, since this functionality is not currently supported by Lithops' core map. If the batch_size argument suggested by @JosepSampe is implemented, then chunksize could be passed directly to Lithops' map as batch_size.
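
The pass-through could look roughly like this (a simplified sketch, assuming the proposed batch_size lands in the core map):

```python
# Hypothetical simplification of lithops.multiprocessing.Pool.map
class Pool:
    def __init__(self, executor):
        self._executor = executor

    def map(self, func, iterable, chunksize=None):
        # forward the MP-style chunksize straight to the core map's
        # (proposed) batch_size parameter
        futures = self._executor.map(func, list(iterable), batch_size=chunksize)
        return self._executor.get_result(futures)
```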

JosepSampe commented 3 years ago

@gilv We already have a chunk_size parameter used by the COS partitioner. IMO, if we also call this one chunksize it will be very confusing.
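
For context, roughly how the existing chunk_size is used with the object partitioner (the bucket/key here is just an example):

```python
import lithops

def count_bytes(obj):
    # obj.data_stream holds one partition of the object
    return len(obj.data_stream.read())

fexec = lithops.FunctionExecutor()
# each 64 MiB byte-range partition of the object becomes one invocation
futures = fexec.map(count_bytes, 'cos://my-bucket/my-key', chunk_size=64 * 1024 ** 2)
print(fexec.get_result(futures))
```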

gilv commented 3 years ago

@JosepSampe good point... so can we extend chunk_size to support chunking of arrays as well, and then @aitorarjona can implement chunking for the multiprocessing API?

JosepSampe commented 3 years ago

@gilv Currently you can have an iterdata that contains 1000 references to COS objects. So if a user sets chunk_size=5, how should we act? Should we create 5 chunks of the iterdata, which will result in 5 functions/VMs, or should we split each file into 5 parts, resulting in 5000 functions?

I agree that since the MP API has chunksize, it is better to use the same variable name, just to be consistent everywhere. One solution I see here is to create the new variable chunksize for creating iterdata chunks (like the MP API), and then rename the current chunk_size to partitionsize (or something like this) for creating object partitions. The main constraint is that it is not backwards compatible. The main benefit is that it is much easier for end users to understand, rather than encapsulating multiple different logics in the same variable name, which would be very confusing.
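
To make it concrete, the two parameters would be used like this (a sketch with the names as proposed above; neither exists yet in this form):

```python
import lithops

def my_func(x):
    return x * 2

def my_obj_func(obj):
    return len(obj.data_stream.read())

fexec = lithops.FunctionExecutor()

# chunksize: group iterdata elements into batches, MP-style (proposed name)
futures = fexec.map(my_func, list(range(10000)), chunksize=1000)

# partitionsize: split each storage object into byte-range partitions
# (what chunk_size does today; proposed rename)
futures = fexec.map(my_obj_func, 'cos://bucket/key', partitionsize=64 * 1024 ** 2)
```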

JosepSampe commented 3 years ago

done in #553