@gilv Good idea! This is something we must add. Just note that in Lambda, Azure, and other serverless platforms this is called batching ;) Then we can add a new parameter in the map function called `batch_size` or something like this.
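As a minimal, pure-Python sketch of the grouping such a `batch_size` parameter would perform (the `make_batches` helper is hypothetical, not part of Lithops):

```python
def make_batches(iterdata, batch_size):
    # group a flat input list into sublists of batch_size elements;
    # each sublist would then become a single function invocation
    return [iterdata[i:i + batch_size]
            for i in range(0, len(iterdata), batch_size)]

batches = make_batches(list(range(10000)), batch_size=1000)
print(len(batches))     # 10 -> one invocation per batch
print(len(batches[0]))  # 1000 elements handled inside each invocation
```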
@JosepSampe does multiprocessing pool support chunking in Lithops? https://docs.python.org/release/2.6.6/library/multiprocessing.html#multiprocessing.pool.multiprocessing.Pool.map
@JosepSampe it should be `chunksize`, based on their documentation.
@gilv The current `lithops.multiprocessing.Pool` implementation ignores the `chunksize` parameter, since this functionality is not currently supported by Lithops' core map. If the `batch_size` argument suggested by @JosepSampe is implemented, then `chunksize` could be passed directly to Lithops' map `batch_size`.
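For comparison, this is the standard-library behaviour being mirrored; a runnable, stdlib-only example of `chunksize` in `multiprocessing.Pool.map`:

```python
from multiprocessing import Pool

def square(x):
    return x * x

if __name__ == '__main__':
    with Pool(processes=4) as pool:
        # chunksize groups the 10000 inputs into chunks of 1000, so each
        # worker task receives 1000 items instead of one item at a time
        results = pool.map(square, range(10000), chunksize=1000)
        print(len(results))  # 10000 results, computed in chunks of 1000
```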
@gilv We already have the `chunk_size` parameter used by the COS partitioner. IMO, if we call it `chunksize` it will be very confusing.
@JosepSampe good point... so can we extend `chunk_size` to support chunking of arrays as well, and then @aitorarjona can implement chunking for the multiprocessing API?
@gilv Currently you can have an `iterdata` that contains 1000 references to COS objects. So if a user sets `chunk_size=5`, how do we have to act? Do we have to create 5 chunks of `iterdata`, which will result in 5 functions/VMs, or do we have to split each file into 5 parts, resulting in 5000 functions?

I agree that if the MP API has `chunksize`, it is better to use the same variable name, just to be consistent everywhere. One solution I see here is to create a new `chunksize` variable for creating `iterdata` chunks (like the MP API), and then rename the current `chunk_size` to `partitionsize` (or something like this) for creating object partitions. The main constraint of this is that it is not backwards compatible. The main benefit is that it is much easier for end users to understand than encapsulating multiple different logics under the same variable name, which would be very confusing.
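To make the ambiguity concrete, a small sketch of the two interpretations described above (both helper names are made up for illustration, they are not the Lithops internals):

```python
def split_into_n_chunks(refs, n):
    # Interpretation 1: create n chunks of object references,
    # resulting in n functions/VMs
    size = -(-len(refs) // n)  # ceiling division
    return [refs[i:i + size] for i in range(0, len(refs), size)]

def split_each_object(refs, parts):
    # Interpretation 2: split each object into `parts` partitions,
    # resulting in len(refs) * parts functions
    return [(ref, part) for ref in refs for part in range(parts)]

refs = ['cos://bucket/key-{}'.format(i) for i in range(1000)]
print(len(split_into_n_chunks(refs, 5)))  # 5 invocations
print(len(split_each_object(refs, 5)))    # 5000 invocations
```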
done in #553
When the input array is large, it might be valuable to support chunking it. For example, if `iterdata` is an array of length 10000 and the chunk size is 1000, there will be 10 invocations, each processing 1000 elements. More info can be found here: https://stackoverflow.com/questions/3822512/chunksize-parameter-in-pythons-multiprocessing-pool-map
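The invocation count follows directly: it is the input length divided by the chunk size, rounded up. A tiny check of that arithmetic:

```python
import math

def num_invocations(n_items, chunk_size):
    # each invocation handles up to chunk_size elements
    return math.ceil(n_items / chunk_size)

print(num_invocations(10000, 1000))  # 10
print(num_invocations(10500, 1000))  # 11 (last invocation gets 500 items)
```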