oxinabox opened this issue 4 years ago
Exposing different batching strategies is interesting. Perhaps we can just provide utility functions to calculate the batch size to be passed to the `batch_size` kwarg. This would require minimal changes to the code and would let us define arbitrary batch-size-calculating functions separately. The disadvantage is that it would be a two-step API, which, for advanced users, I think is not that big of a deal.
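For concreteness, a minimal sketch of the two-step usage, with `calc_batchsize` and `some_batched_map` standing in for hypothetical names:

```julia
using Base.Threads: nthreads

# Placeholder utility; any batch-size heuristic could slot in here.
calc_batchsize(nitems) = max(1, div(nitems, nthreads()))

xs = rand(10_000)
bs = calc_batchsize(length(xs))        # step 1: compute the batch size
# step 2: hand the result to whichever function takes the batch_size kwarg, e.g.
# result = some_batched_map(sin, xs; batch_size = bs)   # hypothetical function
```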
This is the same as https://github.com/tkf/Transducers.jl/issues/201, so I am basically copy-pasting that issue:
Several functions take a `batchsize` option. One main use of this is, when threading, to avoid the cost of `@spawn` dominating the cost of the actual work. Set it too low and the `@spawn` cost dominates. Set it too high, and if the work is uneven then some threads will be sitting around with nothing to do. `batchsize` makes this easy to specify if you know roughly how long each item should take to process; a rule of thumb is to set `batchsize` such that processing that many items takes about 1 ms.

If on the other hand you don't really have a good idea of how long something takes, but know how even it is, then something else is desired. If the work is expected to be exactly even then the optimum is `batchsize = div(length(work), nthreads())`. If one wants to soften that, because one is less confident about how even it is, then perhaps `batchsize = div(length(work), 10nthreads())`.
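To make those heuristics concrete, here is a rough Julia sketch; the function names are just illustrative:

```julia
using Base.Threads: nthreads

# Even split: one batch per thread (only optimal if the work is perfectly uniform).
even_split(nitems) = max(1, div(nitems, nthreads()))

# Softened split: ~10 batches per thread, hedging against uneven work
# or a thread being taken by another process.
soft_split(nitems) = max(1, div(nitems, 10 * nthreads()))

# Rule of thumb from a per-item time estimate: choose a batch size such that
# processing one batch takes roughly `target` seconds (about 1 ms).
time_based(seconds_per_item; target = 1e-3) =
    max(1, round(Int, target / seconds_per_item))
```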
I am not sure of the best way to expose this. One option might be to have, say, `batchsize=0` or `batchsize=:even` do the even splits. I suspect even splits aren't a great default even in the equal-work case, though, since a thread might get taken by another process (outside of dedicated machines). Another might be `basesize=:auto` to do, say, `batchsize = div(length(work), 10nthreads())`, which is probably a better bet than even.

Perhaps a fuller API would be useful, say a `sizing` option taking a number of possible values like `sizing = BatchSize(1)`, `sizing = Even()`, or `sizing = TimeEstimate(mean=0.1, std=0.5)`.
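A rough sketch of what such a `sizing` API could look like; the strategy types and the `compute_batchsize` helper are illustrative, not an existing implementation:

```julia
using Base.Threads: nthreads

abstract type SizingStrategy end

struct BatchSize <: SizingStrategy   # explicit, fixed batch size
    n::Int
end

struct Even <: SizingStrategy end    # one batch per thread

Base.@kwdef struct TimeEstimate <: SizingStrategy
    mean::Float64                    # estimated seconds per item
    std::Float64 = 0.0               # spread of the estimate (unused in this sketch)
end

# Turn a strategy plus the amount of work into a concrete batch size.
compute_batchsize(s::BatchSize, nitems) = s.n
compute_batchsize(::Even, nitems) = max(1, div(nitems, nthreads()))
compute_batchsize(s::TimeEstimate, nitems; target = 1e-3) =
    max(1, round(Int, target / s.mean))

# e.g. compute_batchsize(Even(), 10_000) with 4 threads gives 2500.
```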