mohamed82008 / KissThreading.jl

Simple patterns for working with threads in Julia
MIT License

More options like batchsize #25

Open oxinabox opened 4 years ago

oxinabox commented 4 years ago

This is the same as https://github.com/tkf/Transducers.jl/issues/201 So I am basically copypasting that issue:

Several functions take a batchsize option.

One main use of this is when threading to avoid the cost of @spawn dominating over the cost of the actual work. Set it too low and @spawn cost dominates. Set it too high, and if the work is uneven then some threads will be sitting around with nothing to do.

batchsize makes it easy to specify, if you know roughly how long each item takes to process. A rule of thumb is something like: set batchsize so that processing one batch takes about 1ms.
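The 1 ms rule of thumb can be sketched as a tiny helper; the name and `target` default here are purely illustrative, not part of KissThreading.jl:

```julia
# Hypothetical helper: pick a batch size so that processing one batch
# takes roughly `target` seconds, given a rough per-item time estimate.
batchsize_for_time(per_item_seconds; target = 1e-3) =
    max(1, round(Int, target / per_item_seconds))

batchsize_for_time(1e-5)  # items taking ~10 µs → batches of ~100
```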

If on the other hand you don't really have a good idea how long each item takes, but you know how even the work is, then something else is desired. If the work is expected to be exactly even, then the optimal is batchsize = div(length(work), nthreads()). If one wants to soften that because one is less confident how even it is, then perhaps: batchsize = div(length(work), 10nthreads()).
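The two splits above can be written as small helpers (the names are hypothetical; the `max(1, …)` guard just avoids a zero batch size for short inputs):

```julia
using Base.Threads: nthreads

# Perfectly even work: one batch per thread.
even_batchsize(work) = max(1, div(length(work), nthreads()))

# Softened split: ~10 batches per thread, to absorb mild unevenness
# or a thread being stolen by another process.
soft_batchsize(work) = max(1, div(length(work), 10 * nthreads()))
```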

I am not sure the best way to expose this. One option might be to have say batchsize=0 or batchsize=:even to do the even splits. I suspect even splits isn't a great default even in the equal-work case though, since a thread might get taken by another process (outside of dedicated machines). Another might be basesize=:auto to do, say, batchsize = div(length(work), 10nthreads()), which is probably a better bet than even.

Perhaps a fuller API would be useful: say a sizing option taking a number of possible values, like sizing = BatchSize(1), sizing = Even(), or sizing = TimeEstimate(mean=0.1, std=0.5).
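A minimal sketch of what that sizing API could look like; everything here (the type names, the `resolve` function, the variance heuristic) is a proposal, not existing KissThreading.jl code:

```julia
using Base.Threads: nthreads

# Sketch of the proposed `sizing` strategies.
abstract type BatchSizing end

struct BatchSize <: BatchSizing   # fixed, user-chosen batch size
    n::Int
end
struct Even <: BatchSizing end    # one batch per thread
struct TimeEstimate <: BatchSizing  # rough per-item timing (seconds)
    mean::Float64
    std::Float64
end
TimeEstimate(; mean, std) = TimeEstimate(mean, std)

# Turn a strategy into a concrete batch size for `work`.
resolve(s::BatchSize, work) = s.n
resolve(::Even, work) = max(1, div(length(work), nthreads()))
# Crude heuristic: aim for ~`target` seconds per batch, penalizing variance.
resolve(s::TimeEstimate, work; target = 1e-3) =
    max(1, round(Int, target / (s.mean + s.std)))
```

Internally the threaded functions would then just call resolve once and proceed with a plain integer batch size, so the existing batching code would be unchanged.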

mohamed82008 commented 4 years ago

Exposing different batching strategies is interesting. Perhaps we can just provide utility functions that calculate the batch size to be passed to the batch_size kwarg. This would require minimal changes to the code and would let us define arbitrary batch-size-calculating functions separately. The disadvantage is that it would be a two-step API, which, for advanced users, I think is not that big of a deal.