RobertRosca opened this issue 4 years ago
I'm just looking through the issues you created last night. Is it worth taking one of the code paths which currently uses Dask, implementing the same calculation with multiprocessing, and seeing how the performance compares?
There are two angles to this. Obviously if we can get similar or better performance with a simpler system, we could just use that. But also, if we do use Dask, knowing how it compares to multiprocessing gives us a better idea how fast we can expect it to be, and whether rechunking & transfer overhead is a problem.
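For instance, one way to put a number on the transfer/rechunk overhead before comparing against multiprocessing (just a sketch, not midtools code: `performance_report` is part of `dask.distributed`, but the cluster size, array shape and the rechunk/mean pipeline below are invented stand-ins for the real SAXS step):

```python
import dask.array as da
from dask.distributed import Client, performance_report

client = Client(n_workers=4)  # illustrative local cluster, not the SLURM setup

# Toy stand-in for the real analysis: a reduction over a stack of detector frames,
# with a deliberate rechunk in the middle to force inter-worker transfers.
frames = da.random.random((2000, 256, 256), chunks=(100, 256, 256))
pipeline = frames.rechunk((2000, 32, 32)).mean(axis=0)

# Writes an HTML report with the task stream and the time spent on transfers
# and rechunking, which gives a concrete number for data-movement overhead.
with performance_report(filename="saxs-perf.html"):
    pipeline.compute()
```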
Yeah, the original plan I worked out with Mario was that we would move away from Dask and I would rewrite the analysis to use multiprocessing, or Ray, or something with more manual control so that we can know exactly what's going on.
However, I thought we should wait until midtools is stable: these manually written distributed-computing functions might end up being pretty specialised, so if the analysis has to change, the multiprocessing code would probably have to change too, and doing that very close to (or during) the experiment could be quite difficult and time-consuming.
The other thing to consider is that comparing the performance of Dask while the task graph looks... like it currently does would be a bit unfair, so if we want a good comparison we should probably figure out how to 'fix' the current Dask behaviour before benchmarking it against alternative approaches.
Based on my current experience, I'm not convinced that it's easier to rapidly change code using Dask than code using multiprocessing. It's theoretically simpler if you can rely on the abstraction Dask provides, but if that abstraction breaks, understanding what's going on and how to make it work can be much more effort. And the use cases involving `apply_along_axis` are probably easy to do in multiprocessing. This is up to you as the people who're going to be working on it, though.
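As a rough illustration, a minimal multiprocessing sketch of an `apply_along_axis`-style step; `per_pixel_func`, the array shape and the chunking are all placeholders rather than anything from midtools:

```python
import os
import numpy as np
from multiprocessing import Pool

def per_pixel_func(series):
    """Placeholder for the real per-pixel calculation along the time axis."""
    return series.mean()

def _worker(chunk):
    # Apply the per-pixel function to every column (pixel) of this chunk.
    return np.array([per_pixel_func(chunk[:, i]) for i in range(chunk.shape[1])])

def apply_along_axis_mp(data, processes=None):
    """Rough multiprocessing equivalent of np.apply_along_axis(func, 0, data)
    for a (time, y, x) stack: split the pixels across worker processes."""
    processes = processes or os.cpu_count()
    flat = data.reshape(data.shape[0], -1)          # (time, n_pixels)
    chunks = np.array_split(flat, processes, axis=1)
    with Pool(processes) as pool:
        results = pool.map(_worker, chunks)
    return np.concatenate(results).reshape(data.shape[1:])

if __name__ == "__main__":
    frames = np.random.random((500, 128, 128))      # toy detector stack
    print(apply_along_axis_mp(frames).shape)        # -> (128, 128)
```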
I agree it's not exactly a 'fair' comparison if Dask is having a bad day and we haven't yet learned how to optimise it. But part of what I'm thinking is to have a benchmark for what speed we can expect. One of the tough things with performance is just knowing how fast things should be. If a calculation takes 10 minutes, is that because loading the data and doing the work takes that long, or is it doing a load of unnecessary work? Will a simple fix make it 10x faster, or will it take 2 days of optimisations to get 10% improvement?
I have now been working quite intensively with Dask and xarray for some weeks, and what I find really useful is the "simplicity" for the "end-user". Of course it is not really simple, and I may not be the typical user, but having, for example, the coordinates of the xarray data objects and the easy way of setting up a SLURMCluster with Dask is quite useful.
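For reference, a sketch of the kind of setup meant here (`SLURMCluster` is from `dask_jobqueue`, but the partition, resources and data shapes below are invented for illustration):

```python
import xarray as xr
import dask.array as da
from dask.distributed import Client
from dask_jobqueue import SLURMCluster

# Illustrative SLURM setup; the queue, resources and job count are placeholders.
cluster = SLURMCluster(queue="exfel", cores=32, memory="128GB", walltime="02:00:00")
cluster.scale(jobs=4)
client = Client(cluster)

# Labelled, chunked data: the coordinates travel with the array through the
# whole computation, which is the "end-user simplicity" mentioned above.
frames = xr.DataArray(
    da.random.random((10000, 512, 512), chunks=(500, 512, 512)),
    dims=("trainId", "y", "x"),
    coords={"trainId": range(10000)},
)
azimuthal_mean = frames.mean(dim=("y", "x")).compute()
```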
I see your point that a simple `apply_along_axis` may be easy to implement with multiprocessing, but in the end there will be a lot of details that change from project to project, and you might end up in a situation where you are actually writing something like Dask from scratch, because each request from an instrument or user group is a bit different from the last one.
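For comparison with the multiprocessing sketch above, the Dask side of the same pattern is essentially one call to `da.apply_along_axis` (again a sketch with made-up shapes and a placeholder per-pixel function):

```python
import dask.array as da

def per_pixel_func(series):
    # Placeholder for the real per-pixel calculation, e.g. a correlation.
    return series.mean()

frames = da.random.random((500, 128, 128), chunks=(500, 32, 32))

# Dask's drop-in for np.apply_along_axis: the per-pixel function is applied
# along the time axis of each chunk without any hand-written scheduling code.
result = da.apply_along_axis(per_pixel_func, 0, frames, dtype=float, shape=())
print(result.compute().shape)  # -> (128, 128)
```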
As Robert mentioned, there is a lot of potential for improvement both on the user-code side and on the Dask-graph side. I think it is worth exploring this a bit more rather than starting completely from scratch.
The task graph for the SAXS compute step can be seen here:
Near the end, a substantial number of connections are being made between workers, and it looks like a lot of transfers are occurring. Since a common error (at least in my experience) is about connection timeouts (as per #5), it's possible that the volume of these connections is leading to network communication issues/port conflicts.
It might be worth investigating why so many transfers are scheduled in the task graph, and how to optimise this.
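Two things that could be checked as a starting point (a sketch, not a fix: `distributed.comm.timeouts.connect` is a real `distributed` config key, but the timeout value, shapes and rechunk pipeline below are arbitrary):

```python
import dask
import dask.array as da

# Raise the worker-to-worker connection timeout to see whether the failures
# in #5 are just slow connections rather than genuinely dead workers.
dask.config.set({"distributed.comm.timeouts.connect": "120s"})

# Inspect how large the graph actually is before submitting it; a rechunk
# that reshuffles data across all workers (as below) is a typical source of
# the many transfer edges visible at the end of the task graph.
frames = da.random.random((10000, 512, 512), chunks=(100, 512, 512))
pipeline = frames.rechunk((10000, 64, 64)).mean(axis=0)
print(len(pipeline.__dask_graph__()), "tasks in the graph")

# pipeline.visualize("saxs-graph.svg") renders the graph for inspection
# (requires graphviz to be installed).
```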