jcus0006 / mtdcovabm

Distributed Covid 19 Agent Based Model modelled on Maltese data.

Meeting Points 19/10/2023 #19

Open jcus0006 opened 10 months ago

jcus0006 commented 10 months ago

Since we last spoke, I did some further preliminary tests.

Contrary to what I had reported during our last meeting, the difference between the distributed setup and the multiprocessing setup (with the latter being much faster) was not purely down to being limited to fewer resources because of having to set up Linux VMs on Windows hosts. It was mainly due to the overhead of using Dask while splitting the work into many "smaller" tasks: this was initially thought to be better, and it does use less memory, but it also loses a lot of time in communication.

Having said that, I did find a few other areas of optimisation that saved more time. One was splitting the workload as evenly as possible over the available worker nodes, and hence sending fewer tasks. I also pre-load as much of the static data as possible onto the workers at the beginning, and send as little data as possible during the actual runtime.
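The even split described above amounts to partitioning the agent workload into one roughly equal chunk per worker (the static data itself could be shipped once up front, e.g. with `dask.distributed`'s `Client.scatter(..., broadcast=True)`). A minimal sketch of the chunking, with hypothetical names not taken from the actual codebase:

```python
def split_evenly(items, num_workers):
    """Partition items into num_workers chunks whose sizes differ by at
    most 1, so each worker receives a single, roughly equal task."""
    base, extra = divmod(len(items), num_workers)
    chunks, start = [], 0
    for i in range(num_workers):
        size = base + (1 if i < extra else 0)  # first `extra` chunks get one more item
        chunks.append(items[start:start + size])
        start += size
    return chunks

# e.g. 10 agents over 4 workers -> chunk sizes [3, 3, 2, 2]
chunks = split_evenly(list(range(10)), 4)
```

Each chunk would then be submitted as a single task, which is what reduces the number of tasks (and hence the communication overhead) compared to many small ones.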

With 4 workers on the same node: 280-320 secs. With 6 workers: 258.35 secs.

This is much better than before, when the distributed setup was consistently much slower than the single-node run.

The single-node run with no multiprocessing and no Dask took 385.93s. Multiprocessing took 160.6s (with 4 workers).

However, I don't think we can easily compare multiprocessing to Dask: 1) Dask adds extra overhead to assign tasks to different workers over a network, while multiprocessing manages this communication internally; 2) multiprocessing, while faster in the context of a single node, cannot scale beyond that single node.

I have now finally upgraded the other 2 machines I have to 16 GB of RAM, so I could set up a Linux VM on each of them with 6 logical processors and 10 GB of RAM, and run 4 worker processes on each of these machines.

I ran 12 worker processes in all on this cluster, and got: 218.19 seconds.

Considerable time is lost waiting on the least performant node, because the setup I am using now is closer to the initial one I was using with multiprocessing, where the whole load is split over the available processes at once: if there are 12 processes, the whole load is split, as evenly as possible, over the 12 processes. This strategy does not take the performance of each node into account when assigning the workload. And because it sends a single task to each worker process, there cannot be any dynamic load balancing within the same execution of the same function (e.g. itinerary for day 1). What I can think of is adjusting the load according to the performance of each node before assigning the work (i.e. load balancing, but not dynamically at runtime). This is something I can probably try rather quickly.
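The static, performance-aware assignment described above could be sketched as a weighted partition: each worker's chunk size is proportional to a speed estimate measured beforehand (e.g. agents processed per second on a previous day). Names and the speed metric are illustrative assumptions, not the actual implementation:

```python
def split_by_speed(items, worker_speeds):
    """Partition items proportionally to each worker's measured speed,
    so slower nodes receive smaller chunks up front (static balancing,
    decided before dispatch rather than dynamically at runtime)."""
    total_speed = sum(worker_speeds)
    n = len(items)
    # ideal fractional share per worker, rounded down
    sizes = [int(n * s / total_speed) for s in worker_speeds]
    # hand the rounding remainder to the fastest workers
    remainder = n - sum(sizes)
    fastest_first = sorted(range(len(worker_speeds)),
                           key=lambda i: worker_speeds[i], reverse=True)
    for i in fastest_first[:remainder]:
        sizes[i] += 1
    chunks, start = [], 0
    for size in sizes:
        chunks.append(items[start:start + size])
        start += size
    return chunks
```

For example, a node measured twice as fast as each of two others would receive half of the agents, with the rest split between the slower two.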

Tomorrow I am going to get access to 2 more machines from work; I will check their specs and see whether we can use them. I also have a MacBook Pro 2013 with an i7 2.3 GHz and 16 GB of DDR3, which I might be able to use in the same cluster (I will test whether I can SSH into it). I might also ask you for the machine you offered to lend me. I am not sure how many machines you want me to use. My switch has 5 ports, but a colleague offered me a gigabit switch with 8 ports. With the machines mentioned, I would have 7; perhaps I can get another machine to tally up to 8.

In the meantime, I am going to finish the distributed implementation for the contact network. I am also going to implement a way to load the performance diagnostics of the cluster into log files (until now I had been managing these mostly in a custom way). I think it's time to be more accurate and use what Dask offers out of the box.
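On the Dask side, `dask.distributed` ships a `performance_report` context manager that dumps the scheduler's diagnostics to an HTML file. For the custom per-stage timings being moved into log files, a minimal stdlib sketch (file path and stage names are hypothetical) could look like:

```python
import logging
import time
from contextlib import contextmanager

def make_stage_logger(path):
    """Return a logger that appends timestamped records to the given log file."""
    logger = logging.getLogger("cluster_diagnostics")
    logger.setLevel(logging.INFO)
    handler = logging.FileHandler(path)
    handler.setFormatter(logging.Formatter("%(asctime)s %(message)s"))
    logger.addHandler(handler)
    return logger

@contextmanager
def timed_stage(logger, stage_name):
    """Log the wall-clock duration of one simulation stage (e.g. 'itinerary day 1')."""
    start = time.perf_counter()
    try:
        yield
    finally:
        logger.info("%s took %.2fs", stage_name, time.perf_counter() - start)
```

Wrapping each per-day stage in `timed_stage` would give a persistent, uniform record to compare runs, alongside whatever Dask's own reports capture.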

There is much more to report on, but I think this is more than enough for now.
