jcus0006 opened this issue 11 months ago
The first version of Dask with multiprocessing is complete. Proper testing is now required, especially with the full population. More fine-grained time logging and error handling are needed inside the remote method that delegates the multiprocessing. Error handling must take into consideration that both the Dask workers and the multiprocessing processes might fail. Dask_workers_time_taken and Dask_mp_processes_time_taken can be used for logging; until now the former has only been used for load-balancing purposes, while the latter has not been used at all.
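As a rough sketch of what that timing and error handling could look like (remote_method, the tasks mapping of id → picklable callable, and the 600-second timeout are all hypothetical, not the actual code):

```python
import multiprocessing as mp
import queue
import time
import traceback

def _mp_worker(task_id, task, result_queue):
    # Runs inside a multiprocessing process spawned on a Dask worker.
    # Catch everything here: an uncaught exception would otherwise show
    # up only as a dead process with no traceback on the client side.
    start = time.time()
    try:
        result = task()  # task is assumed to be a picklable callable
        result_queue.put(("ok", task_id, result, time.time() - start))
    except Exception:
        result_queue.put(("err", task_id, traceback.format_exc(), time.time() - start))

def remote_method(tasks):
    # Submitted to a Dask worker (e.g. via client.submit); fans the work
    # out to local processes and times each one individually.
    worker_start = time.time()
    result_queue = mp.Queue()
    procs = [mp.Process(target=_mp_worker, args=(tid, task, result_queue))
             for tid, task in tasks.items()]
    for p in procs:
        p.start()

    results, errors, mp_times = [], [], {}
    for _ in procs:
        try:
            status, task_id, payload, elapsed = result_queue.get(timeout=600)
        except queue.Empty:
            # A process that never reports back (e.g. killed by the OOM
            # killer) would otherwise block this loop forever.
            errors.append((None, "a process produced no result"))
            continue
        mp_times[task_id] = elapsed
        (results if status == "ok" else errors).append((task_id, payload))

    for p in procs:
        p.join()

    return results, errors, mp_times, time.time() - worker_start
```

The per-process timings returned here could feed Dask_mp_processes_time_taken, and the final wall time Dask_workers_time_taken.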
The second strategy seemed to offer better results for the itinerary but slightly worse results for the contact network. This could be because the itinerary requires sending much more data, and that data is now being sent to the nodes rather than to the workers.
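For illustration, assuming a dask.distributed Client, the difference between the two placements might look like this (the scheduler address and itinerary_data are hypothetical):

```python
from dask.distributed import Client

client = Client("tcp://scheduler:8786")  # hypothetical scheduler address

# Per-worker placement: broadcast a copy of the data to every worker.
itinerary_future = client.scatter(itinerary_data, broadcast=True)

# Per-node placement: pick one worker address per host and scatter only
# to those, so the data lands once per node rather than once per worker.
one_worker_per_host = {}
for addr, info in client.scheduler_info()["workers"].items():
    one_worker_per_host.setdefault(info["host"], addr)

itinerary_future = client.scatter(itinerary_data,
                                  workers=list(one_worker_per_host.values()))
```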
The idea of a hybrid approach, whereby the tourists are maintained as a multiprocessing.Manager().dict(), seems to work well in general and initially seemed to provide better timings for the contact network. However, when running it, around the 10/11-day mark the processes started being killed due to running out of memory. Not yet sure why; some ideas remain to be tried out. A sketch of the approach follows below.
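For reference, a minimal sketch of that hybrid approach as described, assuming hypothetical load_tourists and partition_agent_ids helpers:

```python
import multiprocessing as mp

def contact_network_worker(tourists, agent_ids):
    # Every read from a Manager dict is a proxy round-trip that
    # deserialises a full copy of the value into this process; copies
    # accumulating per process are one plausible source of the
    # 10/11-day out-of-memory kills, though this is unconfirmed.
    for agent_id in agent_ids:
        tourist = tourists.get(agent_id)
        if tourist is not None:
            pass  # build the contact network entries for this tourist

if __name__ == "__main__":
    manager = mp.Manager()
    tourists = manager.dict()            # shared across the processes
    tourists.update(load_tourists())     # hypothetical loader
    chunks = partition_agent_ids()       # hypothetical partitioner

    procs = [mp.Process(target=contact_network_worker, args=(tourists, chunk))
             for chunk in chunks]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```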
Also found out that the second strategy splits the work equally across the available nodes, without considering the number of workers in each node. This must be fixed as soon as possible. It is supposedly fixed now, but requires testing; see the sketch below.
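Presumably the fix amounts to weighting each node's share by its worker count rather than splitting evenly per node; a hypothetical sketch:

```python
from dask.distributed import Client

def split_by_worker_count(tasks, client):
    # Group worker addresses by host, then give each host a share of the
    # tasks proportional to how many workers it runs.
    hosts = {}
    for addr, info in client.scheduler_info()["workers"].items():
        hosts.setdefault(info["host"], []).append(addr)

    total_workers = sum(len(addrs) for addrs in hosts.values())
    assignments, start = {}, 0
    for host, addrs in hosts.items():
        share = round(len(tasks) * len(addrs) / total_workers)
        assignments[host] = tasks[start:start + share]
        start += share
    # Any remainder left over from rounding goes to the last host.
    if start < len(tasks):
        assignments[host] += tasks[start:]
    return assignments
```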
An issue also occurred in one of the runs whereby a "pairid" being deleted from the potential contacts dict was not found. A lot of logging was put in place to try to reproduce this issue, to no avail. These logs may themselves have caused a performance degradation, as they grew to hundreds of megabytes. The plan is to retry without the added logs, and hopefully the issue never re-occurs.
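If the root cause cannot be reproduced, one defensive option is to make the deletion tolerant of a missing key and keep the diagnostic at DEBUG level, so the log volume stays negligible by default (remove_pair here is a hypothetical helper):

```python
import logging

logger = logging.getLogger(__name__)

def remove_pair(potential_contacts, pairid):
    # dict.pop with a default never raises KeyError, unlike `del`; the
    # anomaly is still recorded, but only when DEBUG logging is enabled.
    if potential_contacts.pop(pairid, None) is None:
        logger.debug("pairid %s already absent from potential contacts", pairid)
```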
Also established the strategies that we may focus on:
With the actor-based strategy, the plan is to implement something along the lines of the sketch below:
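As an illustration of that direction, a stateful Dask actor holding the tourists dict might look roughly like the following; TouristStore, initial_tourists, and the scheduler address are all hypothetical:

```python
from dask.distributed import Client

class TouristStore:
    # Lives on a single worker; other tasks send it messages instead of
    # shipping copies of the whole tourists dict around the cluster.
    def __init__(self, tourists):
        self.tourists = tourists

    def get(self, agent_id):
        return self.tourists.get(agent_id)

    def update(self, agent_id, state):
        self.tourists[agent_id] = state

client = Client("tcp://scheduler:8786")  # hypothetical scheduler address
store = client.submit(TouristStore, initial_tourists, actor=True).result()

# Method calls run remotely on the hosting worker and return ActorFutures.
tourist = store.get(42).result()
store.update(42, tourist).result()
```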