jcus0006 / mtdcovabm

Distributed Covid 19 Agent Based Model modelled on Maltese data.

Meeting 02/12/2023 - Established Strategies #27

Open jcus0006 opened 9 months ago

jcus0006 commented 9 months ago

Also established the strategies that we may focus on:

  1. multi-processing
  2. dask distributed with isolated workers
  3. dask distributed with multiprocessing
  4. dask actor based

With the actor-based strategy, the idea is to implement something along these lines (a minimal sketch follows the list):

  1. workers are stateful
  2. load assignment happens at the beginning
  3. have weights for workers: an agents weight, a cells weight (and/or others)
  4. workers share state information between them without going back to the client (each worker will know where each cell is from the very beginning)
  5. to consider whether to have a similar concept of shared memory across workers (like strategy 3)
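Below is a minimal sketch of how the actor-based strategy could look with Dask Actors. The names `SimWorker`, `assign_load`, and `run_day`, as well as the scheduler address, are assumptions for illustration, not the actual model code; worker-to-worker state sharing (point 4) is not shown.

```python
from dask.distributed import Client

class SimWorker:
    """Stateful actor that keeps its assigned agents/cells in worker memory."""
    def __init__(self, worker_id):
        self.worker_id = worker_id
        self.agents = {}   # agent_id -> agent state
        self.cells = {}    # cell_id -> cell state

    def assign_load(self, agents, cells):
        # Load assignment happens once, at the beginning.
        self.agents.update(agents)
        self.cells.update(cells)
        return len(self.agents), len(self.cells)

    def run_day(self, day):
        # Itinerary / contact-network work for this worker's share of the load.
        return {"worker": self.worker_id, "day": day, "processed": len(self.agents)}

if __name__ == "__main__":
    client = Client("tcp://scheduler:8786")  # assumed scheduler address
    # One actor per Dask worker; actor=True pins each object to a worker.
    actors = [client.submit(SimWorker, i, actor=True).result() for i in range(4)]
    futures = [a.run_day(1) for a in actors]   # ActorFutures
    results = [f.result() for f in futures]
```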
jcus0006 commented 9 months ago

First version of Dask with multiprocessing is complete. Proper testing is now required, especially with the full population. More fine-grained time logging and error handling are needed inside the remote method that delegates the multiprocessing. Error handling must take into consideration that both the Dask workers and the multiprocessing processes might fail. Dask_workers_time_taken and Dask_mp_processes_time_taken can be used for logging; right now the former is used for load-balancing purposes and the latter is not used at all.
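As a hedged sketch of the kind of fine-grained timing and error handling meant here, assuming a hypothetical worker-side entry point `run_on_worker` that fans out to a multiprocessing pool (`do_work`, the partitioning, and the returned dictionary shape are illustrative, not the actual code):

```python
import multiprocessing as mp
import time
import traceback

def do_work(partition):
    # Placeholder for the real itinerary / contact-network processing.
    return len(partition)

def process_partition(partition):
    start = time.time()
    try:
        result = do_work(partition)
        return {"ok": True, "result": result, "time_taken": time.time() - start}
    except Exception:
        # A failing multiprocessing process reports back instead of silently dying.
        return {"ok": False, "error": traceback.format_exc(), "time_taken": time.time() - start}

def run_on_worker(partitions, num_processes=3):
    """Remote method submitted to a Dask worker; delegates to mp processes."""
    worker_start = time.time()
    with mp.Pool(processes=num_processes) as pool:
        process_results = pool.map(process_partition, partitions)
    return {
        "results": [r["result"] for r in process_results if r["ok"]],
        "failures": [r["error"] for r in process_results if not r["ok"]],
        "dask_workers_time_taken": time.time() - worker_start,  # usable for load balancing
        "dask_mp_processes_time_taken": [r["time_taken"] for r in process_results],
    }
```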

jcus0006 commented 9 months ago

The second strategy seemed to offer better results for the itinerary but slightly worse results for the contact network. This could be related to the fact that the itinerary required sending much more data, and now it is being sent to the nodes rather than to the workers.

The idea to work on a hybrid approach whereby the tourists are maintained in a multiprocessing.Manager().dict() seems to work well in general, and initially seemed to provide better timings for the contact network. However, when running it, around the day 10/11 mark the processes started being killed due to running out of memory. Not yet sure why. Some ideas to try out:

  1. Log general, worker-level, and process-level memory usage after each day (using psutil; see the sketch after this list). The actual processing seems to generate a lot of unmanaged memory that Dask does not reclaim as quickly as it should, particularly for the processes started via the multiprocessing pool, and there does not seem to be an easy way to get hold of that memory again. I am now trying client.restart(), and it will be timed, since it could be counter-productive if restarting the cluster takes a long time. It could also be that the client is bloating rather than the workers; this must be considered as well.
  2. This issue is likely happening on the main node because its memory usage is growing large: the Dask worker combined with the 3 multiprocessing processes is using more than the 20 GB of available memory. Might need to find ways to keep less memory on the client, or use fewer processes on the main node. The former is somewhat out of scope for the rest of the project; the latter is a possibility, but first see how client.restart() affects all this.
  3. Try to explicitly release any temporary memory in all worker methods, while making sure this does not affect the data that is sent back to the client. There did not seem to be many such cases.
  4. Important - sharing the cell data as a "global" variable copies the object into each process in the normal dict scenarios (i.e. the non-multiprocessing.RawArray cases). It may make more sense to re-load each structure from the JSON file into memory within each process, because this allows disposing of the structures as soon as they are no longer needed (which is not possible with the global variables). This has not been tried yet and I am not sure the effort is justified; it would need a clear timing of the work assignment, to be compared against loading from the JSON files. It might be worth a shot if I have the time.
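
For idea 1, a minimal psutil-based sketch of per-day memory logging; the function name and the way it would be wired into the simulation loop are assumptions:

```python
import os
import psutil

def log_memory_usage(day, logger=print):
    # Node-level memory.
    vm = psutil.virtual_memory()
    logger(f"day {day}: node used={vm.used / 1e9:.2f}GB ({vm.percent}%), "
           f"available={vm.available / 1e9:.2f}GB")
    # Current process plus the multiprocessing pool children it spawned.
    current = psutil.Process(os.getpid())
    for proc in [current] + current.children(recursive=True):
        try:
            rss = proc.memory_info().rss / 1e9
            logger(f"day {day}: pid={proc.pid} name={proc.name()} rss={rss:.2f}GB")
        except psutil.NoSuchProcess:
            pass  # child may have already exited
```

Called on the client after each day, and pushed to every Dask worker via client.run(log_memory_usage, day), this should show whether the growth is on the client, the Dask workers, or the multiprocessing children.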

Also found out that the second strategy splits the work equally across the available nodes, without considering the number of workers in each node. This must be fixed as soon as possible; it is supposedly fixed now but still requires testing. A sketch of a proportional split is below.
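A possible sketch of the per-node split, weighting each node by its number of Dask workers as reported by client.scheduler_info(); `partition_by_node` and the host parsing are assumptions, not the existing code:

```python
from collections import defaultdict
from dask.distributed import Client

def partition_by_node(client: Client, items):
    # Count workers per host from the scheduler's view of the cluster.
    workers_per_node = defaultdict(int)
    for addr in client.scheduler_info()["workers"]:
        host = addr.split("://")[-1].rsplit(":", 1)[0]
        workers_per_node[host] += 1
    total_workers = sum(workers_per_node.values())
    # Hand each node a share of the items proportional to its worker count.
    assignments, start = {}, 0
    for host, n_workers in workers_per_node.items():
        share = round(len(items) * n_workers / total_workers)
        assignments[host] = items[start:start + share]
        start += share
    if start < len(items):
        assignments[host].extend(items[start:])  # remainder goes to the last node
    return assignments
```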

Also, an issue occurred in one of the runs whereby a "pairid" being deleted from the potential contacts dict was not found. A lot of logging was put in place to try to reproduce the issue, to no avail. These logs might also have been degrading performance, as they ended up in the magnitude of hundreds of megabytes. Will retry without the added logs and hopefully it never re-occurs. A small defensive sketch is below.
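
Not part of the established strategies, but as a small defensive sketch for the missing "pairid" case: deleting with dict.pop() and a default avoids the KeyError and records just the offending ids, without hundreds of megabytes of verbose logs. `potential_contacts`, `pairid`, and `remove_pair` are assumed names.

```python
missing_pairids = []

def remove_pair(potential_contacts: dict, pairid):
    # pop with a default never raises; a missing key is recorded for later inspection.
    removed = potential_contacts.pop(pairid, None)
    if removed is None:
        missing_pairids.append(pairid)
    return removed
```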