jcus0006 / mtdcovabm

Distributed Covid 19 Agent Based Model modelled on Maltese data.

Meeting 02/12/2023 - Established Strategies #27

Open jcus0006 opened 11 months ago

jcus0006 commented 11 months ago

We also established the strategies that we may focus on:

  1. multi-processing
  2. dask distributed with isolated workers
  3. dask distributed with multiprocessing
  4. dask actor based

With the actor-based approach, the idea is to implement something along the following lines (a minimal sketch follows the list):

  1. workers are stateful
  2. load assignment happens at the beginning
  3. have weights for workers, agents weight, cells weight (and/or others)
  4. workers share state information between them without going back to the client (each worker will know where each cell is from the very beginning)
  5. to consider whether to have a similar concept of shared memory across workers (like strategy 3)
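
As a reference point, below is a minimal sketch of what a stateful, actor-based worker could look like with Dask actors; the `RegionWorker` class, its methods, and the partitioning shown are hypothetical and only illustrate points 1-3 above (stateful workers with an up-front, weighted load assignment), not existing project code.

```python
from dask.distributed import Client

class RegionWorker:
    """Stateful actor that keeps its assigned agents and cells in worker memory."""

    def __init__(self, agent_ids, cell_ids):
        # load assignment happens once, at the beginning (point 2); weights per
        # worker would decide how many agents/cells each actor gets (point 3)
        self.agent_ids = agent_ids
        self.cell_ids = cell_ids
        self.state = {}  # state that persists across simulation days (point 1)

    def simulate_day(self, day):
        # placeholder for the real itinerary / contact-network work on this partition
        self.state[day] = len(self.agent_ids)
        return {"day": day, "agents_processed": len(self.agent_ids)}

if __name__ == "__main__":
    # a local cluster is used here for illustration; the real run would connect
    # to the distributed scheduler instead
    client = Client(processes=False)

    # one actor per worker; actor method calls return ActorFutures
    actor = client.submit(RegionWorker, [1, 2, 3], [10, 11], actor=True).result()
    print(actor.simulate_day(0).result())

    client.close()
```
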
jcus0006 commented 11 months ago

The first version of Dask with multiprocessing is complete. Proper testing is now required, especially with the full population. More fine-grained time logging and error handling are needed inside the remote method that delegates the multiprocessing; the error handling must take into account that both the Dask workers and the multiprocessing processes might fail. Dask_workers_time_taken and Dask_mp_processes_time_taken can be used for logging; currently the former is used for load-balancing purposes and the latter is not used at all.
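
Below is a hedged sketch of the kind of fine-grained timing and error handling intended for the remote method that delegates to the multiprocessing pool; `run_on_worker`, `_process_partition` and the returned values are illustrative names, not the actual project code.

```python
import time
import traceback
import multiprocessing as mp

def _process_partition(partition):
    # work done by one multiprocessing process; returns its own elapsed time
    start = time.time()
    result = len(partition)  # placeholder for the real per-process work
    return result, time.time() - start

def run_on_worker(partitions, num_processes=3):
    worker_start = time.time()
    results, process_times, errors = [], [], []
    try:
        with mp.Pool(processes=num_processes) as pool:
            async_results = [pool.apply_async(_process_partition, (p,)) for p in partitions]
            for ar in async_results:
                try:
                    result, elapsed = ar.get()  # re-raises if the child process failed
                    results.append(result)
                    process_times.append(elapsed)  # cf. Dask_mp_processes_time_taken
                except Exception:
                    errors.append(traceback.format_exc())
    except Exception:
        # the Dask-worker side itself failed (e.g. the pool could not be created)
        errors.append(traceback.format_exc())
    worker_time = time.time() - worker_start  # cf. Dask_workers_time_taken
    return results, process_times, worker_time, errors

if __name__ == "__main__":
    print(run_on_worker([[1, 2], [3, 4, 5], [6]]))
```
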

jcus0006 commented 11 months ago

The second strategy seemed to offer better results for the itinerary but slightly worse results for the contact network. This could be related to the fact that the itinerary required sending much more data, and now it is being sent to the nodes rather than to the workers.

The idea of a hybrid approach whereby the tourists are maintained in a multiprocessing.Manager().dict() seems to work well in general, and initially seemed to provide better timings for the contact network (a minimal sketch of the shared tourists dict is included after the list below). However, when running it, around the 10/11 day mark the processes started being killed for running out of memory. It is not yet clear why. Some ideas to try out:

  1. Log general, worker and process level memory usage after each day using psutil (see the memory-logging sketch after this list). The actual processing appears to generate a lot of unmanaged memory that Dask does not reclaim as quickly as it should, particularly for the processes started via the multiprocessing pool, and there does not seem to be an easy way to get hold of this memory again. client.restart() is now being tried and will be timed, since it would be counter-productive if restarting the cluster takes a long time. It could also be that the client, rather than the workers, is bloating; this must be considered.
  2. This issue is likely happening on the main node, because its memory usage grows large and the Dask worker combined with the 3 multiprocessing processes use up more than the 20 GB of available memory. It may be necessary to maintain less memory on the client, or to use fewer processes on the main node. The former is somewhat out of scope for the rest of the project; the latter is a possibility, but first see how client.restart() affects all of this.
  3. Try to explicitly release any temporary memory from all worker methods, while making sure this does not affect the data that is sent back to the client. There did not seem to be many such cases.
  4. Important - sharing the cell data as a "global" variable copies the object over to each process in the normal dict scenarios (the non-multiprocessing.RawArray cases). It may make more sense to re-load each structure from the JSON file into memory, because that would allow the structures to be disposed of as soon as they are no longer needed, which is not possible with the global variables. This has not been tried yet, and it is not clear the effort is justified; it would require a clear timing of the work assignment to compare against loading from the JSON files. It might be worth a shot if time permits.
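
For idea 1, here is a minimal sketch of per-day memory logging with psutil and of timing client.restart(); the helper names are assumptions, and the client-side calls are shown as comments since they depend on the existing simulation loop.

```python
import time
import psutil

def log_client_rss(label):
    # RSS of the current (client) process, in MB
    rss_mb = psutil.Process().memory_info().rss / (1024 ** 2)
    print(f"{label}: rss={rss_mb:.1f} MB")

def worker_rss_mb():
    # intended to be run on every Dask worker via client.run()
    return psutil.Process().memory_info().rss / (1024 ** 2)

# at the end of each simulated day (assuming `client` and `day` already exist):
#   log_client_rss(f"client, day {day}")
#   print("worker rss (MB):", client.run(worker_rss_mb))
#
# and, when trying the restart idea, time how long the restart itself takes:
#   start = time.time()
#   client.restart()
#   print(f"client.restart() took {time.time() - start:.1f}s")
```
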
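
For the hybrid approach mentioned above the list, this is a minimal sketch of sharing a tourists dict through multiprocessing.Manager() across pool processes; the record fields and the update logic are purely illustrative.

```python
import multiprocessing as mp

def update_tourist(shared_tourists, tourist_id, day):
    # each pool process reads and writes the shared proxy instead of a local copy
    record = shared_tourists.get(tourist_id, {})
    record["last_seen_day"] = day
    shared_tourists[tourist_id] = record  # reassign so the proxy registers the change

if __name__ == "__main__":
    manager = mp.Manager()
    tourists = manager.dict({1: {"arrival_day": 0}, 2: {"arrival_day": 3}})

    with mp.Pool(processes=3) as pool:
        pool.starmap(update_tourist, [(tourists, tid, 10) for tid in tourists.keys()])

    print(dict(tourists))
```
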

It also turned out that the second strategy splits the work equally across the available nodes, without considering the number of workers in each node. This must be fixed as soon as possible; it is supposedly fixed now but still requires testing.
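
A hedged sketch of how the split could weight each node by its number of Dask workers, using client.scheduler_info(); the partitioning helper below is an assumption for illustration, not the actual fix.

```python
from collections import Counter

def workers_per_node(client):
    # scheduler_info() lists every worker together with its host, so the
    # number of workers per node can be counted from it
    workers = client.scheduler_info()["workers"]
    return Counter(info["host"] for info in workers.values())

def split_by_workers(items, client):
    counts = workers_per_node(client)
    total = sum(counts.values())
    hosts = list(counts)
    shares, start = {}, 0
    for i, host in enumerate(hosts):
        # the last host takes the remainder so rounding never drops items
        end = len(items) if i == len(hosts) - 1 else start + round(len(items) * counts[host] / total)
        shares[host] = items[start:end]
        start = end
    return shares

# usage (assuming a connected `client` and a list of cells/agents to assign):
#   shares = split_by_workers(cell_ids, client)
```
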

An issue also occurred in one of the runs whereby a "pairid" being deleted from the potential contacts dict was not found. A lot of logs were put in place to try to reproduce the issue, to no avail. These logs may also have caused a performance degradation, since they ended up being in the magnitude of hundreds of megabytes. The plan is to retry without the added logs, and hopefully the issue never re-occurs.
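
As one possible mitigation (not the established fix), a defensive deletion would avoid the KeyError while keeping a single lightweight log line; `potential_contacts` and `pairid` are illustrative names based on the description above.

```python
def remove_pair(potential_contacts, pairid):
    # pop with a default avoids the KeyError seen when the pair id is already gone
    removed = potential_contacts.pop(pairid, None)
    if removed is None:
        # one lightweight log line instead of hundreds of megabytes of tracing
        print(f"pairid {pairid} was already absent from the potential contacts dict")
    return removed
```
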