jcus0006 / mtdcovabm

Distributed Covid 19 Agent Based Model modelled on Maltese data.

Meeting 06/11/2023 #22

Open jcus0006 opened 10 months ago

jcus0006 commented 10 months ago

Some updates:

  1. Made several attempts to reduce memory usage and found a number of issues along the way. Introduced a "start day" marker that removes the largely redundant "day" field from the directcontacts array and instead keeps the day ranges as a dict, e.g. {1: 0, 2: 3560}, meaning that day 1 starts at index 0 and day 2 starts at index 3560. Then implemented a clean-up of the array, which means that the start day index, as well as the agent1 and agent2 indexes, need to be re-computed after removing the first "n" elements. Also fixed some bugs found in the contact tracing base method along the way. After all this, the run was taking a rather long time. Initially I thought the cause was the clean-up and the re-computation of the indexes, but that only takes around 20-30 seconds, which is probably acceptable in the grand scheme of things. The time is being lost in the binary search, not because the search itself is slow, but because of how large the directcontacts array becomes. For example, up until day 2 the array had 4.3 million contacts and was 500 MB in size.
  2. This was tested deliberately with 48 hours of contact tracing, which requires the current day and the 2 previous days to be persisted. The clean-up needs some intermediate working memory, and it seems to use up all the memory available on the client/scheduler node while it runs. Another attempt will be made using 24 hours of contact tracing, which should in theory delay the point at which memory usage gets out of hand.
  3. It was also noted that the itinerary takes longer to start than the contact network, probably because of the agents_dynamic structure that is split and sent to all the workers. A review should be carried out to confirm that all the data being sent and returned here is actually necessary.
  4. Fixed an issue with the contact tracing whereby the agents to be contact traced were being added to the contact_tracing_agent_ids set when starting quarantine from the itinerary, but also when scheduling a test result that ends up with the agent being quarantined from the contact network (epidemiological part). The latter was redundant and has since been removed.
  5. With this fix, contact tracing is likely to start around the 8-10 day mark, which makes it impractical to test with the 500k population. The 1k population is not suitable for testing contact tracing either. A 10k population could be a feasible compromise. Otherwise, a workaround needs to be planned out, e.g. checkpointing.
  6. Contact tracing has been re-implemented to use Dask distribution. Whether this is actually necessary requires more testing.
  7. Contact tracing with distribution is likely to be very slow because of the size of the directcontacts array (and the indexes). In this case it might be necessary to send the array once, after which each worker loads it from a file into memory. The same approach could be considered for the other methods too, where it applies.
  8. It might be necessary to persist the directcontacts and index arrays as files at the end of every simulation day to save memory, which otherwise grows very quickly.
  9. I have also changed the implementation of CellType and SimCellType slightly. These were being stored in memory as strings, e.g. "household" and "residence" for cell type and sim cell type respectively. They now use Enums, which should take up less memory.
  10. The other points in the original description are yet to be considered.
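
A minimal sketch of the "start day" marker and the clean-up described in point 1, written with plain Python lists. The tuple layout `(agent1, agent2, start_timestep)`, the function names, and the representation of an agent index as a sorted list of `(agent_id, position)` pairs are assumptions made for illustration; the actual structures in the repository may differ.

```python
from bisect import bisect_left

def contacts_for_day(contacts, day_start, day):
    """Return the contacts recorded on `day`, using the start-day offsets."""
    begin = day_start[day]
    end = day_start.get(day + 1, len(contacts))  # last day runs to the end
    return contacts[begin:end]

def clean_up(contacts, day_start, agent_index, first_day_to_keep):
    """Drop everything before `first_day_to_keep` and shift stored positions."""
    n = day_start[first_day_to_keep]  # number of leading elements removed
    kept = contacts[n:]
    new_day_start = {d: s - n for d, s in day_start.items()
                     if d >= first_day_to_keep}
    # Positions below n point at removed contacts; the rest shift down by n.
    new_index = [(a, p - n) for a, p in agent_index if p >= n]
    return kept, new_day_start, new_index

def positions_for_agent(agent_index, agent_id):
    """Binary search the sorted (agent_id, position) index for one agent."""
    i = bisect_left(agent_index, (agent_id, -1))
    found = []
    while i < len(agent_index) and agent_index[i][0] == agent_id:
        found.append(agent_index[i][1])
        i += 1
    return found

# Hypothetical data: three contacts on day 1, two on day 2.
contacts = [(1, 2, 10), (1, 3, 20), (2, 3, 30), (1, 2, 15), (3, 4, 40)]
day_start = {1: 0, 2: 3}
agent1_index = sorted((c[0], i) for i, c in enumerate(contacts))

kept, new_day_start, new_index = clean_up(contacts, day_start, agent1_index, 2)
```

Shifting by a constant keeps the index sorted, so the binary search still works after the clean-up without a full re-sort. The slicing and the list comprehensions do create temporary copies, which is the kind of working memory mentioned in point 2.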
jcus0006 commented 10 months ago

Tried a run with no workers on the client/scheduler node. Each worker takes around 2 GB of RAM, or maybe even a bit more. By doing this I ensured that the contact tracing is at least attempted. While it didn't crash with an out-of-memory "killed" error, it still crashed around the same point as the first clean-up attempt. This is evident because the logs for the 3rd day of the itinerary and the contact network are present, but the ones for the contact tracing / contacts clean-up are not. There were 2 contacts to be contact traced, and for this reason one worker was going to be started. But the data that had to be sent over was 750 MB for the directcontacts and 450 MB for each of the indexes, which amounts to 1.65 GB.
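
One possible way to avoid shipping 1.65 GB per task, along the lines of the file-based idea in points 7 and 8 above, is to write each day's contacts to a file once and have the tracing code load only the days inside the tracing window. The sketch below uses only the standard library; the file naming, the function names, and the assumption that every worker can read the same folder are hypothetical and not taken from the repository.

```python
import pickle
import tempfile
from pathlib import Path

def save_day(folder, day, contacts):
    """Persist one simulation day's contacts so they can leave memory."""
    path = Path(folder) / f"directcontacts_day_{day}.pkl"
    with path.open("wb") as f:
        pickle.dump(contacts, f, protocol=pickle.HIGHEST_PROTOCOL)
    return path

def load_days(folder, days):
    """Load only the days needed for the contact tracing window."""
    window = []
    for day in days:
        path = Path(folder) / f"directcontacts_day_{day}.pkl"
        with path.open("rb") as f:
            window.extend(pickle.load(f))
    return window

# Hypothetical usage: three days saved, a 48-hour window read back on day 3.
folder = tempfile.mkdtemp()
save_day(folder, 1, [(1, 2, 10), (1, 3, 20)])
save_day(folder, 2, [(2, 3, 30)])
save_day(folder, 3, [(3, 4, 40)])
window = load_days(folder, [2, 3])
```

With per-day files, cleaning up an old day amounts to deleting its file, and the in-memory array never has to hold more than the tracing window.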

The contact tracing stage seems to have run properly and took 337 seconds. It also seems to have run something on every worker, which is unnecessary but did not take much time. However, something happened around this time that left the Dask client unusable, which broke the rest of the process. Some extra logging needs to be put in place to figure out what caused this.