nextsimhub / nextsimdg

neXtSIM_DG : next generation sea-ice model with DG
https://nextsim-dg.readthedocs.io/en/latest/?badge=latest
Apache License 2.0

Memory leak! #254

Open timspainNERSC opened 1 year ago

timspainNERSC commented 1 year ago

In the TOPAZ-ERA5 year-long thermodynamics-only run, memory usage grows by roughly 2 MB per day (from 18.5 MB to 70-some MB over the course of a 31-day run). This implies something is leaking memory.

timspainNERSC commented 1 year ago

A 128x128 double array has a size of 128 KiB, so 8 arrays fit in 1 MiB. A leak of 2 MB per day is therefore roughly 16 arrays per day of run time, or one array every 90 minutes, or every 9 timesteps. This implies that the leak is not an array that is created and leaked every timestep, but either a smaller amount of data leaked every timestep or an array leaked less frequently.
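For anyone who wants to check the arithmetic, here is a small compile-time sketch of the estimate above. The 10-minute timestep is not stated directly; it is inferred from "90 minutes, or every 9 timesteps". All constant names are made up for illustration.

```cpp
#include <cstddef>

// Figures from the comment above. The 10-minute timestep is an inference
// from "one array every 90 minutes, or every 9 timesteps".
static_assert(sizeof(double) == 8, "estimate assumes 8-byte doubles");

constexpr std::size_t MiB = 1024 * 1024;
constexpr std::size_t arrayBytes = 128 * 128 * sizeof(double); // 131072 B = 128 KiB
constexpr std::size_t arraysPerMiB = MiB / arrayBytes;

constexpr std::size_t leakMiBPerDay = 2; // observed growth, treated as MiB
constexpr std::size_t arraysPerDay = leakMiBPerDay * arraysPerMiB;

constexpr std::size_t minutesPerDay = 24 * 60;
constexpr std::size_t minutesPerArray = minutesPerDay / arraysPerDay;

constexpr std::size_t timestepMinutes = 10; // inferred, see above
constexpr std::size_t timestepsPerDay = minutesPerDay / timestepMinutes;
constexpr std::size_t timestepsPerArray = minutesPerArray / timestepMinutes;

// Alternative reading: a small leak on every timestep (integer division).
constexpr std::size_t bytesPerTimestep = leakMiBPerDay * MiB / timestepsPerDay;

static_assert(arraysPerMiB == 8, "8 arrays per MiB");
static_assert(arraysPerDay == 16, "16 arrays per day");
static_assert(minutesPerArray == 90, "one array every 90 minutes");
static_assert(timestepsPerArray == 9, "one array every 9 timesteps");
```

Under the same assumptions, the "smaller amount of data every timestep" alternative works out to roughly 14 KiB per timestep.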

dorchard commented 1 year ago

Valgrind is a tool that can be used for this (see the --leak-check argument in the Quick Start guide).

timspainNERSC commented 1 year ago

And Valgrind doesn't work on modern macOS :(

dorchard commented 1 year ago

Oh noes :( I will have another think...

timspainNERSC commented 1 year ago

I've been using std::cerr and Mach task_info.

Based on this stackexchange post
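A minimal sketch of what such a probe might look like, not the actual code used here. The function names `residentBytes` and `logMemory` are hypothetical. On macOS it queries Mach `task_info` with `MACH_TASK_BASIC_INFO`; the non-Apple branch is only a fallback added so the sketch builds elsewhere, and it reports peak rather than current resident size.

```cpp
#include <cstddef>
#include <iostream>

#if defined(__APPLE__)
#include <mach/mach.h>
#else
#include <sys/resource.h>
#endif

// Hypothetical helper: resident memory of this process in bytes, 0 on failure.
std::size_t residentBytes()
{
#if defined(__APPLE__)
    mach_task_basic_info_data_t info;
    mach_msg_type_number_t count = MACH_TASK_BASIC_INFO_COUNT;
    const kern_return_t kr = task_info(mach_task_self(), MACH_TASK_BASIC_INFO,
        reinterpret_cast<task_info_t>(&info), &count);
    return (kr == KERN_SUCCESS) ? static_cast<std::size_t>(info.resident_size) : 0;
#else
    // Fallback (assumption, not from the thread): peak resident set size.
    // On Linux ru_maxrss is reported in kilobytes.
    struct rusage usage {};
    if (getrusage(RUSAGE_SELF, &usage) != 0) {
        return 0;
    }
    return static_cast<std::size_t>(usage.ru_maxrss) * 1024;
#endif
}

// Hypothetical helper: print a labelled memory reading to std::cerr.
void logMemory(const char* label)
{
    std::cerr << label << ": " << residentBytes() / 1024 << " KiB resident" << std::endl;
}
```

Calling `logMemory("before update")` and `logMemory("after update")` around a suspect call, once per timestep, should be enough to see which calls the growth follows.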

timspainNERSC commented 1 year ago

Most of the memory is leaked in library code.

Half of the apparent leaking occurred when doing whole-array mathematics on ModelArrays. The internet suggests that Eigen is a bit lax at cleaning up some temporary arrays, which the ModelArray maths certainly used. Changing the SlabOcean update to be a per-element calculation removed the leak there, at the cost of some performance.
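To illustrate the shape of that change, here is a toy sketch. `ToyArray` is a made-up stand-in, not ModelArray or Eigen (Eigen uses expression templates and behaves differently), and the relaxation formula is a placeholder rather than the real SlabOcean update. It only shows how a whole-array expression can create temporaries that a per-element loop avoids.

```cpp
#include <cstddef>
#include <vector>

// Counts ToyArray constructions that allocate storage (illustration only).
std::size_t g_toyAllocations = 0;

// Toy stand-in for an array type; NOT ModelArray or Eigen.
struct ToyArray {
    std::vector<double> data;
    explicit ToyArray(std::size_t n, double v = 0.0) : data(n, v) { ++g_toyAllocations; }
};

// Each whole-array operator allocates a new array for its result.
ToyArray operator-(const ToyArray& a, const ToyArray& b)
{
    ToyArray r(a.data.size());
    for (std::size_t i = 0; i < r.data.size(); ++i) r.data[i] = a.data[i] - b.data[i];
    return r;
}

ToyArray operator+(const ToyArray& a, const ToyArray& b)
{
    ToyArray r(a.data.size());
    for (std::size_t i = 0; i < r.data.size(); ++i) r.data[i] = a.data[i] + b.data[i];
    return r;
}

ToyArray operator*(double s, const ToyArray& a)
{
    ToyArray r(a.data.size());
    for (std::size_t i = 0; i < r.data.size(); ++i) r.data[i] = s * a.data[i];
    return r;
}

// Whole-array form: three operator calls, three temporary arrays.
void relaxWholeArray(ToyArray& sst, const ToyArray& target, double rate)
{
    sst = sst + rate * (target - sst);
}

// Per-element form: same arithmetic, no temporary arrays.
void relaxPerElement(ToyArray& sst, const ToyArray& target, double rate)
{
    for (std::size_t i = 0; i < sst.data.size(); ++i) {
        sst.data[i] += rate * (target.data[i] - sst.data[i]);
    }
}
```

In this toy the temporaries are freed correctly, so nothing leaks; the point is only that the per-element version never asks the library for them in the first place.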

The majority of the remaining leak occurs when reading the netCDF forcing files. Again the leak is occurring during library calls, so my code is not directly responsible. Refactoring would seem to be impossible at this point, but I can at least reduce the number of calls by only reading the forcing files when I know the values in ERA or TOPAZ change (once per hour and once per day respectively).
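A rough sketch of the "only read when the values change" idea, assuming model time is available as non-negative seconds since the start of the run. `CachedForcing` and its loader callback are hypothetical names, not part of nextsimdg; the loader stands in for whatever actually performs the netCDF read.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical cache: calls the loader only when the requested time moves
// into a new forcing interval (e.g. 3600 s for ERA, 86400 s for TOPAZ).
class CachedForcing {
public:
    // The loader receives the start time of the interval, in seconds.
    using Loader = std::function<std::vector<double>(std::int64_t)>;

    CachedForcing(std::int64_t periodSeconds, Loader loader)
        : period(periodSeconds)
        , load(std::move(loader))
    {
    }

    // Assumes timeSeconds >= 0.
    const std::vector<double>& get(std::int64_t timeSeconds)
    {
        const std::int64_t index = timeSeconds / period;
        if (!valid || index != cachedIndex) {
            values = load(index * period); // the only place a file read happens
            cachedIndex = index;
            valid = true;
            ++reads;
        }
        return values;
    }

    std::size_t readCount() const { return reads; }

private:
    std::int64_t period;
    Loader load;
    std::vector<double> values;
    std::int64_t cachedIndex = 0;
    bool valid = false;
    std::size_t reads = 0;
};
```

With a 10-minute timestep, one simulated day makes 144 calls to `get`, but an hourly cache would hit the loader 24 times and a daily cache once.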

It has also been suggested by @draenog and @a-smith-github that the memory is not truly leaking, but consists of freed allocations that the OS has not yet reclaimed.