lumia-dev / lumia

https://lumia.nateko.lu.se/
European Union Public License 1.2

Lumia crashes unexpectedly due to an OOM (Out Of Memory) error #1

Open · isbjoernen opened this issue 4 months ago

isbjoernen commented 4 months ago

These last few days I was questioning my sanity after several unexpected crashes of Lumia without any tangible entries in the log files. I was hunting for ghosts, asking myself whether unintended changes in lumia MasterPlus were to blame for the crashes, but after talking to Andre I finally got a lead: in all cases the host machine had killed my jobs with an out-of-memory (OOM) error, which I'm sharing below.

Is there a 'best practice' on how to avoid these or is there a way to monitor the memory usage or limit Lumia's use of memory?

(screenshot: oom-error)

    (LumiaMaster) arndt@skuggfaxe:~/nateko/run/lumiaMaster$ sudo journalctl -g oom -S '2 days ago'
    Jul 02 04:15:22 skuggfaxe kernel: kthreadd invoked oom-killer: gfp_mask=0x102dc2(GFP_HIGHUSER|__GFP_NOWARN|__GFP_ZERO), order=0, oom_score_adj=0
    Jul 02 04:15:22 skuggfaxe kernel: oom_kill_process.cold+0xb/0x10
    Jul 02 04:15:22 skuggfaxe kernel: [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
    Jul 02 04:15:22 skuggfaxe kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1001.slice/session-6.scope,ta>
    Jul 02 04:15:22 skuggfaxe kernel: Out of memory: Killed process 19358 (python) total-vm:17951768kB, anon-rss:15682764kB, file-rss:1652kB, shmem-rss:4kB, UID:1001 pgtables:31520kB>
    Jul 02 04:15:22 skuggfaxe kernel: oom_reaper: reaped process 19358 (python), now anon-rss:0kB, file-rss:0kB, shmem-rss:4kB
    Jul 02 04:15:14 skuggfaxe systemd[1]: session-6.scope: A process of this unit has been killed by the OOM killer.
    Jul 02 07:34:32 skuggfaxe kernel: mutagen-agent invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0
    Jul 02 07:34:32 skuggfaxe kernel: oom_kill_process.cold+0xb/0x10
    Jul 02 07:34:32 skuggfaxe kernel: [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
    Jul 02 07:34:32 skuggfaxe kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=user.slice,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1001.slice/session-6>
    Jul 02 07:34:32 skuggfaxe kernel: Out of memory: Killed process 19311 (python) total-vm:18281732kB, anon-rss:15990580kB, file-rss:1248kB, shmem-rss:4kB, UID:1001 pgtables:32132kB>
    Jul 02 07:34:32 skuggfaxe kernel: oom_reaper: reaped process 19311 (python), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
    Jul 02 07:33:44 skuggfaxe systemd[1]: session-6.scope: A process of this unit has been killed by the OOM killer.
    Jul 03 06:26:15 skuggfaxe systemd[1]: user.slice: A process of this unit has been killed by the OOM killer.
    Jul 03 06:26:15 skuggfaxe systemd[1]: user-1001.slice: A process of this unit has been killed by the OOM killer.
    Jul 03 09:03:59 skuggfaxe sudo[36475]: root : TTY=pts/4 ; PWD=/root ; USER=root ; COMMAND=/usr/bin/journalctl -g oom -S '1 days ago'
    Jul 03 09:04:21 skuggfaxe sudo[36480]: root : TTY=pts/4 ; PWD=/root ; USER=root ; COMMAND=/usr/bin/journalctl -g oom -S '1 days ago'
    Jul 03 09:06:57 skuggfaxe sudo[36497]: arndt : TTY=pts/0 ; PWD=/home/arndt/nateko/run/lumiaMaster ; USER=root ; COMMAND=/usr/bin/journalctl -g oom -S '2 days ago'

isbjoernen commented 4 months ago

Hi Arndt,

I don't have a "best practice", but I would be curious what configuration led to this out-of-memory error, and at which point of the execution it happens exactly. The one time I had such errors was when using a very large state vector (many emission categories). The workaround I found was to switch to the optimizer from scipy, which is much more memory-efficient (and the results are quite comparable).

Cheers, Guillaume

isbjoernen commented 4 months ago

Hi Guillaume,

The OOM errors occurred between 2 and 6 hours into the run (there was one earlier example I did not include). This happened while multitracer.py was being called, I think, sometimes after it had been called 30 times and once after 259 times. However, these runs were on skuggfaxe, which is a VM on fsicos4 with 16 cores. With ncpus=8 in previous runs with either LumiaMaster or LumiaDA I did not experience these issues using congrad. In these later runs, however, I was more aggressive with resources, using ncores=14 and 12, and that is when things went south.

Also, the example run I got from Yohanna does use more emission categories (5 instead of 3) than earlier runs, but even so, with 8 cpus everything went fine 2 weeks ago.

I also noticed that the Pool() statement used in the uncertainties calculations always requests the maximum number of available cpus regardless of the user's choice, whereas in the transport model section the --ncpus option is honored. I was trying to limit the former in the same way, to see if it makes any difference. Before I knew about the system OOM, I had started last night to debug my issues, but only got as far as the uncertainties calculations; there I got a memory error in the debugger after starting the Pool() process. That was my first hint at the memory issues, which prompted me to get in touch with Andre, who looks after skuggfaxe.

I wonder if one could dream up a monitoring routine for the available memory that would prevent further processes from being spawned once memory usage reaches, say, 80% of what is available...

Anyway, thanks for the suggestion of using scipy. I can give it a shot.

Cheers,

Arndt

isbjoernen commented 4 months ago

Hi Arndt,

On 03 Jul 2024 11.08, Arndt Meier wrote:

> Hi Guillaume,

> the OOM errors occurred between 2 and 6 hours into the run (there was one earlier example I did not include) - this was while multitracer.py was called, I think, sometimes after being called 30 times and once after 259 times.

If it happens while multitracer.py is called, then it has nothing to do with the optimizer, and the only real solution is to reduce the number of processes. Depending on your configuration, it might be worth using fewer processes in the adjoint than in the forward. In my code (i.e. more or less the latest version on github), this can be achieved through the "model.extra_arguments" keys in the yaml file:

    model:
      extra_arguments:
        apri: --check-footprints --copy-footprints ${machine.footprints_cache}
        adjoint: -n 5
        var4d: -n ${run.ncores}
        apos: -n ${run.ncores}

It's probably useful for you to do something like that (the adjoint processes write big files, which the main process reads again just afterwards, and this takes a lot of time, so a compromise can be to perform the adjoint on fewer cores: it's a bit slower, but less time is wasted in I/O). It will definitely reduce the memory requirements in adjoint runs. In forward runs it won't change anything, but I'm not sure whether anything needs changing there (I don't know how the netCDF library handles memory when several processes point to the same netCDF file in read-only mode; I suspect they all point to the same memory addresses, so opening the file multiple times doesn't increase the memory requirements).

Another thing you should check is where the temporary files are written. By default, I tend to write a lot of stuff in /dev/shm (i.e. in a RAM file system), as it's a lot faster. But if you are memory limited, it's obviously not a good idea.
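As a quick, generic check of where scratch files would end up on a given machine (a standalone sketch, not something Lumia provides; it only inspects Python's default temporary directory, which may differ from the paths set in the Lumia configuration):

```python
import shutil
import tempfile

# Where do temporary files go by default, and how much space is there?
# If this points at /dev/shm (a RAM-backed tmpfs), large scratch files
# written there count against physical memory.
tmpdir = tempfile.gettempdir()
usage = shutil.disk_usage(tmpdir)
print(f"default temporary directory: {tmpdir}")
print(f"free: {usage.free / 2**30:.1f} GiB of {usage.total / 2**30:.1f} GiB")
```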

> However, these were run on skuggfaxe which is a VM on fsicos4 with 16 cores. With ncpus=8 in previous runs with either LumiaMaster or LumiaDA I did not experience these issues using congrad. However, these later runs I was more aggressive on resources using ncores=14 and 12 and that is when things went south.

How much memory do you have on that machine?

> Also, the example run I got from Yohanna indeed uses more emissions categories (5 instead of 3) compared to earlier runs, but even so, with 8 cpus all went well 2 weeks ago.

How large is your emission file?

> Also I noticed that in the Pool() statement used in the uncertainties calculations always the max number of available cpus is requested regardless of user choices, while in the transport model section, the --ncpus option is honored. I was just trying to limit the former in the same way, to see if it makes any difference, because before knowing about the system OOM, I had started last night to debug my issues, but only got as far as the uncertainties calculations. There I got a memory error in the debugger after starting the Pool() process. So from there I got my first hint on the memory issues, which prompted me to get in touch with Andre who looks after skuggfaxe.

I think it's fine to leave the Pool the way it is. It's just used to compute the distances between points, and it's not heavy in memory at all (it also occurs right at the beginning of the simulation, so if your run crashes after two hours, that's not the issue).

> I wonder if one could dream up a monitoring routine for the available memory that could prevent further processes being spawned once memory usage reaches say 80% of all that is available....

But in that case, some processes would have to wait for others to complete, so you still get the overhead from parallelization without the benefits. Instead you should dynamically tune the number of parallel processes. You can probably do something like that based on the output of psutil.virtual_memory() ... good luck ;-)
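A minimal sketch of that idea (not Lumia code; the helper name and the per-worker memory estimate are made up for illustration) using psutil.virtual_memory() to cap the number of workers before a Pool is started:

```python
import psutil

def safe_worker_count(requested: int, per_worker_gib: float, headroom: float = 0.8) -> int:
    """Cap the number of parallel workers so that their estimated combined
    memory stays below `headroom` (e.g. 80%) of total RAM.
    `per_worker_gib` has to be estimated by the user, e.g. from `top` output."""
    vm = psutil.virtual_memory()
    used = vm.total - vm.available
    budget_gib = (headroom * vm.total - used) / 2**30  # memory still free before hitting the cap
    if per_worker_gib <= 0:
        return requested
    return max(1, min(requested, int(budget_gib // per_worker_gib)))

# e.g. 14 workers requested, each expected to need roughly 12 GiB resident
print(safe_worker_count(14, per_worker_gib=12.0))
```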


isbjoernen commented 4 months ago

Guillaume: I think it's fine to leave the Pool [in the uncertainties calculations] the way it is. It's just used to compute the distances between points, it's not heavy in memory at all (and it occurs really at the beginning of the simulation, so if your run crashes after two hours, that's not the issue).

I think this was only an issue for me while trying to debug from within my IDE, which seems to add significant overhead. In the screenshot from 'top' you can see 14 processes occupying 17.6g of virtual memory each (I limited the pool to 14 out of 16 cpus); if run normally, these processes take 12.3g each. Perhaps I'm overly cautious, but as a rule of thumb I don't give more than 90% of the available cpus to a single Pool() and always leave at least one cpu free, so the system has a chance to react to human or other intervention, monitoring, etc. (a small sketch of that rule follows at the end of this comment).

(screenshot: top)

At this stage, memory does not seem to be an issue (yet). While the above is running I get:

    $ free -m
                   total        used        free      shared  buff/cache   available
    Mem:           62805       15140       18894           1       29477       47664
    Swap:              0           0           0
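For what it's worth, the rule of thumb above can be written down in a few lines (a sketch only, not code from Lumia; the placeholder workload stands in for whatever the Pool actually computes):

```python
import os
from multiprocessing import Pool

# Rule of thumb from above: use at most ~90% of the cpus and always leave
# at least one core free for the system.
ncpu = os.cpu_count() or 1
pool_size = max(1, min(int(0.9 * ncpu), ncpu - 1))  # 14 on a 16-core machine

if __name__ == "__main__":
    with Pool(pool_size) as pool:
        results = pool.map(abs, range(10))  # placeholder workload
    print(pool_size, results)
```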

isbjoernen commented 4 months ago

I got around the OOM error described above by reducing the number of cores allowed in the adjoint run, as suggested by Guillaume. I have pushed a minor change to the masterPlus branch that ensures that the key model:extra_arguments:adjoint: -n 6 is favored in the transport.py adjoint over the also-present key '*': -n ${machine.ncores}. Thus, setting model:extra_arguments:adjoint: -n to about half of ${machine.ncores} in your lumia config.yaml file might be a good, conservative first guess on a "low memory" system when trying to find good values for these machine-tuning parameters.
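To make that concrete, the relevant part of the yaml could look roughly like this (the '*' and adjoint keys are the ones mentioned above; the exact surrounding layout may differ in your configuration, and -n 6 simply assumes machine.ncores is around 12):

```yaml
model:
  extra_arguments:
    '*': -n ${machine.ncores}   # default for the transport-model calls
    adjoint: -n 6               # override for the adjoint: roughly half of machine.ncores
```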

I repeated the test run I got from Yohanna with the latest changes and it completed successfully (run id LumiaMasterPlus-2024-07-04T14_58). So you can now go ahead and try the masterPlus branch for yourself.