icl-utk-edu / cluster

Can't easily find core files #2

Open G-Ragghianti opened 7 months ago

G-Ragghianti commented 7 months ago

@abouteiller

abouteiller commented 7 months ago

A bit more detail here.

ulimit -c 0 is the default, and that is good; it should stay that way. Users who need core files should request them with ulimit -c unlimited.

What happens when we run with core generation enabled

Running srun -wleconte -n1 testing_redistribution...

Core files are written to /var/lib/systemd/coredump/core.testing_redistr.1003.87f81996462b4acbb4e80cb69dbe57c2.1901042.1707242127000000.zst on the node where the task ran (e.g., Leconte, when the user is on Methane).

Investigating the bug is only possible by running a series of commands: look up the name of the core file on the remote node, decompress it, and then run gdb on it.

srun -wleconte ls /var/lib/systemd/coredump
srun -wleconte unzstd /var/lib/systemd/coredump/core.testing_redistr.1003.87f81996462b4acbb4e80cb69dbe57c2.1901042.1707242127000000.zst -o $HOME/core
gdb testing_redistribution $HOME/core
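
The manual lookup above could be wrapped in a small helper. This is a minimal sketch, not a tested site tool: latest_core is a hypothetical name, and it assumes the file naming seen above, which appears to be core.<comm>.<uid>.<boot-id>.<pid>.<timestamp>.zst with the command name cut to 15 characters (testing_redistribution becomes testing_redistr).

```shell
# Hypothetical helper: print the newest compressed core for an executable.
# Assumes systemd-coredump file names of the form
#   core.<comm>.<uid>.<boot-id>.<pid>.<timestamp>.zst
latest_core() (
    # Strip any leading directory from the executable path.
    exe=${1##*/}
    # Default to the directory seen in this issue; override with $2.
    dir=${2:-/var/lib/systemd/coredump}
    # Only the first 15 characters of the command name appear in the file name.
    comm=$(printf '%s' "$exe" | cut -c1-15)
    # List matches newest first (by modification time) and keep the first one.
    ls -t "$dir"/core."$comm".*.zst 2>/dev/null | head -n 1
)
```

Because the dump directory is local to the compute node, the function has to run there, for example from a script on a shared filesystem invoked through srun -wleconte. The path it prints is the one to hand to unzstd.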

What we'd like

  1. Core files are collected on a shared scratch filesystem that is easily referenced as /cores
  2. Core files can be passed to gdb directly. Compressing them with zstandard saves space, but our gdb version cannot read compressed cores; maybe we need a newer gdb from spack?

We had a similar setup on Saturn that can probably be replicated here. In particular, the scratch filesystem used relaxed locking/consistency semantics to avoid NFS freaking out when several nodes write to it at the same time.
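
One possible way to get both wishes at once, sketched here as an assumption rather than a tested configuration, is to point the kernel's core_pattern at the shared directory. That bypasses systemd-coredump and its zstd compression, so gdb can open the file directly, at the cost of the space saving. The helper name core_pattern_conf and the /cores default are illustrative only; %e, %p, %h and %t are the standard core_pattern specifiers for executable name, PID, hostname and dump time.

```shell
# Sketch: print a sysctl line that sends uncompressed cores to a shared directory.
# core_pattern_conf is a hypothetical helper; /cores is the path wished for above.
core_pattern_conf() {
    # %% in the printf format yields a literal % in the output.
    printf 'kernel.core_pattern = %s/core.%%e.%%p.%%h.%%t\n' "${1:-/cores}"
}

core_pattern_conf /cores
```

An administrator could write the printed line to a drop-in file under /etc/sysctl.d/ and load it with sysctl --system on each node. Users would still opt in with ulimit -c unlimited, and the directory would need to be writable by the crashing process.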