Open G-Ragghianti opened 7 months ago
A few more details here:
`ulimit -c 0` is the default, and that is good; it should stay like that. If core files are needed, the user should request them using `ulimit -c unlimited`.
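To keep the default untouched and opt in for a single run only, the limit can be changed inside a subshell. This is a minimal sketch, assuming a POSIX-like shell; `with_core_limit` is a hypothetical helper name, and whether the limit actually reaches the remote task through `srun` depends on the cluster's Slurm configuration.

```shell
# Run a command with a specific core-file size limit, without changing
# the limit of the calling shell (the ulimit call happens in a subshell).
# Usage: with_core_limit LIMIT COMMAND [ARGS...]
with_core_limit() {
    limit=$1
    shift
    (
        # Raising the limit fails if the hard limit is lower; stop in that case.
        ulimit -c "$limit" || exit 1
        "$@"
    )
}

# Example (not executed here):
#   with_core_limit unlimited srun -wleconte -n1 testing_redistribution
```

The calling shell keeps `ulimit -c 0`, so other commands in the same session still produce no core files.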
Running srun -wleconte -n1 testing_redistribution...
Core files are created in `/var/lib/systemd/coredump/core.testing_redistr.1003.87f81996462b4acbb4e80cb69dbe57c2.1901042.1707242127000000.zst` on the system on which the task is run (e.g., Leconte, when the user is on Methane).
Investigating the bug is only possible by running a set of commands on the remote node to find the name of the core file and extract it, and then running gdb:
```shell
srun -wleconte ls /var/lib/systemd/coredump
srun -wleconte unzstd /var/lib/systemd/coredump/core.testing_redistr.1003.87f81996462b4acbb4e80cb69dbe57c2.1901042.1707242127000000.zst -o $HOME/core
gdb testing_redistribution core
```
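The manual lookup of the file name could be scripted. The sketch below is only an illustration: `latest_core` is a hypothetical helper, and it assumes the file names follow the pattern seen above (`core.<command>.<uid>.<boot id>.<pid>.<timestamp>.zst`), with the timestamp as the second-to-last dot-separated field. Note that the command name appears truncated in the file name (`testing_redistr` above), so the truncated form has to be passed in.

```shell
# Read core file names on stdin (one per line) and print the one with the
# largest timestamp whose name starts with "core.<command>.".
# Usage: ls /var/lib/systemd/coredump | latest_core testing_redistr
latest_core() {
    awk -F. -v p="core.$1." '
        # Keep only names with the expected prefix; fields are counted from
        # the end, so $(NF-1) is the timestamp just before the "zst" suffix.
        index($0, p) == 1 && $(NF-1) + 0 > best { best = $(NF-1) + 0; name = $0 }
        END { if (name != "") print name }
    '
}

# Example (not executed here), reusing the commands from this comment:
#   name=$(srun -wleconte ls /var/lib/systemd/coredump | latest_core testing_redistr)
#   srun -wleconte unzstd "/var/lib/systemd/coredump/$name" -o "$HOME/core"
#   gdb testing_redistribution "$HOME/core"
```

If no file matches, the function prints nothing, so the caller should check that `$name` is non-empty before calling `unzstd`.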
/cores
gdb
directly. The fact that they are compressed with zstandard saves space, but our gdb version cannot read that; maybe we need a newer version of the gdb Spack package? We had a similar setup on Saturn that can probably be replicated here. In particular, the scratch filesystem there used relaxed locking/consistency semantics to keep NFS from misbehaving when written to from multiple nodes at the same time.
@abouteiller