Open aizvorski opened 2 years ago
Update: the exact number of atoms which causes the crash is 834. The number of orbitals doesn't seem to matter, it really is atoms.
Works: 833 helium atoms https://gist.github.com/aizvorski/a6616970339d8447a98989b4d0455db8#file-helium833-xyz
Crashes: 834 helium atoms https://gist.github.com/aizvorski/b7b65913c1a52379937afc76b38c3450#file-helium834-xyz
This works fine for me, once I set 'ulimit -s unlimited' and 'export OMP_STACKSIZE=4G':
xtb he834.xyz --namespace test
* xtb version 6.5.1 (579679a) compiled by 'ehlert@majestix' on 2022-07-11
...
------------------------------------------------------------------------
* finished run on 2022/10/04 at 09:11:23.662
------------------------------------------------------------------------
total:
* wall-time: 0 d, 0 h, 0 min, 12.147 sec
* cpu-time: 0 d, 0 h, 0 min, 58.211 sec
* ratio c/w: 4.792 speedup
SCF:
* wall-time: 0 d, 0 h, 0 min, 11.557 sec
* cpu-time: 0 d, 0 h, 0 min, 55.357 sec
* ratio c/w: 4.790 speedup
normal termination of xtb
@haneug I can confirm this: the process stack limit (ulimit -s) was the limiting factor. ulimit -s unlimited works.
I think it's fair to say any SIGSEGV crash is a bug, since the user cannot distinguish it from other bugs such as an out-of-bounds pointer, and nothing tells the user what to do to make the calculation succeed.
Since this is likely to be a thing a lot of folks run into, I'm going to suggest one of two things:
While educating users about this setting seems error-prone, there are not really many alternatives or, better put, not many universal solutions. A simple band-aid solution could be a shell wrapper around xtb which sets those values by default.
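A minimal sketch of such a wrapper, assuming a POSIX shell and an xtb executable on PATH. The function name prepare_stack_env is made up for illustration, and the 4G fallback is simply the value that worked earlier in this thread, not a tested recommendation.

```shell
# Hypothetical helper, not shipped with xtb: prepare the environment before launching xtb.
prepare_stack_env() {
    # Raise the soft stack limit as far as allowed: try "unlimited" first,
    # then fall back to the hard limit; on failure keep the current limit.
    ulimit -s unlimited 2>/dev/null \
        || ulimit -s "$(ulimit -H -s)" 2>/dev/null \
        || true
    # Respect a user-supplied OMP_STACKSIZE; otherwise use 4G (assumed default).
    : "${OMP_STACKSIZE:=4G}"
    export OMP_STACKSIZE
}
```

A wrapper script would call prepare_stack_env and then hand over with exec xtb "$@", so that user-set values still take precedence over the defaults.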
Back to the problem. So far I found a solution for macOS (using -Wl,-stack_size,0x1000000) and Windows (using /STACK:16777216).
On Linux we have the possibility to use the system calls getrlimit(2) / setrlimit(2) to retrieve the current stack limit and warn the user if it is not sufficient (note that system call here does not refer to Fortran's call system but to the use of a function from the Linux kernel). I don't know whether setrlimit(2) is sufficient to increase the stack size at runtime; this sounds like something a process should not be allowed to do without elevated permissions, but it may be worth a try.
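From the shell, ulimit -s and ulimit -H -s report the soft and hard values of the same limit that getrlimit(2) exposes as RLIMIT_STACK. According to the setrlimit(2) man page, an unprivileged process may raise its soft limit up to the hard limit; only raising the hard limit needs elevated privileges. Whether raising it from inside an already running xtb process is enough has not been tested here. A rough sketch of the warning logic follows; the function names and the 16 MiB threshold (borrowed from the link-time values above) are assumptions.

```shell
# Hypothetical check: succeed if the stack limit (in KiB, as printed by `ulimit -s`)
# is at least the required size. The limit defaults to the current soft limit.
stack_limit_ok() {
    required_kib=$1
    current=${2:-$(ulimit -s)}
    [ "$current" = "unlimited" ] || [ "$current" -ge "$required_kib" ]
}

# Print a warning to stderr when the soft limit is below 16 MiB (assumed threshold).
warn_if_stack_small() {
    if ! stack_limit_ok 16384; then
        echo "warning: stack limit is $(ulimit -s) KiB (hard limit: $(ulimit -H -s)); consider 'ulimit -s unlimited'" >&2
    fi
}
```

The optional second argument of stack_limit_ok only exists so the comparison can be exercised without changing the real limit.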
The OpenMP stack size issue is more severe; so far I have found no good way to detect a stack that is too small. However, I believe this is a problem that can be solved on the algorithm side. For example, I could restructure most OpenMP regions in s-dftd3 to not put large arrays on the OpenMP stack, which almost completely eliminates stack overflows on either the system or the OpenMP stack. This might be a way forward for xtb as well. The implementation, however, becomes somewhat more verbose about memory allocations.
Regarding stack usage, there are many insightful discussions on the use of stack vs. heap arrays in the Fortran discourse:
That issue actually comes up a lot, not only in xtb. The only surefire method so far seems to be to avoid putting any large arrays on any stack and instead do the heap allocation explicitly.
@awvwgk Thanks, that's a good collection of links! I don't know too much about Fortran specifically, but perhaps using some compiler feature to avoid large arrays on the stack (without having to modify code) might work.
What compiler are release xtb binaries compiled with now?
It looks like gfortran doesn't yet have any way of doing this, but Intel ifx -heap-arrays [size] (docs) and NVIDIA/PGI nvfortran -Mnostack_arrays (docs) might do the job.
(Bonus: ifx and nvfortran can both compile OpenMP code to run on GPU :)
@awvwgk About OMP_STACKSIZE: the compiler options to reduce stack use may also apply to OpenMP code, but if not, maybe we could default to OMP_STACKSIZE=physical memory/number of threads? That's only if OMP_STACKSIZE environment variable is unset of course; if it is set, then use the value and maybe warn if it is low.
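A sketch of that default in shell form, assuming Linux (/proc/meminfo) and nproc from coreutils; the function names are made up for illustration. OMP_STACKSIZE accepts a K suffix for units of 1024 bytes, which matches the KiB value reported as MemTotal.

```shell
# Compute "physical memory / number of threads" as an OMP_STACKSIZE value.
# Arguments: total memory in KiB, thread count. Output uses the K (KiB) suffix.
default_omp_stacksize() {
    echo "$(( $1 / $2 ))K"
}

# Apply the default only when OMP_STACKSIZE is unset or empty (Linux-only sketch).
set_default_omp_stacksize() {
    [ -n "${OMP_STACKSIZE:-}" ] && return 0
    mem_kib=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
    threads=${OMP_NUM_THREADS:-$(nproc)}
    threads=${threads%%,*}   # nested lists like "4,2": use the first level only
    OMP_STACKSIZE=$(default_omp_stacksize "$mem_kib" "$threads")
    export OMP_STACKSIZE
}
```

On a large-memory machine this asks for a very large stack per thread. Whether every OpenMP runtime accepts such values has not been checked here, so capping the result may be sensible.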
It looks like gfortran doesn't yet have any way of doing this, but Intel ifx -heap-arrays [size] (docs) and NVIDIA/PGI nvfortran -Mnostack_arrays (docs) might do the job.
Those apply to automatic arrays. Since we don't use automatic arrays in xtb, the option to put them on the heap will not change the program behavior. Maybe providing a custom allocator in the OpenMP directive might do the trick.
(Bonus: ifx and nvfortran can both compile OpenMP code to run on GPU :)
I'm really looking forward to seeing the first LLVM-based Fortran compiler working for a code base using moderately new Fortran features (F2003+).
I found how to fix this bug for Windows.
1) Install MSVC
2) Use Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.35.32215\bin\Hostx64\x64\editbin.exe to patch xtb.exe:
editbin.exe /STACK:64000000 xtb.exe
Describe the bug
Systems larger than approx 831-833 atoms always crash. This doesn't seem to depend on what the systems are (tried a few different types of systems, from one long linear molecule to many small ones with different atoms, all behave the same), and also doesn't depend on the coordinates (molecules near each other in different orientations, or very far apart). It also doesn't seem related to the OpenMP stack size.
To Reproduce
Using the provided water278.xyz file: https://gist.github.com/aizvorski/641a987e7dfa89eba4ce241c68409768#file-water278-xyz
For comparison, an input file water277.xyz with one less water succeeds: https://gist.github.com/aizvorski/7b4215388491126090ba83b6ae4ab341#file-water277-xyz
This does not appear to be due to out-of-memory, or to too-low setting for OMP_STACKSIZE. The machine this was tested on has >200GB memory. The actual memory used when the crash happens (reported by time -v) is just a little over 100MB.
Setting the stack size deliberately very low with the largest input system which succeeds, water277.xyz:
OMP_STACKSIZE=1M OMP_NUM_THREADS=1: succeeds - the stack size seems to not matter when there is only one thread
OMP_STACKSIZE=50M OMP_NUM_THREADS=2: succeeds
OMP_STACKSIZE=20M OMP_NUM_THREADS=2: fails, the exact failure seems non-deterministic - either SIGSEGV in xtb_coulomb_klopm during the iterations, or "Command terminated by signal 11" after iterations finish
GDB backtrace:
Expected behaviour
No crash.
Additional context
Using xtb 6.5.1 binary downloaded from https://github.com/grimme-lab/xtb/releases/download/v6.5.1/xtb-6.5.1-linux-x86_64.tar.xz
xtb --version gives
version 6.5.1 (579679a) compiled by 'ehlert@majestix' on 2022-07-11
OS: Ubuntu 18.04.4 LTS
Hardware: AMD EPYC 7B13 CPU, 224GB RAM
(also tested on Ubuntu 20.04 LTS, Intel i7-10510U, 48GB RAM: same behavior)
(also tested on xtb-6.5.0 and 6.4.1: same)