pace-neutrons / Horace

Horace is a suite of programs for the visualization and analysis of large datasets from time-of-flight neutron inelastic scattering spectrometers.
https://pace-neutrons.github.io/Horace/stable/
GNU General Public License v3.0

Out-of-memory error crashing DAaaS #1654

Closed mducle closed 4 months ago

mducle commented 4 months ago

A user on MAPS wanted very fine energy step sizes (0.1% of Ei; our typical setting is 0.25% of Ei). This resulted in rather large nxspe files (~1GB per file) and caused Matlab to request too much memory from the kernel. Because the Linux kernel overcommits memory by policy and we typically run Mantid together with Horace, occasionally both Mantid and Horace will actually use (access) the memory they allocated (which the kernel overcommitted). This results in a kernel panic and the DAaaS VM (workspace) rebooting.
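For anyone wanting to check how far a workspace is overcommitted, here is a minimal sketch in Python that reads the standard Linux files `/proc/sys/vm/overcommit_memory` and `/proc/meminfo`. The function names are just for illustration and are not part of Horace or Mantid. It only reports what the kernel has promised (`Committed_AS`) against RAM plus swap; it does not say which process is responsible.

```python
from pathlib import Path

# Meanings of vm.overcommit_memory, per the Linux kernel documentation.
OVERCOMMIT_MODES = {
    0: "heuristic overcommit (default)",
    1: "always overcommit",
    2: "strict accounting, no overcommit beyond the commit limit",
}

def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB or counts)."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if sep and rest.split():
            info[key.strip()] = int(rest.split()[0])
    return info

def describe_overcommit(mode, meminfo):
    """Summarise the overcommit policy and how much memory has been promised."""
    committed = meminfo.get("Committed_AS", 0)
    total = meminfo.get("MemTotal", 0) + meminfo.get("SwapTotal", 0)
    return {
        "policy": OVERCOMMIT_MODES.get(mode, "unknown"),
        "committed_kb": committed,
        "ram_plus_swap_kb": total,
        # True when the kernel has promised more than RAM plus swap can back.
        "overcommitted": committed > total,
    }

if __name__ == "__main__":
    proc = Path("/proc")
    try:
        mode = int((proc / "sys/vm/overcommit_memory").read_text())
        meminfo = parse_meminfo((proc / "meminfo").read_text())
        print(describe_overcommit(mode, meminfo))
    except OSError:
        print("No readable /proc here; run this on the Linux workspace.")
```

If `overcommitted` is true while both Mantid and Matlab are running, the machine is relying on the two never touching all of their allocations at the same time.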

The crashes are happening during the tmp construction step of gen_sqw. In some cases we're seeing memory allocations spiking by around 50GB. The crashes stop when running in serial mode or with fewer (e.g. 2 instead of 6) workers. As a temporary solution, we're setting it so that the kernel will try to kill Matlab first if it accesses overcommitted memory, rather than panic.
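One way this kind of "kill Matlab first" preference can be expressed on Linux is through the per-process `oom_score_adj` file. This is a sketch of that general mechanism, not necessarily the exact change made on DAaaS. The helper name and the `proc_root` parameter are hypothetical (the latter exists only so the function can be tried against a scratch directory). Writing to another user's process needs suitable permissions, and the OOM killer only gets a chance to act if `vm.panic_on_oom` is 0.

```python
from pathlib import Path

# Kernel-defined range: 1000 makes a process the preferred victim,
# -1000 exempts it from the OOM killer entirely.
OOM_SCORE_ADJ_MIN = -1000
OOM_SCORE_ADJ_MAX = 1000

def prefer_as_oom_victim(pid, score=OOM_SCORE_ADJ_MAX, proc_root="/proc"):
    """Write oom_score_adj for `pid` so the OOM killer favours killing it."""
    if not OOM_SCORE_ADJ_MIN <= score <= OOM_SCORE_ADJ_MAX:
        raise ValueError("oom_score_adj must be between -1000 and 1000")
    target = Path(proc_root) / str(pid) / "oom_score_adj"
    target.write_text(f"{score}\n")
    return target
```

It would be called as `prefer_as_oom_victim(matlab_pid)`, where `matlab_pid` is the process ID of the Matlab session or worker. Child processes inherit the value, so setting it on the parent before the workers start should cover them too.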

In the longer term it would be good to try to profile the code to see where memory could be saved in the gen_sqw process.
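As a coarse first step before profiling inside Matlab, the peak resident memory of each worker process can be read from `/proc/<pid>/status` (the `VmHWM` field). The sketch below is an assumption about how one might do that from outside Matlab; the function names are illustrative and it does not identify which part of gen_sqw allocates the memory.

```python
from pathlib import Path

def parse_status_kb(text, fields=("VmRSS", "VmHWM")):
    """Extract selected memory fields (in kB) from /proc/<pid>/status text."""
    found = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if sep and key in fields:
            found[key] = int(rest.split()[0])
    return found

def total_peak_gib(status_texts):
    """Sum peak resident memory (VmHWM) over several processes, in GiB.

    This is an upper bound on the combined peak, since the individual
    peaks need not have happened at the same moment.
    """
    total_kb = sum(parse_status_kb(t).get("VmHWM", 0) for t in status_texts)
    return total_kb / 1024**2

def read_status(pid, proc_root="/proc"):
    """Read the raw status text for one process."""
    return (Path(proc_root) / str(pid) / "status").read_text()
```

Comparing `total_peak_gib` for runs with 2 and 6 workers would show whether the spike scales with worker count or is dominated by a single process.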

abuts commented 4 months ago

Fixed through Re #1684 and by the IDAAAS team providing a different FS driver.