uafgeotools / mtuq

moment tensor uncertainty quantification
BSD 2-Clause "Simplified" License

C/C++ Pointer error when running container with apptainer #225

Open jthet opened 9 months ago

jthet commented 9 months ago

I've been getting errors when running the MTUQ container on TACC's Frontera through Apptainer. The errors have been nondeterministic, but they have always occurred after the third "about 75 percent finished" message. See below for the stdout. I have also gotten errors such as malloc(): invalid size (unsorted), double free or corruption (out), and corrupted size vs. prev_size in fastbins.

The sif image was freshly pulled and it is the newest version.

c202-001[clx](423)$ APPTAINERENV_SYNGINE_CACHE=syngine_output ibrun apptainer run mtuq_ubuntu20.04.sif python3 /home/scoped/mtuq/examples/DetailedAnalysis.py
TACC:  Starting up job 5796560 
TACC:  Starting parallel tasks... 
  about 0 percent finished
  about 25 percent finished
  about 50 percent finished
  about 75 percent finished
  about 0 percent finished
  about 25 percent finished
  about 50 percent finished
  about 75 percent finished
  about 0 percent finished
  about 25 percent finished
  about 50 percent finished
  about 75 percent finished
free(): invalid pointer

rmodrak commented 9 months ago

Thanks for reporting this issue, which I wasn't aware of previously.

The progress messages you mentioned are from the following Cython function: https://github.com/uafgeotools/mtuq/blob/master/mtuq/misfit/waveform/c_ext_L2.c

In this function, it appears that most or all of the memory allocation/deallocation occurs through the Numpy API.

To start, it is probably worth double checking the NumPy API is being used correctly.

Also, it may be worth double checking the module initialization by comparing it against the Cython docs.

I am hoping that a software developer at my workplace might be able to start looking at the issue in October, but anyone is welcome to try troubleshooting.

In the meantime, if you create the misfit function using WaveformMisfit(optimization_level=1, ...), then mtuq falls back to a slower pure Python implementation in which the Cython extensions are not called.
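A minimal sketch of that workaround might look like the following. Only optimization_level=1 comes from the suggestion above. The other keyword arguments are illustrative values modeled on the body-wave misfit in the mtuq example scripts, build_misfit is a hypothetical helper, and the import path mtuq.misfit.WaveformMisfit is an assumption, so adjust all of these to match your own script.

```python
# Illustrative keyword arguments (assumed, not taken from this thread),
# except optimization_level=1, which is the workaround suggested above.
misfit_kwargs = dict(
    norm='L2',
    time_shift_min=-2.,
    time_shift_max=+2.,
    time_shift_groups=['ZR'],
    optimization_level=1,   # fall back to the pure Python implementation
)


def build_misfit(**kwargs):
    """Hypothetical helper: construct the misfit function inside the container.

    The import is deferred so this sketch can be read and parsed
    without mtuq installed; the import path is an assumption.
    """
    from mtuq.misfit import WaveformMisfit
    return WaveformMisfit(**kwargs)


# Inside the container, something like:
#   misfit_bw = build_misfit(**misfit_kwargs)
```

If the crash disappears with this setting, that would point toward the compiled extension rather than elsewhere in the workflow.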

rmodrak commented 9 months ago

As expected for such a generic error message, free(): invalid pointer brings up a very large number of Stack Overflow and other search results.

Interestingly though, many of the top results appear to be Cython related, including, for example, a still apparently unresolved PyTorch issue.
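Because the message itself is so generic, it may help to find out which Python call was active when glibc aborted. One possible approach, sketched below, is Python's standard faulthandler module, which dumps the Python traceback on fatal signals such as SIGABRT (the signal glibc raises after free(): invalid pointer). The child script here is a stand-in that simply calls os.abort(); it is not mtuq code.

```python
import subprocess
import sys
import textwrap

# Stand-in child process: enable faulthandler, then abort the way
# glibc would after detecting heap corruption.
child = textwrap.dedent("""
    import faulthandler
    import os

    faulthandler.enable()   # dump the Python traceback on SIGABRT, SIGSEGV, ...
    os.abort()              # raises SIGABRT, like glibc's heap checks do
""")

result = subprocess.run(
    [sys.executable, "-c", child],
    capture_output=True,
    text=True,
)

# stderr now contains a "Fatal Python error" header followed by
# the Python-level stack of the aborted process.
print(result.stderr)
```

The same handler can be switched on without editing any script by passing -X faulthandler to the interpreter, for example python3 -X faulthandler /home/scoped/mtuq/examples/DetailedAnalysis.py in the command from the original report. The traceback only identifies the Python frame that was running at the time of the abort, and heap corruption may have happened earlier than that.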