manodeep / Corrfunc

⚡️⚡️⚡️Blazing fast correlation functions on the CPU.
https://corrfunc.readthedocs.io
MIT License

Potential Memory corruption. #162

Closed rainwoodman closed 4 years ago

rainwoodman commented 6 years ago

nbodykit is crashing on Travis, likely in Corrfunc 2.0.0.

I ran valgrind, and this line looked suspicious. It appears that a locally defined variable is freed yet still passed through as a return value into downstream Python (and MPI) code.

==15132== Invalid read of size 4
==15132==    at 0x4EBFE53: PyObject_Free (in /home/yfeng1/anaconda3/install/envs/2.7/lib/libpython2.7.so.1.0)
==15132==    by 0x1586B5D7: __pyx_f_6mpi4py_3MPI_PyMPI_copy (in /home/yfeng1/anaconda3/install/envs/2.7/lib/python2.7/site-packages/mpi4py/MPI.so)
==15132==    by 0x1586BF67: __pyx_f_6mpi4py_3MPI_PyMPI_reduce_p2p (in /home/yfeng1/anaconda3/install/envs/2.7/lib/python2.7/site-packages/mpi4py/MPI.so)
==15132==    by 0x1588494C: __pyx_pw_6mpi4py_3MPI_4Comm_219allreduce (in /home/yfeng1/anaconda3/install/envs/2.7/lib/python2.7/site-packages/mpi4py/MPI.so)
==15132==    by 0x4F1BCC3: PyEval_EvalFrameEx (in /home/yfeng1/anaconda3/install/envs/2.7/lib/libpython2.7.so.1.0)
==15132==    by 0x4F1D4E8: PyEval_EvalCodeEx (in /home/yfeng1/anaconda3/install/envs/2.7/lib/libpython2.7.so.1.0)
==15132==    by 0x4EA6376: function_call (in /home/yfeng1/anaconda3/install/envs/2.7/lib/libpython2.7.so.1.0)
==15132==    by 0x4E817A2: PyObject_Call (in /home/yfeng1/anaconda3/install/envs/2.7/lib/libpython2.7.so.1.0)
==15132==    by 0x4E9063C: instancemethod_call (in /home/yfeng1/anaconda3/install/envs/2.7/lib/libpython2.7.so.1.0)
==15132==    by 0x4E817A2: PyObject_Call (in /home/yfeng1/anaconda3/install/envs/2.7/lib/libpython2.7.so.1.0)
==15132==    by 0x4F17B68: PyEval_EvalFrameEx (in /home/yfeng1/anaconda3/install/envs/2.7/lib/libpython2.7.so.1.0)
==15132==    by 0x4F1D4E8: PyEval_EvalCodeEx (in /home/yfeng1/anaconda3/install/envs/2.7/lib/libpython2.7.so.1.0)
==15132==  Address 0x12ba3020 is 464 bytes inside a block of size 1,176 free'd
==15132==    at 0x4C2DD18: free (vg_replace_malloc.c:530)
==15132==    by 0x41F66633: free_cellarray_index_particles_double (gridlink_impl_double.c:76)
==15132==    by 0x41F55935: countpairs_double (countpairs_impl_double.c:527)
==15132==    by 0x41F46C8F: countpairs_countpairs (_countpairs.c:1349)
==15132==    by 0x4F1A20F: PyEval_EvalFrameEx (in /home/yfeng1/anaconda3/install/envs/2.7/lib/libpython2.7.so.1.0)
==15132==    by 0x4F1D4E8: PyEval_EvalCodeEx (in /home/yfeng1/anaconda3/install/envs/2.7/lib/libpython2.7.so.1.0)
==15132==    by 0x4EA6376: function_call (in /home/yfeng1/anaconda3/install/envs/2.7/lib/libpython2.7.so.1.0)
==15132==    by 0x4E817A2: PyObject_Call (in /home/yfeng1/anaconda3/install/envs/2.7/lib/libpython2.7.so.1.0)
==15132==    by 0x4F164BD: PyEval_EvalFrameEx (in /home/yfeng1/anaconda3/install/envs/2.7/lib/libpython2.7.so.1.0)
==15132==    by 0x4F1BF9D: PyEval_EvalFrameEx (in /home/yfeng1/anaconda3/install/envs/2.7/lib/libpython2.7.so.1.0)
==15132==    by 0x4F1D4E8: PyEval_EvalCodeEx (in /home/yfeng1/anaconda3/install/envs/2.7/lib/libpython2.7.so.1.0)
==15132==    by 0x4F1A9B7: PyEval_EvalFrameEx (in /home/yfeng1/anaconda3/install/envs/2.7/lib/libpython2.7.so.1.0)
==15132==  Block was alloc'd at
==15132==    at 0x4C2CB6B: malloc (vg_replace_malloc.c:299)
==15132==    by 0x41F6EC26: my_malloc (utils.c:497)
==15132==    by 0x41F69E74: assign_ngb_cells_index_particles_double (gridlink_impl_double.c:548)
==15132==    by 0x41F5538D: countpairs_double (countpairs_impl_double.c:299)
==15132==    by 0x41F46C8F: countpairs_countpairs (_countpairs.c:1349)
==15132==    by 0x4F1A20F: PyEval_EvalFrameEx (in /home/yfeng1/anaconda3/install/envs/2.7/lib/libpython2.7.so.1.0)
==15132==    by 0x4F1D4E8: PyEval_EvalCodeEx (in /home/yfeng1/anaconda3/install/envs/2.7/lib/libpython2.7.so.1.0)
==15132==    by 0x4EA6376: function_call (in /home/yfeng1/anaconda3/install/envs/2.7/lib/libpython2.7.so.1.0)
==15132==    by 0x4E817A2: PyObject_Call (in /home/yfeng1/anaconda3/install/envs/2.7/lib/libpython2.7.so.1.0)
==15132==    by 0x4F164BD: PyEval_EvalFrameEx (in /home/yfeng1/anaconda3/install/envs/2.7/lib/libpython2.7.so.1.0)
==15132==    by 0x4F1BF9D: PyEval_EvalFrameEx (in /home/yfeng1/anaconda3/install/envs/2.7/lib/libpython2.7.so.1.0)
==15132==    by 0x4F1D4E8: PyEval_EvalCodeEx (in /home/yfeng1/anaconda3/install/envs/2.7/lib/libpython2.7.so.1.0)

rainwoodman commented 6 years ago

On 2.1.0rc1, we still have similar errors.

==26472== Invalid read of size 4
==26472==    at 0x4EBFE53: PyObject_Free (in /home/yfeng1/anaconda3/install/envs/2.7/lib/libpython2.7.so.1.0)
==26472==    by 0x4EA5FF8: PyFrame_ClearFreeList (in /home/yfeng1/anaconda3/install/envs/2.7/lib/libpython2.7.so.1.0)
==26472==    by 0x4F4C39B: collect (in /home/yfeng1/anaconda3/install/envs/2.7/lib/libpython2.7.so.1.0)
==26472==    by 0x4F4C970: _PyObject_GC_Malloc (in /home/yfeng1/anaconda3/install/envs/2.7/lib/libpython2.7.so.1.0)
==26472==    by 0x4F4C9F6: _PyObject_GC_NewVar (in /home/yfeng1/anaconda3/install/envs/2.7/lib/libpython2.7.so.1.0)
==26472==    by 0x4ED15CF: PyTuple_New (in /home/yfeng1/anaconda3/install/envs/2.7/lib/libpython2.7.so.1.0)
==26472==    by 0x4F33A4E: do_mktuple (in /home/yfeng1/anaconda3/install/envs/2.7/lib/libpython2.7.so.1.0)
==26472==    by 0x4F3303E: do_mkvalue (in /home/yfeng1/anaconda3/install/envs/2.7/lib/libpython2.7.so.1.0)
==26472==    by 0x4F33BDE: va_build_value (in /home/yfeng1/anaconda3/install/envs/2.7/lib/libpython2.7.so.1.0)
==26472==    by 0x4F3400D: _Py_BuildValue_SizeT (in /home/yfeng1/anaconda3/install/envs/2.7/lib/libpython2.7.so.1.0)
==26472==    by 0x39EFB221: countpairs_countpairs_s_mu (???:3260)
==26472==    by 0x4F1A20F: PyEval_EvalFrameEx (in /home/yfeng1/anaconda3/install/envs/2.7/lib/libpython2.7.so.1.0)
==26472==  Address 0xdf0a020 is 16 bytes after a block of size 16 free'd
==26472==    at 0x4C2DD18: free (vg_replace_malloc.c:530)
==26472==    by 0x39F1B2E5: free_cellarray_index_particles_double (gridlink_impl_double.c:69)
==26472==    by 0x39F31867: countpairs_rp_pi_double (countpairs_rp_pi_impl_double.c:509)
==26472==    by 0x39EFC66B: countpairs_countpairs_rp_pi (???:2464)
==26472==    by 0x4F1A20F: PyEval_EvalFrameEx (in /home/yfeng1/anaconda3/install/envs/2.7/lib/libpython2.7.so.1.0)
==26472==    by 0x4F1D4E8: PyEval_EvalCodeEx (in /home/yfeng1/anaconda3/install/envs/2.7/lib/libpython2.7.so.1.0)
==26472==    by 0x4EA6376: function_call (in /home/yfeng1/anaconda3/install/envs/2.7/lib/libpython2.7.so.1.0)
==26472==    by 0x4E817A2: PyObject_Call (in /home/yfeng1/anaconda3/install/envs/2.7/lib/libpython2.7.so.1.0)
==26472==    by 0x4F164BD: PyEval_EvalFrameEx (in /home/yfeng1/anaconda3/install/envs/2.7/lib/libpython2.7.so.1.0)
==26472==    by 0x4F1BF9D: PyEval_EvalFrameEx (in /home/yfeng1/anaconda3/install/envs/2.7/lib/libpython2.7.so.1.0)
==26472==    by 0x4F1D4E8: PyEval_EvalCodeEx (in /home/yfeng1/anaconda3/install/envs/2.7/lib/libpython2.7.so.1.0)
==26472==    by 0x4F1A9B7: PyEval_EvalFrameEx (in /home/yfeng1/anaconda3/install/envs/2.7/lib/libpython2.7.so.1.0)
==26472==  Block was alloc'd at
==26472==    at 0x4C2CB6B: malloc (vg_replace_malloc.c:299)
==26472==    by 0x39F23876: my_malloc (utils.c:497)
==26472==    by 0x39F1CB51: gridlink_index_particles_double (gridlink_impl_double.c:350)
==26472==    by 0x39F31A28: countpairs_rp_pi_double (countpairs_rp_pi_impl_double.c:269)
==26472==    by 0x39EFC66B: countpairs_countpairs_rp_pi (???:2464)
==26472==    by 0x4F1A20F: PyEval_EvalFrameEx (in /home/yfeng1/anaconda3/install/envs/2.7/lib/libpython2.7.so.1.0)
==26472==    by 0x4F1D4E8: PyEval_EvalCodeEx (in /home/yfeng1/anaconda3/install/envs/2.7/lib/libpython2.7.so.1.0)
==26472==    by 0x4EA6376: function_call (in /home/yfeng1/anaconda3/install/envs/2.7/lib/libpython2.7.so.1.0)
==26472==    by 0x4E817A2: PyObject_Call (in /home/yfeng1/anaconda3/install/envs/2.7/lib/libpython2.7.so.1.0)
==26472==    by 0x4F164BD: PyEval_EvalFrameEx (in /home/yfeng1/anaconda3/install/envs/2.7/lib/libpython2.7.so.1.0)
==26472==    by 0x4F1BF9D: PyEval_EvalFrameEx (in /home/yfeng1/anaconda3/install/envs/2.7/lib/libpython2.7.so.1.0)
==26472==    by 0x4F1D4E8: PyEval_EvalCodeEx (in /home/yfeng1/anaconda3/install/envs/2.7/lib/libpython2.7.so.1.0)

lgarrison commented 6 years ago

Thanks for the report. I scanned the parts of the code referenced by the valgrind dump, but I didn't see anything obvious. It's a little strange because the errors are related to gridlink, but we never return any gridlink-allocated arrays to the Python level.

In the nbodykit failure, was the pair counting successful? That is, did the Python code return the correct value then crash at a later location, or did it crash before returning the results? We free things slightly differently on success or failure. A minimal reproducer would be very helpful for figuring this out.

Also, if you have the whole valgrind log handy it might be useful. It seems things are breaking in one part of the code from errors in another part, since in the second log the crash is in DDsmu but the alloc/frees are in DDrppi.

rainwoodman commented 6 years ago

The Corrfunc test cases ran successfully, and the crash came much later. I am still working on pinning this down, as I have not been able to reproduce it in an interactive shell yet. Given that the second log involves memory from two modules, I think this is more likely a false positive.

manodeep commented 6 years ago

Yeah, this is very strange: the log references both DDsmu and DDrppi, and gridlink shows up in the Python sections. @rainwoodman If you can work out a minimal failing example, that would be very helpful. (On a related note, I tried to make a template for creating an issue -- @rainwoodman did the template show up for you?)

For the return value to Python from each pair-counter, every item in the returned Python list is created with Py_BuildValue and then appended to the returned list with PyList_Append.

rainwoodman commented 6 years ago

My crash turned out to be due to OOM, so there was no crash caused by memory corruption.

I only know how to trigger this from within nbodykit:

conda activate 2.7
conda install -c bccp nbodykit runtests nose python==2.7
git clone git@github.com:bccp/nbodykit
cd nbodykit
valgrind --log-file=log python run-tests.py nbodykit/algorithms/pair_counters/tests/test_1d.py --verbose --single

It should give you a log file like this.

log.txt

The error seems to come from test_bad_los, though when I isolate that test and run it by itself, valgrind doesn't report any errors. The horde of warnings before the error also suggests it is a false positive.

manodeep commented 6 years ago

@rainwoodman Will you please add the valgrind flags --leak-check=full --track-origins=yes and then attach that log file? Maybe that will shed some more light.

If this is a valgrind issue, perhaps we can report that to the valgrind-devs.
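
For reference, here is a minimal sketch of assembling that valgrind invocation from Python. The test path and the run-tests.py arguments are copied from the nbodykit command earlier in the thread; the helper name `valgrind_command` and the log file name `log2` are hypothetical. Nothing is executed here; the list would have to be passed to `subprocess.run` separately.

```python
import shlex

def valgrind_command(test_path, log_file="log2"):
    """Build the valgrind command line as an argument list (nothing is run here)."""
    return [
        "valgrind",
        "--leak-check=full",      # show details for each individual leaked block
        "--track-origins=yes",    # track where uninitialised values came from
        "--log-file=" + log_file, # hypothetical log file name
        "python", "run-tests.py", test_path, "--verbose", "--single",
    ]

cmd = valgrind_command("nbodykit/algorithms/pair_counters/tests/test_1d.py")
# Print a copy-pasteable shell command.
print(" ".join(shlex.quote(part) for part in cmd))
```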

rainwoodman commented 6 years ago

Here is the full record. Looks like there is a leak.

log2.txt

manodeep commented 6 years ago

@rainwoodman Sorry, I haven't had any time to look at this issue. Have you had any further insights by any chance?

rainwoodman commented 6 years ago

Nope. Except it is still pointing to leaks in countpairs_double, like here:

==14377== 88 bytes in 1 blocks are definitely lost in loss record 1,894 of 3,945
==14377==    at 0x4C2EA1E: calloc (vg_replace_malloc.c:711)
==14377==    by 0x3922ECF5: my_calloc (utils.c:511)
==14377==    by 0x392308CD: setup_bins (utils.c:69)
==14377==    by 0x39213118: countpairs_double (countpairs_impl_double.c:191)
==14377==    by 0x39204930: countpairs_countpairs (???:2169)
==14377==    by 0x4F1A20F: PyEval_EvalFrameEx (in /home/yfeng1/anaconda3/install/envs/2.7/lib/libpython2.7.so.1.0)
==14377==    by 0x4F1D4E8: PyEval_EvalCodeEx (in /home/yfeng1/anaconda3/install/envs/2.7/lib/libpython2.7.so.1.0)
==14377==    by 0x4EA6376: function_call (in /home/yfeng1/anaconda3/install/envs/2.7/lib/libpython2.7.so.1.0)
==14377==    by 0x4E817A2: PyObject_Call (in /home/yfeng1/anaconda3/install/envs/2.7/lib/libpython2.7.so.1.0)
==14377==    by 0x4F164BD: PyEval_EvalFrameEx (in /home/yfeng1/anaconda3/install/envs/2.7/lib/libpython2.7.so.1.0)
==14377==    by 0x4F1BF9D: PyEval_EvalFrameEx (in /home/yfeng1/anaconda3/install/envs/2.7/lib/libpython2.7.so.1.0)
==14377==    by 0x4F1D4E8: PyEval_EvalCodeEx (in /home/yfeng1/anaconda3/install/envs/2.7/lib/libpython2.7.so.1.0)
==14377==

manodeep commented 6 years ago

If the code executes normally, then this line should free rupp. However, if the code returns EXIT_FAILURE, then there might be code-paths where rupp is not freed.

Is Corrfunc.theory.DD returning normally?
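
One way to check this from the Python side is to wrap the call and record whether it returned or raised. This is only a sketch: `call_and_report` is a hypothetical helper, and the two stand-in functions replace the real `Corrfunc.theory.DD` call so that the snippet runs without Corrfunc installed.

```python
import traceback

def call_and_report(func, *args, **kwargs):
    """Call func and return (returned_normally, result_or_exception)."""
    try:
        result = func(*args, **kwargs)
    except Exception as exc:
        # Print the traceback so the failing call is visible in the test log.
        traceback.print_exc()
        return False, exc
    return True, result

# Hypothetical stand-ins for a pair counter that succeeds and one that fails.
def fake_counter_ok():
    return ["placeholder result"]

def fake_counter_fail():
    raise RuntimeError("pair counting failed")

ok, value = call_and_report(fake_counter_ok)
failed_ok, error = call_and_report(fake_counter_fail)
print(ok, failed_ok)
```

With Corrfunc installed, `func` would be `Corrfunc.theory.DD`, called with whatever arguments the nbodykit test passes.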

rainwoodman commented 6 years ago

I think it is not returning normally when this is triggered.

manodeep commented 6 years ago

I also see that there is a potential memory leak from gridlink.

@rainwoodman Will you please confirm that you do not see these memory leaks when directly running from the command-line without invoking python?

rainwoodman commented 6 years ago

I am not sure how -- these were triggered inside nbodykit, where we have a test case that fails. Is there a test case that I can follow for the CLI?

manodeep commented 6 years ago

You have to create an ASCII file with the datasets in question and then run the command-line executable under valgrind. If you are using double precision, then set DOUBLE_PREC in ROOTDIR/theory.options and then re-compile with make. That should give you the executable DD within theory/DD. Running DD without any arguments should tell you how to run the CLI.
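
A minimal sketch of producing such an ASCII file with the Python standard library follows. It assumes the reader accepts whitespace-separated `x y z` columns, one particle per line. That layout, the file name `particles.txt`, and the helpers `write_ascii_catalog` and `random_points` are assumptions for illustration, so check the usage message printed by DD for the format it actually expects. The random points only stand in for the real dataset that triggers the problem.

```python
import os
import random
import tempfile

def write_ascii_catalog(path, points):
    """Write one particle per line as whitespace-separated x y z columns."""
    with open(path, "w") as handle:
        for x, y, z in points:
            # 17 significant digits are enough to round-trip an IEEE double exactly.
            handle.write("{0:.17g} {1:.17g} {2:.17g}\n".format(x, y, z))

def random_points(n, boxsize, seed=0):
    """Generate n uniformly distributed points in a cube of side boxsize."""
    rng = random.Random(seed)
    return [tuple(rng.uniform(0.0, boxsize) for _ in range(3)) for _ in range(n)]

# Placeholder data: replace with the positions used by the failing test.
points = random_points(1000, boxsize=100.0)
catalog_path = os.path.join(tempfile.gettempdir(), "particles.txt")
write_ascii_catalog(catalog_path, points)
print("wrote", len(points), "particles to", catalog_path)
```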

manodeep commented 5 years ago

@rainwoodman We have fixed #181 in Corrfunc v2.3.0 - will you please check if this issue is fixed?

rainwoodman commented 5 years ago

Yay! This definitely looks relevant. I'll simply push out a new version to our channel.

manodeep commented 5 years ago

@rainwoodman Did you get a chance to confirm if the memory leak went away with v2.3?

manodeep commented 4 years ago

@rainwoodman I am closing this issue. If the memory leak still occurs, please feel free to re-open.