leginon-org / leginon-redmine-archive


Make stack has frequent memory leaks which cause fatal errors #2295

Closed leginonbot closed 6 months ago

leginonbot commented 6 months ago

Author Name: Melody Campbell (Melody Campbell) Original Redmine Issue: 2295, https://emg.nysbc.org/redmine/issues/2295 Original Date: 2013-03-15 Original Assignee: Dmitry Lyumkis


About 50% of the stacks I have been creating lately have a memory leak fatal error and crash before they have finished. This never happened before but has happened about 4 times in the past week. I save the job file, delete the contents of the directory and resubmit the exact same job file, and this will go to completion. Below is the error. Thanks!

Traceback (most recent call last):
  File "/opt/myamisnap/bin/makestack2.py", line 1182, in <module>
    makeStack.run()
  File "/opt/myamisnap/lib/appionlib/appionLoop2.py", line 62, in run
    if not self._startLoop(imgdata):
  File "/opt/myamisnap/lib/appionlib/appionLoop2.py", line 550, in _startLoop
    self._checkMemLeak()
  File "/opt/myamisnap/lib/appionlib/appionLoop2.py", line 661, in _checkMemLeak
    apDisplay.printError("Memory leak of "+str(round(memleak,2))+"MB")
  File "/opt/myamisnap/lib/appionlib/apDisplay.py", line 57, in printError
    raise Exception, colorString("\n *** FATAL ERROR ***\n"+text+"\n\a","red")
Exception:
 *** FATAL ERROR ***
Memory leak of 528.53MB

Fri Mar 15 14:46:41 2013: run start1
Fri Mar 15 14:46:51 2013: thread done
[melody@guppy stack-n-224box]$

leginonbot commented 6 months ago

Original Redmine Comment Author Name: Ryan Hoffman (Ryan Hoffman) Original Date: 2013-03-21T14:58:23Z


I also got this error with DoG picking, on Garibaldi. It ran to around 80% completion before dying this way. Both makestack2 and DoGpicker call appionLoop2.py:

Traceback (most recent call last):
  File "/opt/applications/myami/2.2/bin/dogPicker.py", line 78, in <module>
    imgLoop.run()
  File "/opt/applications/myami/2.2/lib/python2.6/site-packages/appionlib/appionLoop2.py", line 61, in run
    if not self._startLoop(imgdata):
  File "/opt/applications/myami/2.2/lib/python2.6/site-packages/appionlib/appionLoop2.py", line 548, in _startLoop
    self._checkMemLeak()
  File "/opt/applications/myami/2.2/lib/python2.6/site-packages/appionlib/appionLoop2.py", line 659, in _checkMemLeak
    apDisplay.printError("Memory leak of "+str(round(memleak,2))+"MB")
  File "/opt/applications/myami/2.2/lib/python2.6/site-packages/appionlib/apDisplay.py", line 57, in printError
    raise Exception, colorString("\n *** FATAL ERROR ***\n"+text+"\n\a","red")
Exception: 
 *** FATAL ERROR ***
Memory leak of 554.22MB
leginonbot commented 6 months ago

Original Redmine Comment Author Name: Amber Herold (Amber Herold) Original Date: 2013-03-21T16:35:10Z


Oops, sorry guys, I meant to look into this last week. I've put it on my calendar to do this Monday.

leginonbot commented 6 months ago

Original Redmine Comment Author Name: Amber Herold (Amber Herold) Original Date: 2013-03-25T20:24:07Z


Info from Dipa on another failed dog run:

Here are the directories for my two DoG picking runs that failed due to
memory leaks.

/ami/data15/appion/13mar21a/extract/dogrun2
/ami/data15/appion/13mar21b/extract/dogrun2

They were running at the same time, and both failed at the same time
(10:10AM yesterday morning).

I checked the system log. Dmitry was running a Frealign reconstruction during that time on the same guppy node. Also, someone ran Xmipp CL2D at 6:30 in the morning, and there was an out-of-memory event from Xmipp. Attaching a copy of the log info.

I've been trying to reproduce the memory leak error by running DoG picking or makestack on the same node as a CL2D run at the same time, but no luck so far. I also tried running at the same time as a DD makestack using start frame = 2, as Anchi suggested, to make sure her recent changes were not at fault.

It would be really helpful if people could leave their run directory intact after seeing this error and add the run directory to this issue so I can get more info! Thanks!

leginonbot commented 6 months ago

Original Redmine Comment Author Name: Ryan Hoffman (Ryan Hoffman) Original Date: 2013-03-26T14:05:28Z


Here's the directory for the crashed DoG Picker run I mentioned:

`/ami/data15/appion/13mar14d/extract/dogrun1`

leginonbot commented 6 months ago

Original Redmine Comment Author Name: Amber Herold (Amber Herold) Original Date: 2013-03-27T14:54:57Z


Everyone (Jim, Neil, Sargis - your input would be helpful): it would be useful to have more examples of this bug happening on guppy to determine the exact cause. If you encounter a memory leak, please add it to this issue, noting the run directory. Then make sure to leave the directory intact - don't use the same run name for subsequent runs. Garibaldi cases are good too, but I may need to get JC involved to examine logs.

What I can tell so far:

  1. This error occurs because Appion checks the memory stats on the processing node via /proc/meminfo. That information covers all processes on the node, not just the Appion job, so the apparent memory leak could be caused by other jobs using the same node. For the failure I could trace (Dipa's), Frealign was also running on the node at the same time. A CL2D job had also been terminated a few hours earlier by the Linux Out-Of-Memory (OOM) Killer, also during the Frealign run. Apparently, jobs being stopped by the OOM Killer is supposed to be extremely rare since virtual memory came into use.
  2. This makes me think that perhaps Frealign uses memory in a way that looks like a memory leak to the Appion function that checks for one. Neil, this was your function - lots of uncommented numbers and math. I wonder whether this function is actually finding a memory leak, or whether it was not designed for the way the node is currently being used, and this just looks like a memory leak.

    
    #=====================
    def _checkMemLeak(self):
        """
        unnecessary code for determining if the program is eating memory over time
        """
        ### Memory leak code:
        #self.stats['memlist'].append(mem.mySize()/1024)
        self.stats['memlist'].append(mem.active())
        memfree = mem.free()
        swapfree = mem.swapfree()
        minavailmem = 64*1024; # 64 MB, size of one image
        if(memfree < minavailmem):
            apDisplay.printError("Memory is low ("+str(int(memfree/1024))+"MB): there is probably a memory leak")
    
        if(self.stats['count'] > 15):
            memlist = self.stats['memlist'][-15:]
            n       = len(memlist)
    
            gain    = (memlist[n-1] - memlist[0])/1024.0
            sumx    = n*(n-1.0)/2.0
            sumxsq  = n*(n-1.0)*(2.0*n-1.0)/6.0
            sumy = 0.0; sumxy = 0.0; sumysq = 0.0
            for i in range(n):
                value  = float(memlist[i])/1024.0
                sumxy  += float(i)*value
                sumy   += value
                sumysq += value**2
            ###
            stdx  = math.sqrt(n*sumxsq - sumx**2)
            stdy  = math.sqrt(n*sumysq - sumy**2)
            rho   = float(n*sumxy - sumx*sumy)/float(stdx*stdy+1e-6)
            slope = float(n*sumxy - sumx*sumy)/float(n*sumxsq - sumx*sumx)
            memleak = rho*slope
            ###
            if(self.stats['memleak'] > 3 and slope > 20 and memleak > 512 and gain > 2048):
                apDisplay.printError("Memory leak of "+str(round(memleak,2))+"MB")
            elif(memleak > 32):
                self.stats['memleak'] += 1
                apDisplay.printWarning("substantial memory leak "+str(round(memleak,2))+"MB")
                print "(",str(n),round(slope,5),round(rho,5),round(gain,2),")"
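For reference, the heuristic above boils down to a least-squares slope and a Pearson correlation coefficient over the last 15 memory samples, with the reported "leak" being `rho*slope`. A minimal Python 3 re-implementation (function and variable names are mine, not Appion's):

```python
import math

def leak_score(memlist_kb):
    """Compute (slope, rho, score) over a window of memory samples in KB,
    mirroring the regression in _checkMemLeak above (a sketch, not the
    original code). x is the sample index, y the sample in MB."""
    n = len(memlist_kb)
    ys = [kb / 1024.0 for kb in memlist_kb]      # KB -> MB
    sumx = n * (n - 1) / 2.0                     # sum of 0..n-1
    sumxsq = n * (n - 1) * (2 * n - 1) / 6.0     # sum of squares of 0..n-1
    sumy = sum(ys)
    sumxy = sum(i * y for i, y in enumerate(ys))
    sumysq = sum(y * y for y in ys)
    stdx = math.sqrt(n * sumxsq - sumx ** 2)
    stdy = math.sqrt(n * sumysq - sumy ** 2)
    rho = (n * sumxy - sumx * sumy) / (stdx * stdy + 1e-6)
    slope = (n * sumxy - sumx * sumy) / (n * sumxsq - sumx ** 2)
    return slope, rho, rho * slope

# A steadily growing series looks like a leak: rho is ~1, so score ~ slope.
growing = [i * 100 * 1024 for i in range(15)]    # grows 100 MB per image
print(leak_score(growing))
```

Because `rho` is near 1 only when the growth is consistently linear, a single large jump in memory use (low correlation) is discounted, which is the false-positive suppression Neil describes below the thresholds.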

  3. Options to fix this:
     1. Run Frealign, Xmipp, and maybe DoG picker and makestack against a memory analyzer to see if there is actually a memory leak. For Xmipp, this might require recompiling in debug mode and installing analyzers. Jim, do you know how to analyze the Python programs for a memory leak? I have not looked into Frealign; is it written in C?
     2. Rewrite the mem.py function that reads info from /proc/meminfo to use job-specific memory data instead. Neil, thoughts on that?
     3. Comment out the memory check and rely on the OOM Killer to do its job. The downside is that there is no indication of what the error was when a job dies (without root access, anyway), which is what we currently see with Xmipp CL2D. The upside is that the OOM Killer uses a highly tuned algorithm to choose which job dies, and our DoG runs and makestack runs would likely survive.
  4. On a side note: during our cronus2 crisis meeting with JC, do I remember JC telling us that when we run CL2D we need to make sure we ask for all the processors on the node so that it has enough memory?

Any other thoughts from anyone? Any other ideas for debugging/fixing from Jim or Sargis?
Thanks!
leginonbot commented 6 months ago

Original Redmine Comment Author Name: Ryan Hoffman (Ryan Hoffman) Original Date: 2013-03-27T15:19:14Z


> On a side note, during our cronus2 crisis meeting with JC, do I remember JC telling us that when we run CL2D we need to make sure we ask for all the processors on the node to make sure it has enough memory?

This seems to help with my Garibaldi runs: if I request the maximum 47 GB per node (even when using only one core), this excludes other processes from the node, and the memory leak never seems to happen.
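The workaround described above can be expressed as a Torque/PBS resource request. A hypothetical job header (the directive names are standard Torque; the 47 GB value is Garibaldi-specific and the script name is illustrative):

```shell
#!/bin/bash
# Requesting (nearly) all of a node's memory keeps the scheduler from
# placing other jobs on the same node, so /proc/meminfo reflects only
# this job.
#PBS -l nodes=1:ppn=1
#PBS -l mem=47gb
#PBS -N dogrun

cd "$PBS_O_WORKDIR"
# illustrative command; actual arguments depend on the run
dogPicker.py --runname=dogrun1
```

The trade-off is cluster utilization: the node is effectively reserved even though only one core is used.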

leginonbot commented 6 months ago

Original Redmine Comment Author Name: Ryan Hoffman (Ryan Hoffman) Original Date: 2013-03-27T15:47:41Z


This is regarding the function _checkMemLeak.

The first message, an Appion error (`"Memory is low ("+str(int(memfree/1024))+"MB): there is probably a memory leak"`), never appears in my log files. So there is always more than 64 MB of available memory. Not a very useful finding, except that it implies mem.free() is reporting a reasonable value.

The third message, an Appion warning (`"substantial memory leak "+str(round(memleak,2))+"MB"`), appears very frequently in my log files. It appears in runs that do not crash as well as runs that do.

The second message, an Appion error (`"Memory leak of "+str(round(memleak,2))+"MB"`), only appears for the runs that actually crash. I find its conditional particularly opaque:

    if(self.stats['memleak'] > 3 and slope > 20 and memleak > 512 and gain > 2048)

Hopefully someone can discern its meaning.

I don't understand the motivation behind the definitions of `rho`, `gain`, and `slope`, but it seems that the condition `(rho*slope) > 32` frequently holds without compromising the run.

leginonbot commented 6 months ago

Original Redmine Comment Author Name: Neil Voss (@vosslab) Original Date: 2013-03-27T19:09:18Z


I agree it is a pretty crazy if statement:

A large slope implies a memory leak (note: this was a HUGE issue with sinedon bugs in 2007), but if the correlation coefficient is low, the reading is likely an outlier. So multiplying rho*slope reduced the false positives for memory leaks.

Since we have not had memory leaks in a long time, we might as well disable this functionality unless a debug flag is used.

leginonbot commented 6 months ago

Original Redmine Comment Author Name: Neil Voss (@vosslab) Original Date: 2013-03-27T19:13:15Z


I've changed the Errors into Warnings, so all runs will continue despite having memory leaks.

leginonbot commented 6 months ago

Original Redmine Comment Author Name: Amber Herold (Amber Herold) Original Date: 2013-03-27T19:39:32Z


Thanks Neil. It sounds like this function can produce false positives? I was going to change the error to a warning as well, but if there is truly a memory leak I'd like to know about it. Do you think there is anything we should still be concerned about? Do you think Frealign or CL2D may have a memory leak based on what we've seen, or do you think everything is probably working as it should be?

Ryan, the reason you see the program die for some of the messages and not all of them is that printWarning() just prints a warning, while printError() throws an exception in addition to printing the error. Yes, this is poorly named; the function should be named something like terminate() and take a termination message as a parameter.
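The distinction can be sketched like this (a paraphrase of apDisplay's behavior, not its actual code; names follow the originals):

```python
def printWarning(text):
    # Logs and returns: the calling loop keeps processing images.
    print("!!! WARNING: %s" % text)

def printError(text):
    # Logs and then raises, so the job dies unless a caller catches it.
    # As noted above, a clearer name might be terminate(message).
    raise Exception("*** FATAL ERROR ***\n%s" % text)
```

So the "substantial memory leak" path (printWarning) is survivable, while the "Memory leak of NMB" path (printError) always kills the run.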

leginonbot commented 6 months ago

Original Redmine Comment Author Name: Amber Herold (Amber Herold) Original Date: 2013-03-27T19:52:05Z


Reviewed code and moving this to In Test. If we start to see Appion jobs die for no reason (killed by OOM), we may want to revisit this.

leginonbot commented 6 months ago

Original Redmine Comment Author Name: Dmitry Lyumkis (@LyumkisLab) Original Date: 2013-04-02T12:35:22Z


It is likely that the reason for these memory leaks is that our Xmipp parallelization is not set up correctly. When Amber and I checked what was happening the last time this issue was occurring, we noticed that the node with the memory leak was running 16 instances of CL2D. I just ran CL2D and noticed that every CL2D parallel thread was being launched on guppy-11, whereas the rest of the nodes had nothing running on them. This also means that if we run on guppy, we are currently limited to 8 processors for CL2D, which is certainly not enough. It would be useful to check that our MPI implementation (/usr/lib64/openmpi/bin/mpirun) is compatible with Xmipp and is capable of spooling jobs across nodes.
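One quick way to see whether mpirun actually spreads ranks across nodes is to launch `hostname` in place of the real binary, from inside a multi-node job (a diagnostic fragment; the mpirun path is the one quoted above, and results depend on the cluster allocation):

```shell
# Count how many ranks land on each host. If every rank reports the same
# host (e.g. guppy-11), ranks are not being distributed and all 16 CL2D
# workers will pile onto one node.
/usr/lib64/openmpi/bin/mpirun -np 16 hostname | sort | uniq -c
```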

leginonbot commented 6 months ago

Original Redmine Comment Author Name: Sargis Dallakyan (@dallakyan) Original Date: 2013-04-16T21:45:19Z


The Open MPI that comes with CentOS 6 is not built with Torque support; that's why mpirun wasn't able to distribute jobs to different nodes. I found an FAQ at http://www.open-mpi.org/faq/?category=building#build-rte-tm and ran rpmbuild with openmpi-1.6.4-1.src.rpm to enable Torque support for Open MPI:

yum install torque-devel.x86_64
rpmbuild --rebuild openmpi-1.6.4-1.src.rpm --define 'configure_options --with-tm'
cd /root/rpmbuild/RPMS/x86_64
rpm -Uvh --force openmpi-1.6.4-1.x86_64.rpm
~/safadmin  "rpm -Uvh --force openmpi-1.6.4-1.x86_64.rpm" 

I ran a test job with xmipp_mpi_class_averages that Dmitry showed me, and it seems that mpirun is now capable of spooling jobs across nodes.
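For anyone verifying the rebuild: Open MPI talks to Torque through its "tm" components, and `ompi_info` will list them when the build succeeded (a diagnostic fragment; component listing format is from Open MPI 1.6-era builds):

```shell
# After installing the rebuilt RPM, the Torque (tm) components should
# appear in the component list, e.g. "MCA ras: tm" and "MCA plm: tm".
# No output here means the build still lacks Torque support.
ompi_info | grep -i " tm "
```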

If someone can independently test and close this Bug that would be great. Thank you all.

leginonbot commented 6 months ago

Original Redmine Comment Author Name: Dmitry Lyumkis (@LyumkisLab) Original Date: 2013-04-16T22:04:57Z


Just tried an independent job with Xmipp CL2D, and it is spooling jobs correctly. We should keep an eye out for other types of runs.