leginon-org / leginon-redmine-archive

1 stars 0 forks source link

spider consuming all the resources #1482

Open leginonbot opened 8 months ago

leginonbot commented 8 months ago

Author Name: Scott Stagg (Scott Stagg) Original Redmine Issue: 1482, https://emg.nysbc.org/redmine/issues/1482 Original Date: 2011-11-23


Spider processes that are launched from appion are are expanding to take up all of the processors on the node from which they are launched. This is problematic for two reasons. First, it takes up all the processing capabilities, and second it actually slows spider down significantly when the process it is working on isn't very parallelizable. The two places I've noticed this behavior are 1) when it is filtering an image during picking and 2) when doing an alignment of particles to a template. The behavior is the same for both spider 15 and 19.

leginonbot commented 8 months ago

Original Redmine Comment Author Name: Scott Stagg (Scott Stagg) Original Date: 2012-04-04T12:51:03Z


OK. I figured this out, but don't have time to fix it properly. The problem is that the spider functions which are passed through spyder have an nproc attribute, and nproc is being set to apParam.getNumProcessors() instead of a reasonable value. See the below snippet from apSpider/backproject.py:

def backprojectCG(stackfile, eulerdocfile, volfile, numpart, pixrad, dataext=".spi"):
    """
    inputs:
        stack, in spider format
        eulerdocfile
    outputs:
        volume
    """
    ### setup
    starttime = time.time()
    stackfile = spyder.fileFilter(stackfile)
    eulerdocfile = spyder.fileFilter(eulerdocfile)
    volfile = spyder.fileFilter(volfile)
    if not os.path.isfile(stackfile+dataext):
        apDisplay.printError("stack file not found: "+stackfile+dataext)
    if not os.path.isfile(eulerdocfile+dataext):
        apDisplay.printError("euler doc file not found: "+eulerdocfile+dataext)
    apFile.removeFile(volfile+dataext)
#   nproc = apParam.getNumProcessors()
    nproc = 2
    mySpider = spyder.SpiderSession(dataext=dataext, logo=True, nproc=nproc, log=False)
    mySpider.toSpider("BP CG", 

You can see my workaround where I arbitrarily set nproc to 2, and it solved my problem. There are several places in apSpider where the same problem exists (alignment.py, backproject.py, filters.py, and maybe others). The way it is currently implemented is problematic for many reasons. 1) You cannot specify how many processors to use. 2) It only works for systems with small numbers of processors. I think y'all are not seeing the problem as I am because you are using a resource manager (PBS) which limits the resources that a given process consumes. I upgraded the priority for this bug because I think it is critical that y'all get this worked out before the next release because not everyone will be using a resource manager. It is a big problem for me here because we are not using a resource manager and our processing boxes have 16 processors each. The processes that use spider eat up all 16 procs. The reason I'm not just fixing it myself is that it's going to require a lot of changes. For instance, all the programs that use the apSpider functions will need nproc command line arguments and corresponding webpage options. Alternatively, we should just set nproc to 1 and lose threading for the next release.

Thoughts?

leginonbot commented 8 months ago

Original Redmine Comment Author Name: Gabriel Lander (@gclander) Original Date: 2012-04-04T13:32:07Z


I think we need to figure out which functions don't work well with parallelization, and specifically set the "nproc" for those when we start the spider session. So for Scott's example, it would be:

mySpider = spyder.SpiderSession(dataext=dataext, logo=True, nproc=1, log=False)

Whereas functions that are parallelizable would be use the nproc=apParam.getNumProcessors()

trouble is, how do we figure out which functions are parallelized?

leginonbot commented 8 months ago

Original Redmine Comment Author Name: Scott Stagg (Scott Stagg) Original Date: 2012-04-04T16:05:25Z


I agree that we should determine which functions do and don't work well with parallelization, but I think we really absolutely have to have users specify the number of processors requested for a run. Otherwise you end up in the situation like I'm having where one job that doesn't need it eats up all the processors on a machine. For a while here it was using up 16 processors just to do a high pass filter for a micrograph. Think of the situation there are even more cores on one machine . . . 32 processors to do a high pass filter.

leginonbot commented 8 months ago

Original Redmine Comment Author Name: Anchi Cheng (@anchi2c) Original Date: 2012-04-05T01:10:46Z


Scott,

A point well made. I can think of several ways to patch this in the global settings of appionScript level. Give me a day to discuss with Jim and Amber before writing it in.

However, if it is set at appionScript, would it be enough for you now? some spyder function might need less than globally specified by appionScript, would that be a show stopper for you still? As Gabe said, a dictionary will be needed for each spyder call for the processor requirement. to have the internal job management as you are doing.

leginonbot commented 8 months ago

Original Redmine Comment Author Name: Scott Stagg (Scott Stagg) Original Date: 2012-04-05T18:31:49Z


Anchi, I think I see where you're going. Say, for instance you're doing a refinement and you want to use 64 processors for projection matching but you might not want to use all those processors for backprojection. It gets even more complicated. You might want to use MPI for projection matching but threading for backprojection.

I think setting nproc globally as an option at the appionScript level would be fine for now. You can leave it up to the user to be knowledgeable enough not to request an unreasonable number of procs.

BTW, I hardcoded nproc=2 for all occurances in backproject.py, and the time it takes to create an RCT volume went from >3 hours (using 16 procs) to ~20 minutes.

leginonbot commented 8 months ago

Original Redmine Comment Author Name: Anchi Cheng (@anchi2c) Original Date: 2012-04-10T01:10:53Z


r16466 and r16468 makes nproc a global option that can be also passed from apGenericJob used on qsub type of running.

Scott, please see if this idea would work for you. I have not changed all the spider functions yet. If you want to start on it, you are welcome to do it since I don't know which function should be changed right away. Unfortunately, I can't just change apParam.py to use self.params['nproc'] in getNumProcessors(). Wish we have done it in an object now....

Amber promised to do the web gui side once we have the python side ready.

leginonbot commented 8 months ago

Original Redmine Comment Author Name: Amber Herold (Amber Herold) Original Date: 2012-04-10T15:02:03Z


Anchi, I made an indentation syntax error fix in apAgent and also added a check in apGenericJob to avoid a NoneType + string error. I didn't look too closely why it is happening now, perhaps I need to add a new option to my commands? This is just a fix for now so I can run some things today, but please take a look at it to see if it should be removed once we have completed work on this issue.

leginonbot commented 8 months ago

Original Redmine Comment Author Name: Scott Stagg (Scott Stagg) Original Date: 2012-04-10T17:11:02Z


I can make the changes to the spider functions, but I'm not exactly sure when I'll get to it. I definitely won't do it until I am able to update to using the web to launch jobs that use the resource manager (qsub, msub, etc). Amber, when I visited, you said that I would be able to do this update relatively soon. Any ETA on this?

leginonbot commented 8 months ago

Original Redmine Comment Author Name: Anchi Cheng (@anchi2c) Original Date: 2012-04-10T17:25:04Z


fixed the None problem in r16471 but found that if I do pass nproc according to nodes and ppn, appionScript will end up with those nproc values instead of their individual default, so I commented that part out in r16472 so not to break anything.

We will need to figure out how to sync the two sides some time so that we can truly use this function.

Also found the source of indentation error, Christopher uses spaces but I use tab.

You can assign this back to me now after review this change.

leginonbot commented 8 months ago

Original Redmine Comment Author Name: Amber Herold (Amber Herold) Original Date: 2012-04-10T19:51:26Z


Scott, just a few more bugs to get to and tests to run before we release. I missed almost 2 weeks out sick, so I'm bit behind where I had hoped to be. Should be within the next 2 weeks.

leginonbot commented 8 months ago

Original Redmine Comment Author Name: Anchi Cheng (@anchi2c) Original Date: 2012-04-11T19:23:55Z


Scott,

Talked to Jim a bit about this. Is it possible that the slow down by using more processors due to the memory allocation? Do you know if you assign more processors, each one was allowed (by spider?) to use not as much memory on you computer? Could it be that when you limit the nproc to 2 on this computer and because other processors are not used by another user, it gets to use all memory so that you get the big speed up? I got around large eman refinement on garibaldi before by doing something similar where I reserved, say, 10 nodes each at 8 ppn with memory restriction of 47gb (max on the nodes) and then in eman refine, I only asked for nproc=20.

Just trying to understand it better.

leginonbot commented 8 months ago

Original Redmine Comment Author Name: Scott Stagg (Scott Stagg) Original Date: 2012-04-11T23:18:22Z


No, the slow down is definitely not due to memory allocation. There are no competing jobs when I am running the spider commands, and there is plenty of available memory. I'm pretty sure it's due to the way that spider is doing the threading. In my mind it's analogous to something like washing a single dish. It is very efficient for 1 person to wash a single dish, but very inefficient for 16 people to wash a single dish all at the same time.

leginonbot commented 8 months ago

Original Redmine Comment Author Name: Amber Herold (Amber Herold) Original Date: 2012-04-24T20:08:04Z


Anchi, just reviewed this.

regarding this:

        if has_nproc and int(self.getNProc()) > nodeppn_nproc:
            raise Exception('You can not specify more nproc than nodes * ppn')
        else:
            # nproc need to be set according to the input nodes and ppn if not already in the command line
            self.setNProc(nodeppn_nproc)
            # We will not insert --nproc now so that those script that need different default can use
            # its own default.
            #if self.getNProc():
            #   newCommandLine.append('--nproc=%d' % (int(self.getNProc())))
        return newCommandLine

When we get to this conditional block, if nproc has been passed as an option, it has already been appended. An invalid passed in nproc will raise an exception here, but it looks like if the nproc is valid, once the comments are removed, nproc will be added again but the value will be nodeppn_nproc rather than the passed in value. This does not seem desirable.

leginonbot commented 8 months ago

Original Redmine Comment Author Name: Anchi Cheng (@anchi2c) Original Date: 2012-04-25T01:22:13Z


has_nproc will be true if --nproc is already in the command and it will not go to "else" part. Therefore, nothing will be appended.

leginonbot commented 8 months ago

Original Redmine Comment Author Name: Amber Herold (Amber Herold) Original Date: 2012-04-25T03:23:19Z


Yes, the has_nproc will be true, but the second part of the conditional "int(self.getNProc()) > nodeppn_nproc" must also be true for the else to not execute.

leginonbot commented 8 months ago

Original Redmine Comment Author Name: Anchi Cheng (@anchi2c) Original Date: 2012-04-25T15:57:32Z


r16555 should fix this logic once comments are taken off in the future.

leginonbot commented 8 months ago

Original Redmine Comment Author Name: Amber Herold (Amber Herold) Original Date: 2012-04-25T17:05:45Z


Perfecto.

leginonbot commented 8 months ago

Original Redmine Comment Author Name: Amber Herold (Amber Herold) Original Date: 2014-08-13T20:58:37Z


For some reason, there is no option to close this bug. Moving the priority to low.