mspass-team / mspass

Massive Parallel Analysis System for Seismologists
https://mspass.org
BSD 3-Clause "New" or "Revised" License

Memory Leaks #80

Open pavlis opened 4 years ago

pavlis commented 4 years ago

I knew this one was coming and we finally encountered it. As any experienced reader knows, one of the dark sides of any C++ binding to Python is memory management. The two languages treat the problem very differently, so all bindings have to cleanly separate C++ memory management from Python memory management.

I encountered a memory management problem in the code below that I eventually fixed, but it illuminated some issues we should be conscious of and work on as we progress in this project.

This block of code is a miniseed data file reader driven by a document created when the file was indexed earlier. I'll just dump the full code here and then describe the problem I'm seeing:

# Imports needed to run this snippet.  The mspasspy module paths are an
# assumption based on the current package layout and may differ by version;
# load_md is a small helper defined elsewhere in this prototype script.
from obspy import read
from mspasspy.ccore.seismic import TimeSeriesEnsemble
from mspasspy.ccore.utility import (MsPASSError, ErrorSeverity,
                                    ProcessingHistory, AtomicType)
from mspasspy.util.converter import Trace2TimeSeries

def load_one_ensemble(doc=None,
                  collection="seed_data.ensemble",
                  create_history=False,
                  jobname='Default job',
                  jobid='99999',
                  algid='99999',
                  ensemble_mdkeys=[],
                  verbose=False):
    """
    Version 2 prototype.   This one loads one ensemble from a single
    document retrieved from MongoDB passed as the doc argument.
    It has required ensemble metadata and ensemble_mdkeys is now optional
    additional metadata to load.
    """
    base_error_message='load_one_ensemble: '
    try:
        if doc is None:
            raise MsPASSError(base_error_message+'missing required argument=doc.  see help(load_one_ensemble)',
                              ErrorSeverity.Fatal)
        if create_history:
            his=ProcessingHistory(jobname,jobid)
            # use the objectid string as the uuid for the origin definition
            # of all data in this ensemble
            history_uuid=str(doc['_id'])
            # all TimeSeries members of this ensemble will get a copy
            # of this top level history definition.  Correct since they
            # all come from a single file defined by the ObjectId
            his.set_as_origin('load_ensemble',algid,history_uuid,
                      AtomicType.TIMESERIES,True)
        form=doc['format']
        mover=doc['mover']
        if form!='mseed':
            raise MsPASSError("Cannot handle this ensemble - ensemble format="+form+"\nCan only be mseed for this reader")
        if mover!='obspy_seed_ensemble_reader':
            raise MsPASSError("Cannot handle this ensemble - ensemble mover parameter="+mover+" which is not supported")
        dir=doc['dir']
        dfile=doc['dfile']
        fname=dir+"/"+dfile
        # Note this algorithm actually should work with any format
        # supported by obspy's read function - should generalize it for release
        dseis=read(fname,format='mseed')
        # We create a temporary set container as a simple way to avoid
        # duplicate keys when we add required keys
        if len(ensemble_mdkeys)==0:
            stmp=set()
        else:
            stmp=set(ensemble_mdkeys)
        stmp.add('starttime')
        stmp.add('endtime')
        ensemblemd=load_md(doc,list(stmp))
        # There is a Stream2TimeSeriesEnsemble function
        # but we don't use it here because we need some functionality
        # not found in that simple function
        nseis=len(dseis)
        result=TimeSeriesEnsemble(ensemblemd,nseis)
        # Secondary files get handled almost the same except for
        # a warning.   The warning message (hopefully) explains the
        # problem but our documentation must warn about this if this
        # prototype algorithm becomes the release version
        count=0
        for d in dseis:
            count+=1
            dts=Trace2TimeSeries(d)
            if create_history:
                dts.load_history(his)
            result.member.append(dts)

        return result
    except MsPASSError as merr:
        print(merr)
        return None

I'm trying to read and process these rather large miniseed files I produced from obspy's downloader. Each file is about a gigabyte in size. When I run this function in a simple loop like this:

for doc in cursor:
    print('working on ensemble number ',n)
    ens=load_one_ensemble(doc,
                  ensemble_mdkeys=['source_id','starttime','endtime'])

I can watch it eat up memory with the system monitor because each chunk is so huge. It seems the large block of memory in ens is not being released. I am pretty sure this is happening because ens is just a pointer to a block of memory defined, in this case, by the TimeSeriesEnsemble returned by the load_one_ensemble function. I may not understand Python garbage collection correctly, but it doesn't work here quite the way I thought. After a minor change where I added del ens at the bottom of this loop, the memory problem went away.
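For reference, here is a sketch of the loop with that change applied (the n counter initialization is added here only to make the fragment self-contained):

n=0
for doc in cursor:
    n+=1
    print('working on ensemble number ',n)
    ens=load_one_ensemble(doc,
                  ensemble_mdkeys=['source_id','starttime','endtime'])
    # ... process ens here ...
    # dropping the only reference lets the large block held by the
    # ensemble be reclaimed before the next iteration allocates another
    del ens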

There are two points I think follow from this:

  1. At a minimum we will need to have a section in the tutorials advising that, when working on many objects in a loop, and especially when the loop spans a large range, users should avoid constructs like the above and/or use del as necessary to manage memory manually (as in the sketch above).
  2. This is a reminder that as we do enhanced testing in the months ahead we need to design tests for memory leaks in the pybind11-linked C++ code.
wangyinz commented 4 years ago

If you are able to reclaim the memory with the del call, there shouldn't be a memory leak. You might just be seeing the behavior of Python's garbage collection. So, I am not sure it is necessary to actively call del, as that call is rarely seen in most Python code. Also, if there were a real memory leak, calling del would not fix it anyway.
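To illustrate the distinction (a minimal sketch, independent of MsPASS): CPython frees an object as soon as its reference count drops to zero, so del only matters when it removes the last name still bound to a large object; the cyclic collector is only involved when reference cycles exist.

import sys
import gc

payload = bytearray(500 * 1024 * 1024)  # ~500 MB block
print(sys.getrefcount(payload))         # 2: the name plus getrefcount's own argument reference
alias = payload                         # a second name keeps the block alive
del payload                             # block is NOT freed yet - alias still refers to it
del alias                               # last reference gone - CPython frees the block immediately
gc.collect()                            # only needed when reference cycles are involved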

I agree that we probably should be more careful about potential memory leaks. I guess the first thing we need to make sure of is that there are no memory leaks on the C++ side. This should be easier to address as long as we have a suite of test cases. It will be trickier to detect memory leaks from the Python side. I don't know any tools specialized for that purpose, but I guess any standard memory debugging tool should work. There is not much difference between attaching the debugger to a pure C++ program and attaching it to a Python program that calls a C++ library under the hood. I just did a test with valgrind, and it behaves normally with our test_ccore code. I assume we can just use that to detect memory leaks. From the test code I ran, valgrind did not find any leaks in its "definitely lost" and "indirectly lost" categories. I guess that is a good sign.

pavlis commented 4 years ago

There are plenty of ways to test for memory leaks on the pure C++ side, from valgrind to more sophisticated debugging tools. I actually don't suspect that is a large issue, but there could be some small leaks. Having written most of the C++ code to this point, I will state that I learned some basic ways of writing C++ code that make memory leaks rare to nonexistent. Basically, the point is to avoid raw pointers if at all possible and to avoid error paths that leave files unclosed.

What I'm more concerned about here is the impedance mismatch between memory managed by C++ and by Python, and whether our pybind11 bindings handle that correctly. As I am sure you know, pybind11 has features for handling this issue and they talk about it a lot in the documentation. The safest thing is to always make a deep copy going to and from the Python side, but I don't think that always happens, and sometimes you really don't want it to happen for efficiency reasons. I guess the point is there are likely many subtle ways to create memory leaks or mysterious behavior in ccore functions because of the impedance mismatch in memory management that pybind11 has to try to handle. We have to keep this in mind as we develop a comprehensive test suite.

wangyinz commented 4 years ago

Yep, I share the same concern, especially since I wrote some of that pybind11 code that interacts with the raw C Python interface - it is pretty easy to screw up. I believe those parts were implemented correctly, as I carefully tested the reference counters to ensure everything is correctly managed by the garbage collector. The pure pybind11 code is of less concern to me since, if there is a memory leak there, it should be considered a pybind11 bug and we can (reasonably) assume it will be caught and resolved by the pybind11 users and developers. All we have to do is keep the package up-to-date. I guess for now we only need to take care of the lines that involve the raw C Python interface. I probably should think of how to include a test for memory leaks in test_ccore.py. That should be good enough for the current stage of development.
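One possible shape for such a test (just a sketch, not actual test_ccore.py code: it assumes psutil is available, and the constructor passed in is a hypothetical placeholder for whatever ccore object we want to exercise): build and release the same object many times and assert that the process resident set size does not keep growing.

import gc
import psutil

def assert_no_rss_growth(make_object, n_iterations=1000, tolerance_mb=50):
    """Construct and release an object repeatedly and check that the
    process resident set size does not keep growing.  Crude, but it will
    catch a leak that scales with the size of the object."""
    proc = psutil.Process()
    gc.collect()
    rss_before = proc.memory_info().rss
    for _ in range(n_iterations):
        obj = make_object()   # e.g. a lambda wrapping a ccore constructor (placeholder)
        del obj               # drop the only reference each pass
    gc.collect()
    rss_after = proc.memory_info().rss
    growth_mb = (rss_after - rss_before) / (1024.0 * 1024.0)
    assert growth_mb < tolerance_mb, "possible leak: RSS grew by %.1f MB" % growth_mb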