salortiz / LMDB_File

Perl wrapper around OpenLDAP's LMDB

Possible memory leak #16

Closed akotlar closed 8 years ago

akotlar commented 8 years ago

Using the speedy transaction interface seems to lead to a memory leak. Iterating over millions of positions leaves gigabytes of memory resident, even when no lingering reference to the data is possible.


  my $db = $self->_getDbi($chr);
  my $dbi = $db->{dbi};

  my $txn = $db->{env}->BeginTxn(MDB_RDONLY);

  my $json;

  for my $pos ( @allPositions ) {
    # Undefined values allowed

    # zero-copy
    $txn->get($dbi, $pos, undef);
  }

  $txn->commit();

hoytech commented 8 years ago

@akotlar - I'm not dismissing the possibility of a memory leak, but I'm just wondering how you are measuring the memory usage.

The way that LMDB works is that it mmap()s your entire database and pages in the parts of it that it needs. If there is a lot of free memory on the system then these pages will probably remain in the filesystem cache and therefore will show up as a part of your process's RSS (resident set size). Generally this is nothing to worry about -- this memory is not owned by your application and the OS will reclaim it as needed.

It might be helpful to see the output of pmap $PID where $PID is your process's PID before and after running the query. This will show the size of the heap and whether or not it has grown.
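
If shelling out to pmap is awkward, something like the following Linux-only sketch (my helper name, not part of LMDB_File) can sample VmRSS from inside the process, before and after the loop:

use strict;
use warnings;

# Sketch: read the resident set size from /proc (Linux-only).
sub rss_kb {
    open my $fh, '<', "/proc/$$/status" or die "can't read /proc: $!";
    while (my $line = <$fh>) {
        return $1 if $line =~ /^VmRSS:\s+(\d+)\s+kB/;
    }
    return undef;
}

printf "RSS before: %s kB\n", rss_kb();
# ... run the read loop here ...
printf "RSS after:  %s kB\n", rss_kb();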

akotlar commented 8 years ago

I'm measuring memory by the growth of RES. The mmap size shows up in VIRT, and that is not what I'm talking about. I'll show pmap outputs as soon as possible.

This issue has caused the executing process to crash on a system with 60 GB of RAM. RES used without the

$txn->get($dbi, $pos, undef);

line is on the order of tens of megabytes, and does not grow. Will share more shortly.

I've also tried $txn->abort, as well as ->reset and ->renew; none of them affects the behavior.
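
For reference, the reset/renew variant I tried looks roughly like this (a sketch, assuming these methods mirror LMDB's mdb_txn_reset/mdb_txn_renew semantics):

my $txn = $db->{env}->BeginTxn(MDB_RDONLY);

for my $pos (@allPositions) {
    $txn->get($dbi, $pos, undef);
}

$txn->reset();   # release the read snapshot, keep the handle
# ... later, when another batch of reads is needed:
$txn->renew();   # re-acquire a snapshot on the same handle
# ... more gets, then finally:
$txn->abort();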

akotlar commented 8 years ago

Here is the code used in the query; the rest of the program is unchanged:

sub dbPatchBulkArray {
  my ( $self, $chr, $posHref, $overrideOverwrite, $mergeFunc, $deleteOverride) = @_;

  my $db = $self->_getDbi($chr);
  my $dbi = $db->{dbi};
  my $txn = $db->{env}->BeginTxn(MDB_RDONLY);

  #https://ideone.com/Y0C4tX
  my $overwrite = $overrideOverwrite || $self->overwrite;
  my $delete = $deleteOverride || $self->delete;

  # We'll use goto to get to dbPutBulk, so store this in stack
  my @allPositions = @{ xsort([keys %$posHref]) };

  for my $pos ( @allPositions ) {
    # zero-copy
    $txn->get($dbi, $pos, my $json);
  }

  $txn->abort();
}

The $txn->get line is either executed or commented out.

Ran each program for ~6 minutes. As you can see (pay attention to RES), the difference is not subtle, and run time differences are unlikely to be causative.

Commented out (#$txn->get($dbi, $pos, my $json);): pmap_without_get.pdf (screenshot, 2016-08-12, 1:36 PM)

Uncommented ($txn->get($dbi, $pos, my $json);): pmap_with_get.pdf (screenshot, 2016-08-12, 2:03 PM)

It would be tremendously useful to have a solution, because at the moment building genome databases is infeasible with anything less than ~200GB of RAM, unless the database is built sequentially, one chromosome at a time. I'm not at all convinced that this isn't user error; if so, finding out what I'm doing incorrectly would be equally helpful.

From the pmap logs, almost the entire difference in memory usage is explained by the 2nd "anon" entry.

The flags used don't affect things in the desired direction: memory usage grows similarly with MDB_NOTLS | MDB_NOMETASYNC or with no flags set.

Similarly, ReadMode(1) and ReadMode(0) both grow rapidly, with ReadMode(1) possibly growing somewhat more slowly (the above pmaps were both taken with MDB_NOTLS | MDB_NOMETASYNC and ReadMode(1)).

hoytech commented 8 years ago

Thanks for that. Based on that pmap output it does look as though the heap is growing in a leaky way. I'll see if I can reproduce it soon, and if so I'll run it through valgrind.

Maybe @salortiz will have some ideas too.

akotlar commented 8 years ago

Thanks, let me know if I can help in any other way. Reducing the leak would help my current project out tremendously.

akotlar commented 8 years ago

Any luck? Could manually closing the environment via mdb_env_close be a temporary solution? I'm not versed in Perl XS, but if you or @salortiz could enumerate possible sources of the issue, I'm happy to look there.

Also, for reproducing: I close the $txn used to open the db immediately after opening it; then, for the bulk read, I open a read-only transaction for the life of the loop and abort it after all data is read.
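
In other words, the lifecycle is roughly (a condensed sketch of what my code does; variable names as in my earlier snippets):

my $txn = $env->BeginTxn();
my $DB  = $txn->OpenDB();
my $dbi = $DB->dbi;
$txn->commit();                  # the txn used to open the db, closed immediately

my $read = $env->BeginTxn(MDB_RDONLY);
for my $pos (@allPositions) {
    $read->get($dbi, $pos, my $val);
}
$read->abort();                  # abort after all data is read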

Also, each position holds <2048 bytes, 0 overflow pages.

I need to at least make the growth asymptotic to a few GB; I'm also concerned about database integrity.

hoytech commented 8 years ago

Sorry, I haven't had a chance yet; a lot on my plate ATM. But I do have some long-running processes that use this module, so if there is a leak I'd like to solve it too...

akotlar commented 8 years ago

Below is a new function I wrote to demonstrate this and to try to isolate the environment from the module code. The environment is opened once per function call, and then presumably closed as it goes out of scope. To further isolate the LMDB_File code, I also open and close the db once per loop iteration. The behavior doesn't really change: memory usage goes up more slowly, but still linearly, up to system RAM + swap; the function is of course slower. With $txn->get commented out, RES holds steady at 40M.

Note that the database it's trying to access has 0 bytes in it. This leak seems likely due to the fundamental LMDB_File::_get call, and not to anything related to holding on to memory from the database.

My advisor and I looked at the XS code but can't read it; neither of us is familiar with C++, which is what it looks like.

We're trying to publish a paper that uses this module shortly so this issue is very urgent for us.

sub dbPatchBulkArray {
  my ( $self, $chr, $posHref, $overrideOverwrite, $mergeFunc, $deleteOverride) = @_;

  my $dbPath = $self->database_dir->child($chr);

  #create database unless dontCreate flag set
  if(!$dbPath->child($chr)->is_dir) {
    $dbPath->child($chr)->mkpath;
  }

  $dbPath = $dbPath->stringify;

  my $env = LMDB::Env->new($dbPath, {
    mapsize => 128 * 1024 * 1024 * 1024, # Plenty space, don't worry
    #maxdbs => 20, # Some databases
    mode   => 0600,
    #can't just use a ternary that outputs 0 if not read-only...
    #MDB_RDONLY can also be set per-transaction; it's just not mentioned
    #in the docs
    flags => MDB_NOTLS | MDB_NOMETASYNC,
    maxreaders => 1000,
    maxdbs => 1, # Some databases; else we get a MDB_DBS_FULL error (max db limit reached)
  });

  for my $pos ( keys %$posHref ) {
    my $txn = $env->BeginTxn();

    my $DB = $txn->OpenDB();

    # ReadMode 1 gives memory pointer for perf reasons, not safe
    $DB->ReadMode(1);

    #zero-copy
    #$txn->get($DB->dbi, $pos, undef);

    $txn->abort();
    undef $txn;
  }
}

Tested with Perl 5.24.0.

hoytech commented 8 years ago

OK I tried some quick valgrinding, and in the case of a key that doesn't exist (as in your last post I believe) it seems there is a pretty straightforward leak.

I did a get for the key ASDF 10,000 times, then POSIX::_exit(0)ed before $env was destroyed, and this was in the valgrind --leak-check=full --show-leak-kinds=all output:

==10713== 480,000 bytes in 10,000 blocks are still reachable in loss record 1,432 of 1,434
==10713==    at 0x4C2CE8E: realloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==10713==    by 0x4ECC0DF: Perl_safesysrealloc (in /usr/lib/libperl.so.5.18.2)
==10713==    by 0x4EFEDE4: Perl_sv_grow (in /usr/lib/libperl.so.5.18.2)
==10713==    by 0x4F05332: Perl_sv_catpvn_flags (in /usr/lib/libperl.so.5.18.2)
==10713==    by 0x4EF63CC: Perl_sv_vcatpvfn_flags (in /usr/lib/libperl.so.5.18.2)
==10713==    by 0x4EF7621: Perl_sv_vsetpvfn (in /usr/lib/libperl.so.5.18.2)
==10713==    by 0x4EF76A4: Perl_vnewSVpvf (in /usr/lib/libperl.so.5.18.2)
==10713==    by 0x4EF7773: Perl_newSVpvf_nocontext (in /usr/lib/libperl.so.5.18.2)
==10713==    by 0x73F5837: XS_LMDB_File__get (LMDB.xs:1038)
==10713==    by 0x4EF0C45: Perl_pp_entersub (in /usr/lib/libperl.so.5.18.2)
==10713==    by 0x4EE9265: Perl_runops_standard (in /usr/lib/libperl.so.5.18.2)
==10713==    by 0x4E81A13: perl_run (in /usr/lib/libperl.so.5.18.2)

It looks like the ProcError macro (LMDB.xs:565) is creating an SV with newSVpvf to hold the MDB error message, and this is never getting cleaned up.

I'm pretty sure this is a bug and can be fixed easily: @salortiz - any idea on what to do in this case?

@akotlar - Did you say you were also seeing a leak when the key exists? Because I didn't see any obvious leaks in the case of getting ASDF 10,000 times when it did exist (with ReadMode either 0 or 1).

akotlar commented 8 years ago

Thanks @hoytech.

Yes, in the case that the key exists it also looks like memory usage increases without bound. However, I will confirm tonight and send you pmap output. Hopefully I am incorrect; I'd be happy if part of this turns out to be a user issue.

Also, I checked against 5.24.0 and version 0.05. The issue still exists.

akotlar commented 8 years ago

By the way, the amount of memory used after reading approx 2.8e9 positions is about 1.3e11 bytes.

If we say that 5e5 bytes are used for 1e4 positions, then reading 3e9 positions should use about 1.5e11 bytes. Right in the same ballpark, if this log issue completely explains the problem.
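
(Worked out from the valgrind record above: 480,000 bytes over 10,000 gets is about 48 bytes per get, and 2.8e9 gets × 48 bytes ≈ 1.34e11 bytes, right at the observed ~1.3e11.)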

Of course, the ~130GB that those 2.8B reads generate is actually for sites that exist. I wonder if the memory for the error message is allocated on every read, regardless of whether the read generates an error. Is that possible?

hoytech commented 8 years ago

"would be happy if part of this is user issue."

IMO a leak like this would still be a bug in LMDB_File, not a user issue. Even if it's in the "cold" path it still shouldn't leak. :)

"I wonder if the memory for the error is allocated for every read, regardless of whether or not that read generates an error."

In my small test case this didn't seem to be the case, but I haven't had time to do a deep dive yet.

Is it possible to generate a stand-alone program I can run that leaks? Here's what I have so far:

use strict;
use LMDB_File qw/:all/;
mkdir("tpdb");
my $env = LMDB::Env->new("tpdb/", {
    mapsize => 48 * 1024,
    mode   => 0600,
    flags => MDB_NOTLS | MDB_NOMETASYNC,
    maxreaders => 1000,
    maxdbs => 1,
});
{
    my $txn = $env->BeginTxn();
    my $DB = $txn->OpenDB();
    $DB->put("exists", "hello");
    $txn->commit();
}

for my $pos (0..1_000_000) {
    my $txn = $env->BeginTxn();
    my $DB = $txn->OpenDB();
    $DB->ReadMode(0);
    my $val;
    eval {
        #$txn->get($DB->dbi, "doesn't exist", $val); # leaks
        $txn->get($DB->dbi, "exists", $val); # no leak
    };

    die $@ if $@ && $@ !~ /^MDB_NOTFOUND/;
    $txn->commit();
}

When the "leaks" line is uncommented I observe an obvious memory leak, but not so with the "no leak" line.

Note I reduced the mapsize since valgrind has trouble with large file mappings, but you can increase it if necessary.

salortiz commented 8 years ago

I've been busy because I'm moving home and office. Anyway, I'm pushing a fix for the confirmed leak in ProcError. @hoytech, can you give it a try and confirm?

I'll try to dig more if time permits. No promises, sorry.

hoytech commented 8 years ago

@salortiz - Yup seems to fix the leak, thanks!

salortiz commented 8 years ago

@hoytech - Good to know! Thank you. I'll wait for @akotlar's comments, fingers crossed!

akotlar commented 8 years ago

@hoytech @salortiz, you guys rock; the major leak has been completely solved. Memory growth is much slower now. Thank you very much.

I still experience ~3MB/s growth, which is more than expected, because I undef the data structures that I pass to the db insertion function immediately after insertion. I suspect either user error or another, smaller memory leak related to insertion.

I've pasted the functions used below, so that it may be clear whether this is user error. I'm testing insertion of 51M positions, with the data at each position on the order of 7-20 bytes.


## ex: call dbPatchBulkArray, after instantiating the class:
$self->dbPatchBulkArray( 'chr22', { 1000 => { 5 => [0,1,2,3] } } );

#### The db functions of interest... mutation of the input $posHref is used to reduce memory pressure at the cost of run time
sub dbPatchBulkArray {
  my ( $self, $chr, $posHref, $overrideOverwrite, $mergeFunc, $deleteOverride) = @_;

  my $db = $self->_getDbi($chr);
  my $dbi = $db->{dbi};
  my $txn = $db->{env}->BeginTxn(MDB_RDONLY);

  my $delete = $deleteOverride || $self->delete;

  # We'll use goto to get to dbPutBulk, so store this in stack
  my @allPositions = @{ xsort([keys %$posHref]) };

  for my $pos ( @allPositions ) {
    if(ref $posHref->{$pos} ne 'HASH') {
      return $self->log('fatal', "dbPatchBulkAsArray requires a 1-element hash of a hash");
    }

    my ($trackIndex, $trackValue) = %{ $posHref->{$pos} };

    if(!defined $trackIndex || ! looks_like_number($trackIndex) ) {
      $self->log('fatal', "dbPatchBulkAsArray requires a numeric trackIndex");
    }

    #zero-copy
    $txn->get($dbi, $pos, my $json);

    if($LMDB_File::last_err && $LMDB_File::last_err != MDB_NOTFOUND) {
      $self->log('fatal', "dbPatchBulk LMDB error $LMDB_File::last_err");
    }

    my $aref = defined $json ? $mp->unpack($json) : [];

    if(!$delete) {
      $aref->[$trackIndex] = $trackValue;
      $posHref->{$pos} = $aref;
    } else {
      delete $posHref->{$pos};
    }
  }

  $txn->abort();

  #reset the class error variable, dangerous
  $LMDB_File::last_err = 0;

  $self->dbPutBulk($chr, $posHref, \@allPositions);
  undef %$posHref;
  undef @allPositions;
}

sub dbPutBulk {
  my ( $self, $chr, $posHref, $passedSortedPosAref) = @_;

  my $db = $self->_getDbi($chr);
  my $dbi = $db->{dbi};
  my $txn = $db->{env}->BeginTxn();

  for my $pos ( @{ $passedSortedPosAref || xsort( [keys %{$posHref}] ) } ) {
    if(!exists $posHref->{$pos}) { next; }

    $txn->put($dbi, $pos, $mp->pack( $posHref->{$pos} ) );

    if($LMDB_File::last_err && $LMDB_File::last_err != MDB_KEYEXIST) {
      $self->log('fatal', "dbPutBulk LMDB error: $LMDB_File::last_err");
    }
  }

  $txn->commit();

  $LMDB_File::last_err = 0;
}

sub _getDbi {
  my ($self, $name, $dontCreate) = @_;

  state $dbis;
  state $envs;

  return $dbis->{$name} if defined $dbis->{$name};

  my $dbPath = $self->database_dir->child($name);

  $dbPath = $dbPath->stringify;

  $envs->{$name} = $envs->{$name} ? $envs->{$name} : LMDB::Env->new($dbPath, {
    mapsize => 128 * 1024 * 1024 * 1024, # Plenty space, don't worry
    mode   => 0600,
    flags => MDB_NOTLS | MDB_NOMETASYNC,
    maxreaders => 1000,
    maxdbs => 1
  });

  if(!$envs->{$name} ) {
    return $self->log('fatal', "Failed to open environment because $LMDB_File::last_err");
  }

  my $txn = $envs->{$name}->BeginTxn();

  my $DB = $txn->OpenDB();

  # ReadMode 1 gives memory pointer for perf reasons, not safe
  $DB->ReadMode(1);

  # Now db is open
  $txn->commit();

  if($LMDB_File::last_err) {
    return $self->log('fatal', "Failed to open database because of $LMDB_File::last_err");
  }

  $dbis->{$name} = {env => $envs->{$name}, dbi => $DB->dbi, path => $dbPath};

  return $dbis->{$name};
}

akotlar commented 8 years ago

I tested a different env/db open strategy, all else the same, and the same growth is seen (again, only if insertion occurs; if I comment out $self->dbPutBulk(...), no growth is seen).

sub dbPatchBulkArray {
  my ( $self, $chr, $posHref, $overrideOverwrite, $mergeFunc, $deleteOverride) = @_;

  my $dbPath = $self->database_dir->child($chr);

  if(!$dbPath->child($chr)->is_dir) {
    $dbPath->child($chr)->mkpath;
  }

  $dbPath = $dbPath->stringify;

  my $env = LMDB::Env->new($dbPath, {
    mapsize => 128 * 1024 * 1024 * 1024, # Plenty space, don't worry
    mode   => 0600,
    flags => MDB_NOTLS | MDB_NOMETASYNC,
    maxreaders => 1000,
    maxdbs => 1
  });

  my $txn = $env->BeginTxn();

  my $DB = $txn->OpenDB();

  # ReadMode 1 gives memory pointer for perf reasons, not safe
  $DB->ReadMode(1);

  #### all else same before
  $txn->get($DB->dbi, $pos, my $json);
  ### all else same after

hoytech commented 8 years ago

@akotlar - Glad to hear the leak is at least somewhat plugged.

Have you tried not undefing the data structures after use, but instead waiting until the end of the function? It sounds paradoxical, but incrementally freeing small bits of memory as they are no longer needed might be creating memory holes that get filled up with small allocations. When the rest of the data structure is freed, these small allocations might mean the next large allocation has to acquire additional memory from the system rather than reusing the freshly freed space.

Also, I notice you are parsing JSON. You might consider a streaming parser such as JSON::SL or JSON::Streaming::Reader to avoid building up the entire object in memory at once.

A more drastic approach would be to re-architect your system so that the main program forks off a reader process to open the DB, parse the JSON, do any necessary pre-processing/filtering, and then stream the results back over a pipe to your main program, which also opens the DB and does the insertions. The reader process could then exit, guaranteeing all memory is released. (LMDB is multi-process safe, but I wouldn't fork() an environment if I were you.)

Finally, if you suspect that there may be a memory leak in your own program, there are modules to examine the Perl heap, such as Devel::Gladiator and Devel::DumpSizes. You might want to diff their output before and after running dbPatchBulkArray to see if anything unexpected is persisting.
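
A minimal sketch of that diffing approach, using Devel::Gladiator's arena_ref_counts() (the call under test is a placeholder for your own code):

use Devel::Gladiator qw(arena_ref_counts);

my $before = arena_ref_counts();            # hashref: SV type => live count
$self->dbPatchBulkArray($chr, $posHref);    # placeholder: the code under test
my $after  = arena_ref_counts();

# Report any SV types whose live count grew across the call.
for my $type (sort keys %$after) {
    my $delta = $after->{$type} - ($before->{$type} // 0);
    print "$type grew by $delta\n" if $delta > 0;
}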

Hope this helps!

akotlar commented 8 years ago

@hoytech: Great advice. I'll look into this further to give a more definitive answer as to whether my application layer is responsible.

Should we close this issue in the meantime, since @salortiz's commit completely fixed the ->get() operation leak? If I can confirm that there is a leak during put operations that is unrelated to serialization, we can either re-open this or file a new issue.

hoytech commented 8 years ago

For tracking down memory leaks, I also forgot to mention Devel::FindRef -- this module is extremely useful for figuring out why perl objects aren't being freed.
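
For example, a one-liner sketch ($posHref standing in for whatever structure you expected to be freed):

use Devel::FindRef;

# Prints the chain of references keeping the value alive.
print Devel::FindRef::track $posHref;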

hoytech commented 8 years ago

@akotlar -- If you believe the leak has been solved on your end, feel free to close this issue. You can always open a new one if you find another leak.

BTW: Thanks for bringing this leak to our attention.

akotlar commented 8 years ago

Thanks @hoytech, @salortiz. This has been a tremendous help.