ssbc / ssb-db2

A new database for secure-scuttlebutt
47 stars 8 forks source link

Memory leak with v2.6.0 #264

Closed staltz closed 2 years ago

staltz commented 3 years ago

I tried to use ssb-db2 2.6.0 in Manyverse and experienced a consistent memory leak that ends up using all the memory and crashing the app.

I then pinned the version to 2.3.1 and no memory leak happened.

I'm not sure where exactly is the culprit, it could be related to bendy-butt additions (in version 2.4.0 and above) and/or transactions, or maybe it's caused by one of the dependencies (async-append-only-log or jitdb or something else). Update: actually when setting the ssb-db2 to 2.3.1, it also updated async-append-only-log to 3.1.0 (that's the latest) and jitdb to 3.4.0 (latest) and the mem leak didn't happen, so it has to be something about ssb-db2 specifically.

arj03 commented 3 years ago

It can't be the transaction support as that is a new function that you don't use yet. So must be the bendy butt PR. Was this related to or was it just normal usage of manyverse?

staltz commented 3 years ago

I think it waa unrelated to that issue, and it was just normal usage of Manyverse. Just opened the app and without doing anything, in about 15sec it would crash.

Could it be something about scanning the log and rebuilding the author index? (Note the prefixOffset hack we did)

arj03 commented 3 years ago

Could be yes

staltz commented 2 years ago

I went back to study this issue, and the symptoms are still the same with ssb-db2 2.8.3. The crash is clearly mentioning something about "out of memory" on both Android and desktop (Electron, Linux).

Here's a report from Electron DevTools:

Screenshot from 2021-12-22 13-10-45

So the big issue is unequivocally JITDB's mergeFilters. On the other hand, the previous JITDB version in Manyverse is 3.5.0 also has mergeFilters and matchAgainstPrefix but it doesn't have a memleak. Here is the git diff between jitdb 3.5.0 and 4.0.0: https://github.com/ssb-ngi-pointer/jitdb/compare/v.3.5.0..v4.0.0

The only thing that might have changed the logic is https://github.com/ssb-ngi-pointer/jitdb/compare/ssb-ngi-pointer:e004c1a...ssb-ngi-pointer:41a1b32#diff-e727e4bdf3657fd1d798edcd6b099d6e092f8573cba266154583a746bba0f346R850 which is now using bipf.compare.

That said, when I filed this issue (Oct 1), the latest version of jitdb was 3.4.0 which did not have bipf.compare in matchAgainstPrefix: https://github.com/ssb-ngi-pointer/jitdb/blob/e0744c187c193999e9c2f861ee8e6e9b547900a3/index.js#L856 and even back then there was a memleak somewhere.

staltz commented 2 years ago

I ran a couple other experiments:

Of course, with the public tab query turned off, the app is useless, AND there are typically other computations that start when the public tab renders data (such as ssb-conn is kickstarted), but I currently suspect it has something to do with this ssb-db2 query: https://github.com/ssbc/ssb-threads/blob/22170c3fde66cbdd4f7615a3b94ab2871209b0c3/src/index.ts#L188-L193

(UPDATE: I turned on ssb-conn's "start" and it did not cause a memleak crash, so this confirms further that it may be about ssb-threads publicSummary)

It logs in the terminal usually like this:

  jitdb:query all(AND(*,SEQS(1371356))): 6ms, total messages: 1 +15ms
  jitdb:query paginate(AND(*,AND(OR(value_content_root(%Xrmwkht),value_content_fork(%Xrmwkht)),OR(value_content_type( post),value_content_type(8contact))))), seq: 0, limit: 15: 1352ms, total messages: 0 +1s

  jitdb:query all(AND(*,SEQS(1371219))): 2ms, total messages: 1 +26ms
  jitdb:query paginate(AND(*,AND(OR(value_content_root(%vrwiHED),value_content_fork(%vrwiHED)),OR(value_content_type( post),value_content_type(8contact))))), seq: 0, limit: 15: 1351ms, total messages: 1 +1s

  jitdb:query all(AND(*,SEQS(1371273))): 1ms, total messages: 1 +301ms
  jitdb:query paginate(AND(*,AND(OR(value_content_root(%RadDbNO),value_content_fork(%RadDbNO)),OR(value_content_type( post),value_content_type(8contact))))), seq: 0, limit: 15: 921ms, total messages: 0 +947ms

  jitdb:query all(AND(*,SEQS(1371301))): 2ms, total messages: 1 +51ms
  jitdb:query paginate(AND(*,AND(OR(value_content_root(%nRyWqKu),value_content_fork(%nRyWqKu)),OR(value_content_type( post),value_content_type(8contact))))), seq: 0, limit: 15: 951ms, total messages: 0 +1s

  jitdb:query all(AND(*,SEQS(1371265))): 2ms, total messages: 1 +37ms
  jitdb:query paginate(AND(*,AND(OR(value_content_root(%cgs6CeR),value_content_fork(%cgs6CeR)),OR(value_content_type( post),value_content_type(8contact))))), seq: 0, limit: 15: 928ms, total messages: 0 +948ms

  jitdb:query all(AND(*,SEQS(1371264))): 1ms, total messages: 1 +108ms
  jitdb:query paginate(AND(*,AND(OR(value_content_root(%0Ulq+Lh),value_content_fork(%0Ulq+Lh)),OR(value_content_type( post),value_content_type(8contact))))), seq: 0, limit: 15: 929ms, total messages: 0 +937ms

  jitdb:query all(AND(*,SEQS(1371290))): 2ms, total messages: 1 +214ms
  jitdb:query paginate(AND(*,AND(OR(value_content_root(%oZTNlUm),value_content_fork(%oZTNlUm)),OR(value_content_type( post),value_content_type(8contact))))), seq: 0, limit: 15: 2231ms, total messages: 3 +2s

  jitdb:query all(AND(*,SEQS(1371246))): 1ms, total messages: 1 +465ms
  jitdb:query paginate(AND(*,AND(OR(value_content_root(%3QsVgTQ),value_content_fork(%3QsVgTQ)),OR(value_content_type( post),value_content_type(8contact))))), seq: 0, limit: 15: 1116ms, total messages: 0 +1s

  jitdb:query all(AND(*,SEQS(1371263))): 2ms, total messages: 1 +2s
  jitdb:query paginate(AND(*,AND(OR(value_content_root(%Lc98Jti),value_content_fork(%Lc98Jti)),OR(value_content_type( post),value_content_type(8contact))))), seq: 0, limit: 15: 878ms, total messages: 0 +925ms

  jitdb:query all(AND(*,SEQS(1370941))): 3ms, total messages: 1 +42ms
  jitdb:query paginate(AND(*,AND(OR(value_content_root(%lDlTx1g),value_content_fork(%lDlTx1g)),OR(value_content_type( post),value_content_type(8contact))))), seq: 0, limit: 15: 889ms, total messages: 2 +893ms

  jitdb:query all(AND(*,SEQS(1371057))): 3ms, total messages: 1 +3s
  jitdb:query paginate(AND(*,AND(OR(value_content_root(%XxTlFZX),value_content_fork(%XxTlFZX)),OR(value_content_type( post),value_content_type(8contact))))), seq: 0, limit: 15: 979ms, total messages: 0 +983ms

<--- Last few GCs --->

[467783:0xb880042c000]    63018 ms: Scavenge 1911.5 (1964.8) -> 1909.6 (1964.8) MB, 1.8 / 0.0 ms  (average mu = 0.265, current mu = 0.182) task 
[467783:0xb880042c000]    63038 ms: Scavenge 1911.5 (1964.8) -> 1909.6 (1964.8) MB, 1.4 / 0.0 ms  (average mu = 0.265, current mu = 0.182) allocation failure 
[467783:0xb880042c000]    63093 ms: Scavenge 1911.4 (1964.8) -> 1909.7 (1964.8) MB, 1.7 / 0.0 ms  (average mu = 0.265, current mu = 0.182) task 

<--- JS stacktrace --->

[467783:1222/162520.851176:ERROR:node_bindings.cc(143)] 
Fatal error in V8: Ineffective mark-compacts near heap limit 
Allocation failed - JavaScript heap out of memory
staltz commented 2 years ago

OMG I think I figured it out! GOSH programming is so painful sometimes.

I'm happy I made the thorough analysis above, because it helped narrow down that it may be in publicSummary and it has something to do with the new code shown in the diff between the ssb-db2 versions.

So I took a careful look at the changes to the operators, like type:

//                   | THE OPTS ARG IS NEW STUFF
//                   |
//                   v
function type(value, opts = { dedicated: true }) {
  if (opts && opts.dedicated) {
    return equal(seekType, value, {
      indexType: 'value_content_type',
    })
  } else {
    return equal(seekType, value, {
      prefix: 32,
      indexType: 'value_content_type',
    })
  }
}

And then makeFilterOperator in ssb-threads:

function makeFilterOperator(opts: FilterOpts): any {
  if (opts.allowlist) {
//                                      | THIS NASTY .map
//                                      |
//                                      v
    const allowedTypes = opts.allowlist.map(type);
    return or(...allowedTypes);
  }
  if (opts.blocklist) {
    const blockedTypes = opts.blocklist.map((x) => not(type(x)));
    return and(...blockedTypes);
  }
  return null;
}

The .map calls the type() operator with type(str, i, thisObj) and then we're passing i as the operator opts, which ends up creating a prefix operator, which is not what we want.

Turned out I just changed

-   const allowedTypes = opts.allowlist.map(type);
+   const allowedTypes = opts.allowlist.map(x => type(x));

And no more memory leak crash.

SO PAINFUL.

arj03 commented 2 years ago

Wow, nice debugging! :) That should things overall also smoother I would imagine.

staltz commented 2 years ago

Closing this issue because it's an ssb-threads issue and I solved it there already. Thanks for reading/following, I know it probably took some non-zero minutes of your attention.