ssbc / ssb-db2

A new database for secure-scuttlebutt
47 stars 8 forks source link

Migrate is insanely slow on mobile #193

Closed staltz closed 3 years ago

staltz commented 3 years ago

Same codebase running on desktop and on mobile (except for the third column on the table below)

Animated GIF of desktop logs ![desktop](https://user-images.githubusercontent.com/90512/108351801-aa678c00-71ee-11eb-9c0c-e3bec97b1501.gif)
Animated GIF of mobile logs ![mobile](https://user-images.githubusercontent.com/90512/108351815-ad627c80-71ee-11eb-8177-7b1479d72360.gif)
Desktop Mobile Mobile with maxCpu=98
log.offset size 1011 MB 787 MB 787 MB
Migrate speed +10% in 7s +10% in 400s +10% in 5000s
Migrate speed 14 MB/s 197 KB/s 16 KB/s
Duration estimate 1m10s 1h6m 14h
UI responsiveness good bad good

I'm running out of options on how to approach this. I think a reasonable choice is to do "mobile with maxCpu=98", make a release of Manyverse with db1 + db2 and give users a week to let migration complete. Then, a subsequent version uses only db2, and deletes the old log when migration is done.

Thoughts @arj03 ?

mixmix commented 3 years ago

This is just the migration right? Does this cost impact initial sync? If not I would not worry too much.

Main question would be how long does this take if manyverse gets put to sleep ) :

mixmix commented 3 years ago

I would be fine with it taking an hour IF I was told that manyverse will be doing that, and to plug my phone in and set it going when I'm not needing manyverse. Bonus points for progress bar....

staltz commented 3 years ago

Yeah it's just migration and will only happen once. For new users who don't have an old database it won't happen

arj03 commented 3 years ago

I'm trying to understand why it is so slow first. Is this with any db2 indexes running while it is migrating or only migrating the log?

staltz commented 3 years ago

I tried both (and it wasn't faster), but yes, there are level and jit indexes building at the same time, on both desktop and mobile.

arj03 commented 3 years ago

I'll have a look at migrate and see if I can spot something

arj03 commented 3 years ago

On my laptop with power connected. Running migrate with default options.

Progress overhead: 2-5s

Edit: ran it a few times, the overhead is not so bad.

arj03 commented 3 years ago

Disabled base + key index: 33s. Compared to around 58s before.

arj03 commented 3 years ago

Running them afterwards is 23s

arj03 commented 3 years ago

I tried maxCpu & afterwards with maxPause as well, like this:

const SecretStack = require('secret-stack')
const caps = require('ssb-caps')
const path = require('path')

const sbot = SecretStack({
  caps,
})
  .use(require('./'))
  .use(require('./migrate'))
  .call(null, {
    path: '/home/arj/.ssb',
    keys: require('ssb-keys').loadOrCreateSync(path.join('/home/arj/.ssb', 'secret')),
    db2: {
      automigrate: true,
      maxCpu: 98,
      maxPause: 180
    }
  })

Didn't seem to change overall performance. Still around 60s with indexes.

staltz commented 3 years ago

Yeah, it might be you can only reproduce this on phones.

staltz commented 3 years ago

Strangely, migrate got even slower with the latest changes (version 1.17.1).

Without maxCpu: 3h With maxCpu=98: 21h

arj03 commented 3 years ago

So on mobile it is 180 minutes. Compared to 1 minute on my laptop. I could understand maybe 5 or 10, but 180x?

staltz commented 3 years ago

Yeah, I don't understand it

staltz commented 3 years ago

I could try other phones I have. It could be there's something weird about the Fairphone 3's disk.

staltz commented 3 years ago

Digging deeper, I put some performance measurements in the hot path of the migrate plugin:

      pull(
        oldLog.getStream({ gt: migratedOffset }),
  // pull.through to measure "get from old log" duration
        pull.map(updateMigratedSizeAndPluck),
  // pull.through to measure "update migr size" duration
        pull.map(toBIPF),
  // pull.through to measure "convert to bipf" duration
        pull.asyncMap(writeToNewLog),
  // pull.through to measure "write to new log" duration
        (drainAborter = config.db2.maxCpu
          ? drainGently(tooHotOpts(config), op, opDone)
          : pull.drain(op, opDone))
      )

And these are the logs that I got. Basically writing to disk is the bottleneck. Would be better to have nanosecond precision on those, to validate the 180x assumption. PS: in both cases there was no maxCpu set.

Click to see logs on desktop ``` get from old log: 0ms update migr size: 0ms convert to bipf: 0ms write to new log: 1ms get from old log: 0ms update migr size: 0ms convert to bipf: 0ms write to new log: 0ms get from old log: 0ms update migr size: 0ms convert to bipf: 0ms write to new log: 0ms get from old log: 0ms update migr size: 0ms convert to bipf: 0ms write to new log: 1ms get from old log: 0ms update migr size: 0ms convert to bipf: 0ms write to new log: 0ms get from old log: 0ms update migr size: 0ms convert to bipf: 0ms write to new log: 0ms get from old log: 0ms update migr size: 0ms convert to bipf: 0ms write to new log: 0ms get from old log: 0ms update migr size: 0ms convert to bipf: 0ms write to new log: 0ms get from old log: 0ms update migr size: 0ms convert to bipf: 0ms write to new log: 0ms get from old log: 1ms update migr size: 0ms convert to bipf: 0ms write to new log: 0ms get from old log: 0ms update migr size: 0ms convert to bipf: 0ms write to new log: 0ms get from old log: 0ms update migr size: 0ms convert to bipf: 0ms write to new log: 0ms get from old log: 0ms update migr size: 0ms convert to bipf: 0ms write to new log: 0ms get from old log: 0ms update migr size: 0ms convert to bipf: 0ms write to new log: 0ms get from old log: 0ms update migr size: 0ms convert to bipf: 0ms write to new log: 0ms get from old log: 0ms update migr size: 0ms convert to bipf: 1ms write to new log: 0ms get from old log: 0ms update migr size: 0ms convert to bipf: 0ms write to new log: 0ms get from old log: 0ms update migr size: 0ms convert to bipf: 0ms write to new log: 0ms get from old log: 0ms update migr size: 0ms convert to bipf: 0ms write to new log: 0ms get from old log: 0ms update migr size: 0ms convert to bipf: 0ms write to new log: 0ms get from old log: 0ms update migr size: 0ms convert to bipf: 0ms write to new log: 0ms get from old log: 0ms update migr size: 0ms convert to bipf: 0ms write to new log: 0ms get from old log: 0ms update migr size: 0ms convert to bipf: 0ms write to new log: 0ms get from old log: 0ms update migr size: 0ms convert to bipf: 0ms write to new log: 0ms get from old log: 0ms update migr size: 0ms convert to bipf: 0ms write to new log: 0ms get from old log: 0ms update migr size: 0ms convert to bipf: 0ms write to new log: 0ms get from old log: 0ms update migr size: 0ms convert to bipf: 0ms write to new log: 0ms get from old log: 0ms update migr size: 0ms convert to bipf: 0ms write to new log: 0ms get from old log: 0ms update migr size: 0ms convert to bipf: 0ms write to new log: 1ms get from old log: 0ms update migr size: 0ms convert to bipf: 0ms write to new log: 0ms get from old log: 0ms update migr size: 0ms convert to bipf: 0ms write to new log: 0ms get from old log: 0ms update migr size: 0ms convert to bipf: 0ms write to new log: 0ms get from old log: 0ms update migr size: 0ms convert to bipf: 0ms write to new log: 0ms get from old log: 0ms update migr size: 0ms convert to bipf: 0ms write to new log: 0ms get from old log: 0ms update migr size: 0ms convert to bipf: 1ms write to new log: 0ms get from old log: 0ms update migr size: 0ms convert to bipf: 0ms write to new log: 0ms get from old log: 0ms update migr size: 0ms convert to bipf: 0ms write to new log: 0ms get from old log: 0ms update migr size: 0ms convert to bipf: 0ms write to new log: 0ms get from old log: 0ms update migr size: 0ms convert to bipf: 0ms write to new log: 0ms get from old log: 0ms update migr size: 0ms convert to bipf: 0ms write to new log: 0ms get from old log: 0ms update migr size: 0ms convert to bipf: 0ms write to new log: 0ms ```
Click to see logs on mobile ``` get from old log: 1ms update migr size: 0ms convert to bipf: 0ms write to new log: 7ms get from old log: 0ms update migr size: 0ms convert to bipf: 0ms write to new log: 9ms get from old log: 1ms update migr size: 0ms convert to bipf: 0ms write to new log: 3ms get from old log: 0ms update migr size: 0ms convert to bipf: 1ms write to new log: 5ms get from old log: 0ms update migr size: 1ms convert to bipf: 0ms write to new log: 2ms get from old log: 0ms update migr size: 0ms convert to bipf: 1ms write to new log: 5ms get from old log: 0ms update migr size: 1ms convert to bipf: 0ms write to new log: 2ms get from old log: 1ms update migr size: 0ms convert to bipf: 0ms write to new log: 8ms get from old log: 0ms update migr size: 1ms convert to bipf: 0ms write to new log: 3ms get from old log: 0ms update migr size: 1ms convert to bipf: 0ms write to new log: 3ms get from old log: 1ms update migr size: 0ms convert to bipf: 0ms write to new log: 3ms get from old log: 0ms update migr size: 1ms convert to bipf: 0ms write to new log: 2ms get from old log: 1ms update migr size: 0ms convert to bipf: 0ms write to new log: 3ms get from old log: 1ms update migr size: 0ms convert to bipf: 0ms write to new log: 3ms get from old log: 1ms update migr size: 0ms convert to bipf: 0ms write to new log: 3ms get from old log: 0ms update migr size: 0ms convert to bipf: 1ms write to new log: 2ms get from old log: 0ms update migr size: 0ms convert to bipf: 1ms write to new log: 2ms get from old log: 0ms update migr size: 1ms convert to bipf: 0ms write to new log: 3ms get from old log: 0ms update migr size: 0ms convert to bipf: 1ms write to new log: 2ms get from old log: 0ms update migr size: 0ms convert to bipf: 1ms write to new log: 2ms get from old log: 0ms update migr size: 0ms convert to bipf: 1ms write to new log: 3ms get from old log: 0ms update migr size: 0ms convert to bipf: 1ms write to new log: 3ms get from old log: 0ms update migr size: 0ms convert to bipf: 0ms write to new log: 3ms ```
For reference, here's exactly how I calculated those nums ```js function throughUp(op, onEnd) { return function(read) { return function(end, cb) { if (end) onEnd && onEnd(end === true ? null : end); else op && op(); return read(end, cb); }; }; }; let start = 0 let postGet = 0 let postUpdate = 0 let postBipf = 0 let postWrite = 0 pull( oldLog.getStream({ gt: migratedOffset }), throughUp(() => { start = Date.now() }), pull.through(() => { postGet = Date.now() const dur = postGet - start console.log('get from old log: ' + dur + 'ms') }), pull.map(updateMigratedSizeAndPluck), pull.through(() => { postUpdate = Date.now() const dur = postUpdate - postGet console.log('update migr size: ' + dur + 'ms') }), pull.map(toBIPF), pull.through(() => { postBipf = Date.now() const dur = postBipf - postUpdate console.log('convert to BIPF: ' + dur + 'ms') }), pull.asyncMap(writeToNewLog), pull.through(() => { postWrite = Date.now() const dur = postWrite - postBipf console.log('write to new log: ' + dur + 'ms') console.log(' ') }), (drainAborter = config.db2.maxCpu ? drainGently(tooHotOpts(config), op, opDone) : pull.drain(op, opDone)) ```
arj03 commented 3 years ago

Can you test these 3 lines individually to see which one is slow?

      newLog.append(data, () => {})
      emitProgressEvent()
      if (dataTransferred % BLOCK_SIZE === 0) newLog.onDrain(cb)
staltz commented 3 years ago

Wow :sweat_smile: Something wrong with emitProgressEvent

Logs on mobile ``` get from old log: 0ms update migr size: 0ms convert to bipf: 1ms newLog.append: 0ms emitProgressEvent: 1ms write to new log: 1ms get from old log: 1ms update migr size: 0ms convert to bipf: 0ms newLog.append: 0ms emitProgressEvent: 1ms write to new log: 2ms get from old log: 0ms update migr size: 0ms convert to bipf: 0ms newLog.append: 0ms emitProgressEvent: 2ms write to new log: 2ms get from old log: 0ms update migr size: 0ms convert to bipf: 1ms newLog.append: 0ms emitProgressEvent: 1ms write to new log: 1ms get from old log: 1ms update migr size: 0ms convert to bipf: 0ms newLog.append: 0ms emitProgressEvent: 1ms write to new log: 1ms get from old log: 0ms update migr size: 0ms convert to bipf: 0ms newLog.append: 0ms emitProgressEvent: 1ms write to new log: 2ms get from old log: 0ms update migr size: 0ms convert to bipf: 0ms newLog.append: 1ms emitProgressEvent: 1ms write to new log: 2ms get from old log: 0ms update migr size: 0ms convert to bipf: 1ms newLog.append: 0ms emitProgressEvent: 1ms write to new log: 1ms get from old log: 1ms update migr size: 0ms convert to bipf: 0ms newLog.append: 0ms emitProgressEvent: 2ms write to new log: 2ms get from old log: 0ms update migr size: 0ms convert to bipf: 0ms newLog.append: 1ms emitProgressEvent: 1ms write to new log: 2ms get from old log: 1ms update migr size: 0ms convert to bipf: 0ms newLog.append: 1ms emitProgressEvent: 1ms write to new log: 2ms get from old log: 0ms update migr size: 0ms convert to bipf: 1ms newLog.append: 0ms emitProgressEvent: 1ms write to new log: 1ms get from old log: 0ms update migr size: 1ms convert to bipf: 0ms newLog.append: 0ms emitProgressEvent: 1ms write to new log: 1ms get from old log: 1ms update migr size: 0ms convert to bipf: 0ms newLog.append: 0ms emitProgressEvent: 2ms write to new log: 2ms get from old log: 0ms update migr size: 0ms convert to bipf: 0ms newLog.append: 1ms emitProgressEvent: 1ms write to new log: 2ms get from old log: 0ms update migr size: 0ms convert to bipf: 1ms newLog.append: 0ms emitProgressEvent: 1ms write to new log: 1ms get from old log: 1ms update migr size: 0ms convert to bipf: 0ms newLog.append: 0ms emitProgressEvent: 2ms write to new log: 2ms get from old log: 0ms update migr size: 0ms convert to bipf: 0ms newLog.append: 1ms emitProgressEvent: 1ms write to new log: 2ms get from old log: 0ms update migr size: 1ms convert to bipf: 0ms newLog.append: 0ms emitProgressEvent: 1ms write to new log: 1ms get from old log: 1ms update migr size: 0ms convert to bipf: 0ms newLog.append: 0ms emitProgressEvent: 1ms write to new log: 2ms get from old log: 0ms update migr size: 0ms convert to bipf: 0ms newLog.append: 1ms emitProgressEvent: 1ms write to new log: 2ms get from old log: 0ms update migr size: 0ms convert to bipf: 0ms newLog.append: 1ms emitProgressEvent: 1ms write to new log: 2ms get from old log: 0ms update migr size: 0ms convert to bipf: 0ms newLog.append: 1ms emitProgressEvent: 1ms write to new log: 2ms get from old log: 0ms update migr size: 1ms convert to bipf: 0ms newLog.append: 0ms emitProgressEvent: 1ms write to new log: 1ms get from old log: 0ms update migr size: 0ms convert to bipf: 0ms newLog.append: 1ms emitProgressEvent: 1ms write to new log: 2ms get from old log: 0ms update migr size: 0ms convert to bipf: 1ms newLog.append: 0ms emitProgressEvent: 1ms write to new log: 1ms get from old log: 0ms update migr size: 0ms convert to bipf: 1ms newLog.append: 0ms emitProgressEvent: 1ms write to new log: 1ms get from old log: 0ms update migr size: 0ms convert to bipf: 1ms newLog.append: 0ms emitProgressEvent: 2ms write to new log: 2ms get from old log: 0ms update migr size: 0ms convert to bipf: 0ms newLog.append: 0ms emitProgressEvent: 1ms write to new log: 2ms get from old log: 0ms update migr size: 0ms convert to bipf: 0ms newLog.append: 0ms emitProgressEvent: 1ms write to new log: 2ms get from old log: 0ms update migr size: 0ms convert to bipf: 0ms newLog.append: 0ms emitProgressEvent: 2ms write to new log: 2ms get from old log: 0ms update migr size: 0ms convert to bipf: 1ms newLog.append: 0ms emitProgressEvent: 1ms write to new log: 1ms get from old log: 0ms update migr size: 0ms convert to bipf: 0ms newLog.append: 1ms emitProgressEvent: 1ms write to new log: 2ms get from old log: 0ms update migr size: 0ms convert to bipf: 0ms newLog.append: 2ms emitProgressEvent: 1ms write to new log: 3ms get from old log: 1ms update migr size: 0ms convert to bipf: 0ms newLog.append: 0ms emitProgressEvent: 2ms write to new log: 2ms get from old log: 0ms update migr size: 0ms convert to bipf: 0ms newLog.append: 1ms emitProgressEvent: 1ms write to new log: 2ms get from old log: 1ms update migr size: 0ms convert to bipf: 0ms newLog.append: 0ms emitProgressEvent: 1ms write to new log: 2ms ```
arj03 commented 3 years ago

Nice find! Glad it was that one :smile:

staltz commented 3 years ago

By the way, it's indexing all indexes now in 5min, which is a massive improvement over the 25min I got earlier. And this is concurrently with migrate running too (in "inefficient mode" even).

I'll report later what it looks like with migrate not running.

arj03 commented 3 years ago

Wup wup, really glad we found the bugger. It couldn't hide forever :-)

staltz commented 3 years ago

I still need some way of reporting migration progress, but I'll take a performance-oriented approach where I'm running these benchmarks on the phone, making sure it's good, and only then submitting a PR that passes tests.

arj03 commented 3 years ago

Yeah, you could try just emitting progress every 1000 instead

staltz commented 3 years ago

Technically, that's already the case, but the 1000 check is deep inside emitProgressEvent(), instead, I could put the check outside of it.

staltz commented 3 years ago

Initial indexing (everything) took 6m35s without migrate running. I think that's a pretty awesome achievement compared to the many hours Manyverse with flumedb needed previously.

staltz commented 3 years ago

Awesome, full rebuild of the db2 folder (migrate and indexes) took 9m34s. No maxCpu. And strangely, UI responsiveness was good, first time I have that with maxCpu turned off. If that's consistently true, I could remove maxCpu support from the codebase, but not yet. Let it stay there just in case. I'm very happy that we have this result and not the 3h or 22h, but I'm spooked that such a simple function, emitProgressEvent(), was the culprit. Is it the event emitter logic that was the problem?

staltz commented 3 years ago

OMG this is closed.