mafintosh / hyperdb

Distributed scalable database
MIT License
752 stars 75 forks source link

[WIP] add option for sorting iterator by the last added to the db. #81

Open e-e-e opened 6 years ago

e-e-e commented 6 years ago

@mafintosh this is just a quick stab at sorting the stream by latest added node.

This is not super efficient at the moment, as for each pop it is sorting the worker's stack. This could and should be optimised by checking if a sort is needed when pushing a new value to the stack - and only sorting if its needed. I started implementing this optimisation but it requires quite a lot of changes to deal with pending. So figured its best to check with you first, to gage if this method of sorting is heading in the right direction.

e-e-e commented 6 years ago

@mafintosh so this is an optimised version that actually works. I have included two options. One built on top of your iterator and the other as an external operator. Both are laboriously slow on large datasets For example - 10,000 keys split over two feeds, with 2000 overlapping keys - takes 20s approximately on the version built on top of your iterator, and 24 seconds using my older iteration method. By comparison sorting by hash takes just 5.6s.

This is the test I was using to benchmark

tape('iterator sorted via latest (large set)', (t) => {
  const expected = []
  const db1Keys = []
  const db2Keys = []

  for (var i = 0; i < 10000; i++) {
    var key = i.toString()
    if (i < 6000) {
      db1Keys.push(key)
    }
    if (i > 4000) {
      db2Keys.push(key)
    }
    expected.push(key)
  }

  create.two(function (db1, db2, replicate) {
    console.time('iteration')
    run(
      cb => put(db1, db1Keys, cb),
      replicate,
      cb => put(db2, db2Keys, cb),
      replicate,
      cb => testIteratorOrder(t, db1.iterator({ latest: true }), expected, 'latest', cb),
      // cb => testIteratorOrder(t, db2.iterator({ latest: true }), expected, 'latest', cb),
      () => {
        console.timeEnd('iteration')
        t.end()
      }
    )
  })
})

What are your thoughts? What do you think is the best way to approach this? Are the changes needed to implement it as an option of the default iterator too complex - is it worth breaking it into its own iterator? Any thoughts on possible ways to speed it up?

e-e-e commented 6 years ago

I am slow moving on this one - but will look at rewriting these to be more inline with @mafintosh's https://gist.github.com/mafintosh/08c29ff876f0444b7270cf8335cbf650