omgnetwork / elixir-omg

OMG-Network repository of Watcher and Watcher Info
https://omg.network
Apache License 2.0

RocksDB performance tuning ideas #1660

Open unnawut opened 4 years ago

unnawut commented 4 years ago

RocksDB offers a number of performance tuning points we can try. Full guide: https://github.com/facebook/rocksdb/wiki/RocksDB-Tuning-Guide

Below are some candidates, but each needs to be evaluated for applicability to our use cases; many of them may not apply.

Collect built-in statistics
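For example (a sketch only; it assumes the erlang-rocksdb binding this repo uses exposes RocksDB DB properties via `:rocksdb.get_property/2` — verify the exact API before relying on it):

```elixir
# Sketch: reading RocksDB's built-in statistics through the erlang-rocksdb
# binding. `:rocksdb.get_property/2` and the property names are assumptions
# to be checked against the binding's docs.
{:ok, db} = :rocksdb.open(~c"/tmp/stats_demo", create_if_missing: true)

# Aggregated human-readable statistics for the whole DB.
{:ok, stats} = :rocksdb.get_property(db, "rocksdb.stats")
IO.puts(stats)

# Narrower properties are also available, e.g. the estimated key count.
{:ok, _n} = :rocksdb.get_property(db, "rocksdb.estimate-num-keys")

:ok = :rocksdb.close(db)
```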

Convert iterators on large data to snapshots

Short-lived/foreground scans are best done via an iterator while long-running/background scans are better done via a snapshot.

https://github.com/facebook/rocksdb/wiki/RocksDB-Basics#gets-iterators-and-snapshots

I'm assuming snapshots are performant because, despite the name, a snapshot does not copy the data: the underlying data structure is immutable, so taking a snapshot effectively just pins a point-in-time view, and because it's read-only it doesn't block writes.
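A sketch of what a snapshot-backed scan could look like, assuming the erlang-rocksdb API (`:rocksdb.snapshot/1`, `:rocksdb.release_snapshot/1`, and the `snapshot:` iterator option); `db` and `scan_all/3` are hypothetical stand-ins:

```elixir
# Sketch: pinning a consistent view for a long-running background scan.
{:ok, snapshot} = :rocksdb.snapshot(db)
{:ok, iterator} = :rocksdb.iterator(db, snapshot: snapshot)

# The iterator sees the DB as of snapshot creation time; concurrent writes
# proceed unblocked and are simply not visible to this scan.
result = scan_all(db, iterator, :rocksdb.iterator_move(iterator, :first))

:ok = :rocksdb.iterator_close(iterator)
:ok = :rocksdb.release_snapshot(snapshot)
```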

Consider restructuring the prefixes

do you see a way to restructure the data storage - to segregate exitable utxos per key==account?

Use column families to separate logical entity types

Currently all data — outputs, blocks, competitor info, etc. — is stored in, and prefix-seeks against, the same single table.

RocksDB supports partitioning a database instance into multiple column families. All databases are created with a column family named "default", which is used for operations where column family is unspecified.

RocksDB guarantees users a consistent view across column families, including after crash recovery when WAL is enabled or atomic flush is enabled. It also supports atomic cross-column family operations via the WriteBatch API. https://github.com/facebook/rocksdb/wiki/Column-Families

We can tune the config per column family to suit each column family's data characteristics and access pattern.
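A sketch of what this could look like via `:rocksdb.open_with_cf/3` from the erlang-rocksdb binding; the column family names and options below are illustrative only, not a proposed schema:

```elixir
# Sketch: one column family per logical entity type. Option names assumed
# from the erlang-rocksdb binding; per-CF options can be tuned separately.
cf_descriptors = [
  {~c"default", []},
  {~c"utxos", []},
  {~c"blocks", []},
  {~c"competitors", []}
]

{:ok, db, [_default_cf, utxos_cf, blocks_cf, _competitors_cf]} =
  :rocksdb.open_with_cf(
    ~c"/tmp/cf_demo",
    [create_if_missing: true, create_missing_column_families: true],
    cf_descriptors
  )

# Reads and writes then target a specific family, so prefix seeks over
# outputs no longer scan past block or competitor data.
:ok = :rocksdb.put(db, blocks_cf, "blk_1", "block_bytes", [])
{:ok, _block} = :rocksdb.get(db, blocks_cf, "blk_1", [])
:ok = :rocksdb.put(db, utxos_cf, "utxopos_1", "output_bytes", [])
```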

Index optimization

An index block contains one entry per data block, where the key is a string >= last key in that data block and before the first key in the successive data block. The value is the BlockHandle (file offset and length) for the data block.

https://github.com/facebook/rocksdb/wiki/Index-Block-Format

It might be possible to check the current data block size and see whether we can improve seek time by optimizing what we use as the key prefix, or alternatively by adjusting the block size.

Also more details on competing data and index caching:

When index/filter blocks are stored in block cache they are effectively competing with data blocks (as well as with each other) on this scarce resource. A filter of size 5MB is occupying the space that could otherwise be used to cache 1000s of data blocks (of size 4KB). This would result in more cache misses for data blocks. The large index/filter blocks also kick each other out of the block cache more often and exacerbate their own cache miss rate too. This is while only a small part of the index/filter block might have been actually used during its lifetime in the cache.

After the cache miss of an index/filter block, it has to be reloaded from the disk, and its large size is not helping in reducing the IO cost. While a simple point lookup might need at most a couple of data block reads (of size 4KB) one from each layer of LSM, it might end up also loading multiple megabytes of index/filter blocks. If that happens often then the disk is spending more time serving index/filter blocks rather than the actual data blocks.

https://github.com/facebook/rocksdb/wiki/Partitioned-Index-Filters#what-is-the-big-deal-with-large-indexfilter-blocks
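A sketch of the relevant open options (a config fragment; the option names are assumptions mapped from RocksDB's BlockBasedTableOptions and need to be checked against the erlang-rocksdb binding):

```elixir
# Illustrative block-based table tuning — values here are examples, not
# recommendations; benchmark against our actual access patterns.
open_opts = [
  create_if_missing: true,
  block_based_table_options: [
    # Larger data blocks mean fewer index entries and a smaller index,
    # at the cost of reading more bytes per point lookup.
    {:block_size, 16 * 1024},
    # Keep index/filter blocks in the block cache rather than on the heap,
    # so their memory use is bounded and accounted for...
    {:cache_index_and_filter_blocks, true},
    # ...and partition them so only the hot pieces compete with data blocks.
    {:partition_filters, true}
  ]
]
```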

Non-sync writes

Since at the application level we drop non-closed blocks on a crash anyway, we should be able to use non-sync writes in RocksDB too? https://github.com/facebook/rocksdb/wiki/Basic-Operations#non-sync-writes
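For reference, a sketch using erlang-rocksdb's `:rocksdb.write/3` (`sync: false` is the RocksDB default; it trades durability of the most recent writes on a crash for lower write latency — `db` and the keys are hypothetical):

```elixir
# Sketch: batch related updates and write them without waiting for fsync.
actions = [
  {:put, "utxopos_1", "output_1"},
  {:put, "utxopos_2", "output_2"},
  {:delete, "utxopos_0"}
]

:ok = :rocksdb.write(db, actions, sync: false)
```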

Write stalls

Avoid write stalls: https://github.com/facebook/rocksdb/wiki/Write-Stalls

Secondary instance

Read-only operations can be routed to secondary instance: https://github.com/facebook/rocksdb/wiki/Secondary-instance

Replication

https://github.com/facebook/rocksdb/wiki/Replication-Helpers

unnawut commented 4 years ago

A reverse that's rarely used:

https://github.com/omgnetwork/elixir-omg/blob/master/apps/omg_db/lib/omg_db/rocksdb/core.ex#L125

  defp do_filter_keys(reference, prefix) do
    # https://github.com/facebook/rocksdb/wiki/Prefix-Seek-API-Changes#use-readoptionsprefix_seek
    {:ok, iterator} = :rocksdb.iterator(reference, prefix_same_as_start: true)
    move_iterator = :rocksdb.iterator_move(iterator, {:seek, prefix})
    Enum.reverse(search(reference, iterator, move_iterator, [])) # <--- This Enum.reverse()
  end

InoMurko commented 4 years ago

A reverse that's rarely used:

https://github.com/omgnetwork/elixir-omg/blob/master/apps/omg_db/lib/omg_db/rocksdb/core.ex#L125

  defp do_filter_keys(reference, prefix) do
    # https://github.com/facebook/rocksdb/wiki/Prefix-Seek-API-Changes#use-readoptionsprefix_seek
    {:ok, iterator} = :rocksdb.iterator(reference, prefix_same_as_start: true)
    move_iterator = :rocksdb.iterator_move(iterator, {:seek, prefix})
    Enum.reverse(search(reference, iterator, move_iterator, [])) # <--- This Enum.reverse()
  end

What did you mean by that?

unnawut commented 4 years ago

@InoMurko The Enum.reverse(...) in Enum.reverse(search(reference, iterator, move_iterator, [])). I don't think any of its callers are order-sensitive.

Not a major perf impact though, and only a slight impact on memory consumption.
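If that holds, a hypothetical variant without the reverse would be (sketch only — valid only if no caller really depends on key order):

```elixir
  defp do_filter_keys_unordered(reference, prefix) do
    # Same prefix seek as do_filter_keys/2, but returns the accumulator
    # as-is (reverse key order) instead of reversing it.
    {:ok, iterator} = :rocksdb.iterator(reference, prefix_same_as_start: true)
    move_iterator = :rocksdb.iterator_move(iterator, {:seek, prefix})
    search(reference, iterator, move_iterator, [])
  end
```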

InoMurko commented 4 years ago

ah, you mean that. If I remember correctly Piotr wanted it.