ipfs / kubo

An IPFS implementation in Go
https://docs.ipfs.tech/how-to/command-line-quick-start/

Read amplification on writes kills write-perf on large repositories #9678

Open hsanjuan opened 1 year ago

hsanjuan commented 1 year ago


Description

In the past few days I have had the luxury of diving a little deeper to improve ingestion performance on Kubo, after dag import slowed to ridiculous speeds.

My test setup uses Kubo from master (0.19.0-dev) in offline mode, with the Pebble datastore backend and ZFS set up on spinning disks with in-memory and NVMe caching. The first thing to notice is that, while overall write rates were slow during dag import, read rates were maxed out.

I discovered the following:

First, block writes from the CAR are batched (ipld-format.Batch). These batches have hardcoded size and item-count limits:

Assuming any modern laptop has 20 cores, we arrive at batches of 400KB at most, or 6 items per batch (!). These limits are VERY low when importing several GBs and millions of blocks, and they result in a very large number of transactions.
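To make the arithmetic concrete, here is a minimal sketch. The 8 MiB / 128-node totals below are inferred from the figures above (400KB × 20 cores, 6 × 20 ≈ 128), not copied from go-ipld-format, so treat this as an illustration of the scaling rather than the exact constants:

```go
package main

import (
	"fmt"
	"runtime"
)

func main() {
	// Assumed overall budgets, inferred from the numbers quoted above:
	// they get split across NumCPU parallel batch commits.
	const totalSizeBudget = 8 << 20 // ~8 MiB
	const totalNodeBudget = 128

	cpus := runtime.NumCPU()
	fmt.Printf("cores=%d -> ~%dKB and %d items per batch\n",
		cpus, totalSizeBudget/cpus/1000, totalNodeBudget/cpus)
	// With 20 cores this prints roughly "419KB and 6 items per batch".
}
```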

Both the blockservice and the blockstore layers perform a Has() on every element of the Batch, under the assumption that a Has() is cheaper than re-writing a block that is already present.
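To make the read amplification concrete, here is a rough sketch of that check-before-write pattern (illustrative only, not the actual blockservice/blockstore code; import paths assume the pre-boxo go-ipfs-blockstore module): every block in a batch costs an existence lookup per layer before anything is written.

```go
package sketch

import (
	"context"

	blocks "github.com/ipfs/go-block-format"
	blockstore "github.com/ipfs/go-ipfs-blockstore"
)

// putManyWithHasCheck illustrates the Has()-before-Put() pattern described
// above; it is not the real implementation.
func putManyWithHasCheck(ctx context.Context, bs blockstore.Blockstore, blks []blocks.Block) error {
	toPut := make([]blocks.Block, 0, len(blks))
	for _, b := range blks {
		ok, err := bs.Has(ctx, b.Cid()) // one existence lookup per block, per layer
		if err != nil {
			return err
		}
		if !ok {
			toPut = append(toPut, b)
		}
	}
	// Only blocks the store claims not to have get written.
	return bs.PutMany(ctx, toPut)
}
```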

The blockstore is additionally wrapped in Bloom and ARC caches, but the default size of the bloom filter is 512KiB with 7 hashes, and the ARC cache is a mere 64KiB:
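For reference, this is roughly how those caches are wired up; a sketch assuming go-ipfs-blockstore's CacheOpts/CachedBlockstore API, with the sizes quoted above spelled out:

```go
package sketch

import (
	"context"

	ds "github.com/ipfs/go-datastore"
	dssync "github.com/ipfs/go-datastore/sync"
	blockstore "github.com/ipfs/go-ipfs-blockstore"
)

func cachedBlockstore(ctx context.Context) (blockstore.Blockstore, error) {
	// In-memory datastore as a stand-in for the real backend.
	base := blockstore.NewBlockstore(dssync.MutexWrap(ds.NewMapDatastore()))

	// Roughly the defaults described above; every Has()/Put() consults the
	// bloom filter (7 hashes) and the small ARC before touching the backend.
	opts := blockstore.CacheOpts{
		HasBloomFilterSize:   512 << 10, // 512 KiB bloom filter
		HasBloomFilterHashes: 7,
		HasARCCacheSize:      64 << 10, // the "mere 64KiB" ARC
	}
	return blockstore.CachedBlockstore(ctx, base, opts)
}
```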

Anything that requires writing many thousands of blocks (let alone millions) will result in:

The happy path here is when a) the block is not known and b) the user has configured a large-enough bloom filter (I'm unsure how much impact hashing the key 14 times has).

The ugly side is when Has() hits the datastore backend. A modern backend like Badger or Pebble will include an additional BloomFilter and potentially a BlockCache too. However:

So I ran tests with dag import using Pebble as the backend, with a write-through blockservice and blockstore, no ARC/bloom-filter caching, and increased Batch item counts and Batch sizes. The time it takes to import ~10GB (~10M blocks) of DAG went from 60+ minutes to 6 (a 10x improvement).
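For context, that configuration looks roughly like the sketch below. It assumes go-blockservice's NewWriteThrough constructor and go-ipld-format's exported MaxSizeBatchOption/MaxNodesBatchOption (import paths are the pre-boxo modules used around 0.19); the limits shown are illustrative, not the exact values from my test run:

```go
package sketch

import (
	"context"

	blockservice "github.com/ipfs/go-blockservice"
	ds "github.com/ipfs/go-datastore"
	dssync "github.com/ipfs/go-datastore/sync"
	blockstore "github.com/ipfs/go-ipfs-blockstore"
	offline "github.com/ipfs/go-ipfs-exchange-offline"
	format "github.com/ipfs/go-ipld-format"
	merkledag "github.com/ipfs/go-merkledag"
)

func newImportBatch(ctx context.Context) *format.Batch {
	// Plain blockstore: no Bloom/ARC wrapping, so lookups go straight to the
	// datastore backend (an in-memory stand-in for Pebble here).
	bstore := blockstore.NewBlockstore(dssync.MutexWrap(ds.NewMapDatastore()))

	// Write-through blockservice: skip the Has() check before writing.
	bsvc := blockservice.NewWriteThrough(bstore, offline.Exchange(bstore))
	dagSrv := merkledag.NewDAGService(bsvc)

	// Much larger batches than the defaults discussed above.
	return format.NewBatch(ctx, dagSrv,
		format.MaxSizeBatchOption(100<<20), // ~100 MiB of blocks per batch
		format.MaxNodesBatchOption(20000),  // far more than ~6 items per batch
	)
}
```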

Disk stats show how read pressure went from being maxed out at 500MB/s while sustaining very few writes, to a more sustainable read pressure and higher write throughput.

(image: disk I/O stats before and after the changes)

The issue with Has() calls was additionally confirmed by pprof profiles. My imported CAR files have about 25% overlap with existing blocks in the repository, so I'm definitely not able to enjoy the fast path that hits only the bloom filter (even though it is configured with a large size). The fact that the Pebble backend also reads the values for the keys on Has() calls when the block exists does not make things better either, but flatfs would probably be even worse.
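For illustration, this is roughly what a Has() built on top of Pebble's Get() looks like (a sketch, not the go-ds-pebble code): when the key exists, the value is fetched just to be thrown away.

```go
package sketch

import "github.com/cockroachdb/pebble"

// hasViaGet sketches a Has() implemented with Get(), as described above:
// for existing keys the value is read even though only existence matters.
// Not the actual go-ds-pebble implementation.
func hasViaGet(db *pebble.DB, key []byte) (bool, error) {
	value, closer, err := db.Get(key)
	if err == pebble.ErrNotFound {
		return false, nil
	}
	if err != nil {
		return false, err
	}
	_ = value // fetched only to be discarded
	closer.Close()
	return true, nil
}
```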

Overall this explains a lot of the perf bottleneck issues we have seen in the past when pinning large DAGs on Kubo nodes with large datastores (in this case using flatfs): the disks get hammered with read operations, and this affects the speed at which writes and everything else can happen.

My recommendations would be the following:

hsanjuan commented 1 year ago

Other issues I have found:

hsanjuan commented 1 year ago

Another 5-10% of time was saved by removing CBOR decoding of blocks and just doing blockstore.PutMany() with slices of 100MB worth of blocks.

CBOR decoding showed up prominently in CPU profiles. The ipld-format.Batch code, which sends batches asynchronously with NumCPU() parallelism, does not seem to be worth the effort (for Pebble). Not doing that actually causes less write pressure (only one batch committed at a time). I have no idea whether anyone measured anything before opting for the current double-decoding.
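For context, the replacement path has roughly the following shape; a sketch assuming go-car/v2's NewBlockReader and the go-ipfs-blockstore PutMany signature, with the ~100MB flush threshold mentioned above (not the exact code I used):

```go
package sketch

import (
	"context"
	"io"

	blocks "github.com/ipfs/go-block-format"
	blockstore "github.com/ipfs/go-ipfs-blockstore"
	carv2 "github.com/ipld/go-car/v2"
)

// importCAR streams blocks out of a CAR file and writes them with PutMany in
// large chunks, with no per-block decoding and no Has() checks.
func importCAR(ctx context.Context, r io.Reader, bs blockstore.Blockstore) error {
	br, err := carv2.NewBlockReader(r)
	if err != nil {
		return err
	}

	const flushBytes = 100 << 20 // ~100MB of raw block data per PutMany call
	var (
		buf  []blocks.Block
		size int
	)
	for {
		blk, err := br.Next()
		if err == io.EOF {
			break
		}
		if err != nil {
			return err
		}
		buf = append(buf, blk)
		size += len(blk.RawData())
		if size >= flushBytes {
			if err := bs.PutMany(ctx, buf); err != nil {
				return err
			}
			buf, size = buf[:0], 0
		}
	}
	if len(buf) == 0 {
		return nil
	}
	return bs.PutMany(ctx, buf)
}
```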

Jorropo commented 1 year ago

The bloom filter is not built "as we go", but rather fully primed on start, which triggers a datastore-wide query for ALL keys. This is not datastore friendly in the case of Pebble (will result in essentially reading the full datastore on every start). Badger can and does store keys and values separately, but it will also mean a fair amount of reading as soon as indexes don't fit in memory.

Sounds like we could checksum the datastore in some way and save the current bloom filter when stopping; then, when restarting, if the datastore checksum has not changed we would just load the existing bloom filter. I guess we would checksum the top of the journal, so we don't have to checksum the full dataset (which would solve nothing), or ideally badger, pebble, ... already have a similar option.
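A very rough sketch of that idea (every helper below is hypothetical, made up for illustration; nothing like this exists today): the expensive all-keys priming query only runs when the saved fingerprint no longer matches.

```go
package sketch

import "context"

// bloomFilter stands in for the real bloom filter type; everything in this
// file is a hypothetical sketch of the idea above, not existing Kubo code.
type bloomFilter struct{}

// Hypothetical helpers, made up for illustration.
func datastoreFingerprint(ctx context.Context) (string, error) {
	// e.g. a checksum of the backend's journal/MANIFEST head, so computing
	// the fingerprint does not require re-reading the whole dataset.
	return "", nil
}
func loadSavedBloom(fingerprint string) (*bloomFilter, bool)          { return nil, false }
func saveBloom(fingerprint string, b *bloomFilter) error              { return nil }
func primeBloomFromAllKeys(ctx context.Context) (*bloomFilter, error) { return &bloomFilter{}, nil }

// bloomOnStart sketches the suggestion above: reuse a persisted bloom filter
// when the datastore fingerprint is unchanged, and only fall back to the
// expensive all-keys priming query otherwise.
func bloomOnStart(ctx context.Context) (*bloomFilter, error) {
	fp, err := datastoreFingerprint(ctx)
	if err != nil {
		return nil, err
	}
	if b, ok := loadSavedBloom(fp); ok {
		return b, nil // fingerprint matches: skip the datastore-wide key scan
	}
	b, err := primeBloomFromAllKeys(ctx) // current behaviour: read every key
	if err != nil {
		return nil, err
	}
	return b, saveBloom(fp, b)
}
```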

hsanjuan commented 1 year ago

Or just disable the bloom filter and let badger3/pebble/flatfs handle speeding up lookups.