ipfs / kubo

An IPFS implementation in Go
https://docs.ipfs.tech/how-to/command-line-quick-start/

Read amplification on writes kills write-perf on large repositories #9678

Open hsanjuan opened 1 year ago

hsanjuan commented 1 year ago


Description

In the past few days I have had the luxury of diving a little deeper to improve ingestion performance on Kubo, after dag import slowed to ridiculous speeds.

My test setup uses Kubo from master (0.19.0-dev) in offline mode, with the Pebble datastore backend and ZFS set up on spinning disks with in-memory and NVMe caching. The first thing to notice is that, while overall write rates were slow during dag import, read rates were maxed out.

I discovered the following:

First, block writes from the CAR are batched (ipld-format.Batch). These batches have hardcoded size and item-count limits:

Assuming any modern laptop has 20 cores, we arrive at batches of 400KB at most, or 6 items per batch (!). These limits are VERY low when importing several GBs and millions of blocks, and they result in a very large number of transactions.
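To make the arithmetic concrete, here is a minimal sketch. The 8 MiB / 128-node totals below are inferred from the figures above (400KB × 20 cores, 6 × 20 ≈ 128), not copied from go-ipld-format, so treat this as an illustration of the scaling rather than the exact constants:

```go
package main

import (
	"fmt"
	"runtime"
)

func main() {
	// Assumed overall budgets, inferred from the numbers quoted above:
	// they get split across NumCPU parallel batch commits.
	const totalSizeBudget = 8 << 20 // ~8 MiB
	const totalNodeBudget = 128

	cpus := runtime.NumCPU()
	fmt.Printf("cores=%d -> ~%dKB and %d items per batch\n",
		cpus, totalSizeBudget/cpus/1000, totalNodeBudget/cpus)
	// With 20 cores this prints roughly "419KB and 6 items per batch".
}
```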

Both the blockservice and the blockstore layers perform a Has() on every element of the Batch, under the assumption that a Has() is cheaper than re-writing a block that is already present.
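To make the read amplification concrete, here is a rough sketch of that check-before-write pattern (illustrative only, not the actual blockservice/blockstore code; import paths assume the pre-boxo go-ipfs-blockstore module): every block in a batch costs an existence lookup per layer before anything is written.

```go
package sketch

import (
	"context"

	blocks "github.com/ipfs/go-block-format"
	blockstore "github.com/ipfs/go-ipfs-blockstore"
)

// putManyWithHasCheck illustrates the Has()-before-Put() pattern described
// above; it is not the real implementation.
func putManyWithHasCheck(ctx context.Context, bs blockstore.Blockstore, blks []blocks.Block) error {
	toPut := make([]blocks.Block, 0, len(blks))
	for _, b := range blks {
		ok, err := bs.Has(ctx, b.Cid()) // one existence lookup per block, per layer
		if err != nil {
			return err
		}
		if !ok {
			toPut = append(toPut, b)
		}
	}
	// Only blocks the store claims not to have get written.
	return bs.PutMany(ctx, toPut)
}
```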

The blockstore is additionally wrapped in Bloom and ARC caches, but the default size of the bloom filter is 512KiB with 7 hashes, and the ARC cache is a mere 64KiB:
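For reference, this is roughly how those caches are wired up; a sketch assuming go-ipfs-blockstore's CacheOpts/CachedBlockstore API, with the sizes quoted above spelled out:

```go
package sketch

import (
	"context"

	ds "github.com/ipfs/go-datastore"
	dssync "github.com/ipfs/go-datastore/sync"
	blockstore "github.com/ipfs/go-ipfs-blockstore"
)

func cachedBlockstore(ctx context.Context) (blockstore.Blockstore, error) {
	// In-memory datastore as a stand-in for the real backend.
	base := blockstore.NewBlockstore(dssync.MutexWrap(ds.NewMapDatastore()))

	// Roughly the defaults described above; every Has()/Put() consults the
	// bloom filter (7 hashes) and the small ARC before touching the backend.
	opts := blockstore.CacheOpts{
		HasBloomFilterSize:   512 << 10, // 512 KiB bloom filter
		HasBloomFilterHashes: 7,
		HasARCCacheSize:      64 << 10, // the "mere 64KiB" ARC
	}
	return blockstore.CachedBlockstore(ctx, base, opts)
}
```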

Anything that requires writing many thousands of blocks (let alone millions) will result in:

The happy path here is when a) the block is not known and b) the user has configured a large-enough bloom filter (I'm unsure how much impact hashing the key 14 times has).

The ugly side is when Has() hits the datastore backend. A modern backend like Badger or Pebble will include an additional BloomFilter and potentially a BlockCache too. However:

So I ran tests with dag import using Pebble as the backend, with a write-through blockservice and blockstore, no ARC/bloom-filter caching, and increased Batch item counts and Batch sizes. The time it takes to import ~10GB (~10M blocks) of DAG went from 60+ minutes to 6 (a 10x improvement).
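For context, that configuration looks roughly like the sketch below. It assumes go-blockservice's NewWriteThrough constructor and go-ipld-format's exported MaxSizeBatchOption/MaxNodesBatchOption (import paths are the pre-boxo modules used around 0.19); the limits shown are illustrative, not the exact values from my test run:

```go
package sketch

import (
	"context"

	blockservice "github.com/ipfs/go-blockservice"
	ds "github.com/ipfs/go-datastore"
	dssync "github.com/ipfs/go-datastore/sync"
	blockstore "github.com/ipfs/go-ipfs-blockstore"
	offline "github.com/ipfs/go-ipfs-exchange-offline"
	format "github.com/ipfs/go-ipld-format"
	merkledag "github.com/ipfs/go-merkledag"
)

func newImportBatch(ctx context.Context) *format.Batch {
	// Plain blockstore: no Bloom/ARC wrapping, so lookups go straight to the
	// datastore backend (an in-memory stand-in for Pebble here).
	bstore := blockstore.NewBlockstore(dssync.MutexWrap(ds.NewMapDatastore()))

	// Write-through blockservice: skip the Has() check before writing.
	bsvc := blockservice.NewWriteThrough(bstore, offline.Exchange(bstore))
	dagSrv := merkledag.NewDAGService(bsvc)

	// Much larger batches than the defaults discussed above.
	return format.NewBatch(ctx, dagSrv,
		format.MaxSizeBatchOption(100<<20), // ~100 MiB of blocks per batch
		format.MaxNodesBatchOption(20000),  // far more than ~6 items per batch
	)
}
```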

Disk stats show how read pressure went from being maxed out at 500MB/s while sustaining very few writes, to a more sustainable read pressure and higher write throughput.

(image: disk I/O stats before and after the changes)

The issue with Has() calls was additionally confirmed by pprof profiles. My imported CAR files have about 25% overlap with existing blocks in the repository, so I'm definitely not able to enjoy the fast path that hits only the bloom filter (even though it is configured with a large size). The fact that the Pebble backend also reads the values for the keys on Has() calls when the block exists does not make things better either, but flatfs would probably be even worse.
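For illustration, this is roughly what a Has() built on top of Pebble's Get() looks like (a sketch, not the go-ds-pebble code): when the key exists, the value is fetched just to be thrown away.

```go
package sketch

import "github.com/cockroachdb/pebble"

// hasViaGet sketches a Has() implemented with Get(), as described above:
// for existing keys the value is read even though only existence matters.
// Not the actual go-ds-pebble implementation.
func hasViaGet(db *pebble.DB, key []byte) (bool, error) {
	value, closer, err := db.Get(key)
	if err == pebble.ErrNotFound {
		return false, nil
	}
	if err != nil {
		return false, err
	}
	_ = value // fetched only to be discarded
	closer.Close()
	return true, nil
}
```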

Overall this explains a lot of the perf bottleneck issues we have seen in the past when pinning large DAGs on Kubo nodes with large datastores (in this case using flatfs): the disks get hammered with read operations, and this affects the speed at which writes and everything else can happen.

My recommendations would be the following:

hsanjuan commented 1 year ago

Other issues I have found:

hsanjuan commented 1 year ago

Another 5-10% of time was saved by removing CBOR decoding of blocks and just doing blockstore.PutMany() with slices of 100MB worth of blocks.

CBOR decoding showed up prominently in CPU profiles. The ipld-format.Batch code, which sends batches asynchronously with NumCPU() parallelism, does not seem to be worth the effort (for Pebble). Not doing that actually causes less write pressure (only one batch committed at a time). I have no idea whether anyone measured anything before opting for the current double-decoding.
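For context, the replacement path has roughly the following shape; a sketch assuming go-car/v2's NewBlockReader and the go-ipfs-blockstore PutMany signature, with the ~100MB flush threshold mentioned above (not the exact code I used):

```go
package sketch

import (
	"context"
	"io"

	blocks "github.com/ipfs/go-block-format"
	blockstore "github.com/ipfs/go-ipfs-blockstore"
	carv2 "github.com/ipld/go-car/v2"
)

// importCAR streams blocks out of a CAR file and writes them with PutMany in
// large chunks, with no per-block decoding and no Has() checks.
func importCAR(ctx context.Context, r io.Reader, bs blockstore.Blockstore) error {
	br, err := carv2.NewBlockReader(r)
	if err != nil {
		return err
	}

	const flushBytes = 100 << 20 // ~100MB of raw block data per PutMany call
	var (
		buf  []blocks.Block
		size int
	)
	for {
		blk, err := br.Next()
		if err == io.EOF {
			break
		}
		if err != nil {
			return err
		}
		buf = append(buf, blk)
		size += len(blk.RawData())
		if size >= flushBytes {
			if err := bs.PutMany(ctx, buf); err != nil {
				return err
			}
			buf, size = buf[:0], 0
		}
	}
	if len(buf) == 0 {
		return nil
	}
	return bs.PutMany(ctx, buf)
}
```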

Jorropo commented 1 year ago

The bloom filter is not built "as we go", but rather fully primed on start, which triggers a datastore-wide query for ALL keys. This is not datastore friendly in the case of Pebble (will result in essentially reading the full datastore on every start). Badger can and does store keys and values separately, but it will also mean a fair amount of reading as soon as indexes don't fit in memory.

Sounds like we could checksum the datastore in some way and save the current bloom filter when stopping; then, when restarting, if the datastore checksum has not changed we would just load the existing bloom filter. I guess we would checksum the top of the journal, so we don't have to checksum the full dataset (which would solve nothing), or ideally badger, pebble, ... already have a similar option.
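A very rough sketch of that idea (every helper below is hypothetical, made up for illustration; nothing like this exists today): the expensive all-keys priming query only runs when the saved fingerprint no longer matches.

```go
package sketch

import "context"

// bloomFilter stands in for the real bloom filter type; everything in this
// file is a hypothetical sketch of the idea above, not existing Kubo code.
type bloomFilter struct{}

// Hypothetical helpers, made up for illustration.
func datastoreFingerprint(ctx context.Context) (string, error) {
	// e.g. a checksum of the backend's journal/MANIFEST head, so computing
	// the fingerprint does not require re-reading the whole dataset.
	return "", nil
}
func loadSavedBloom(fingerprint string) (*bloomFilter, bool)          { return nil, false }
func saveBloom(fingerprint string, b *bloomFilter) error              { return nil }
func primeBloomFromAllKeys(ctx context.Context) (*bloomFilter, error) { return &bloomFilter{}, nil }

// bloomOnStart sketches the suggestion above: reuse a persisted bloom filter
// when the datastore fingerprint is unchanged, and only fall back to the
// expensive all-keys priming query otherwise.
func bloomOnStart(ctx context.Context) (*bloomFilter, error) {
	fp, err := datastoreFingerprint(ctx)
	if err != nil {
		return nil, err
	}
	if b, ok := loadSavedBloom(fp); ok {
		return b, nil // fingerprint matches: skip the datastore-wide key scan
	}
	b, err := primeBloomFromAllKeys(ctx) // current behaviour: read every key
	if err != nil {
		return nil, err
	}
	return b, saveBloom(fp, b)
}
```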

hsanjuan commented 1 year ago

Or just disable the bloom filter and let badger3/pebble/flatfs handle speeding up lookups.