ipfs / helia

An implementation of IPFS in JavaScript
https://helia.io

fix: skips exporting duplicate blocks as they are encountered #557

Open jtsmedley opened 2 weeks ago

jtsmedley commented 2 weeks ago

Title

Skip exporting duplicate blocks in @helia/car

Description

When calling export on an @helia/car instance, the yielded stream can contain duplicate blocks. This PR skips yielding duplicate blocks, resulting in more compact CAR files.

Notes & open questions

Added a Set to track which CIDs have already been written to the stream. We might need to add an option to toggle this behavior in case any users rely on duplicate blocks in the output of CAR exports.
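As a rough illustration of the approach (a minimal sketch, not the actual @helia/car internals — the block shapes and function names below are made up):

```javascript
// Sketch of Set-based deduplication while streaming blocks.
// `dedupeBlocks` and the block shape are illustrative only.
function * dedupeBlocks (blocks) {
  const writtenBlocks = new Set()

  for (const block of blocks) {
    const key = block.cid.toString()

    if (writtenBlocks.has(key)) {
      continue // already yielded this CID, skip the duplicate
    }

    writtenBlocks.add(key)
    yield block
  }
}

const blocks = [
  { cid: 'bafy-a', bytes: new Uint8Array([1]) },
  { cid: 'bafy-a', bytes: new Uint8Array([1]) }, // duplicate
  { cid: 'bafy-b', bytes: new Uint8Array([2]) }
]

console.log([...dedupeBlocks(blocks)].length) // 2 - duplicate skipped
```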

Change checklist

jtsmedley commented 2 weeks ago

Fixes #556

jtsmedley commented 2 weeks ago

For the sake of consistency: Kubo does not export duplicate blocks either, at least in my limited testing.

lidel commented 2 weeks ago

Dropping a drive-by comment with some historical context; hopefully it's useful and not noise :)

What do the specs say?

Last time we looked into this, the default behavior was left unspecified, and trustless gateway responses have an explicit note about this:

Gateway's CAR responses in Kubo and Rainbow (both backed by boxo/gateway) do not include duplicates by default.

There is a dups parameter that can be used to make this explicit, allowing duplicate presence to be signaled in the response via Content-Type header params:

Kubo/Rainbow use it to signal that the response does NOT include duplicates. AFAIK no client requests CARs with a specific dups value. If there are no duplicates, bandwidth is saved; if they are present, they are a no-op.
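For illustration, a client could read this signal from the Content-Type params like so (the header value reflects my understanding of the trustless gateway spec; the parsing helper is a simplified sketch, not a full media-type parser):

```javascript
// Example CAR response Content-Type; `dups=n` signals the response
// contains no duplicate blocks (per the trustless gateway spec).
const contentType = 'application/vnd.ipld.car; version=1; order=dfs; dups=n'

// Simplified param parsing (not a full RFC 9110 media-type parser)
const params = Object.fromEntries(
  contentType
    .split(';')
    .slice(1)
    .map(part => part.trim().split('='))
)

console.log(params.dups) // 'n'
```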

Sidenote: Ok, but when are duplicates useful?

Duplicates have niche utility when you have a light client with limited memory that can't cache blocks for later use and has to “consume” blocks by unpacking data on the fly and then discarding them (no blockstore, no cache, not even in-memory).

Such a client would explicitly opt in to receiving duplicates via dups=t. FWIW, as of today I am not aware of any clients that require this behavior; every client is usually smart enough to spawn a follow-up application/vnd.ipld.raw request for any missing blocks, as this works with every gateway.

On this PR

I think the only concern here is that writtenBlocks could grow in size and become a DoS / OOM vector if a user tricks the system into exporting a big DAG with many duplicated CIDs.

Having the ability to control this potential memory growth at the library level (i.e. disable deduplication, and with it the memory cost) would be nice, but it can be added once this becomes an actual rather than a theoretical problem. Up to the Helia maintainers.

achingbrain commented 2 weeks ago

I think the only concern here is that writtenBlocks could grow in size and become a DoS / OOM vector if a user tricks the system into exporting a big DAG with many duplicated CIDs.

This can be mitigated by using a probabilistic filter instead of a set; that way we don't need to store every CID encountered.

See createScalableCuckooFilter in @libp2p/utils

import { createScalableCuckooFilter } from '@libp2p/utils/filters'

// maxItems: expected number of items; errorRate: acceptable false-positive rate
const filter = createScalableCuckooFilter(maxItems, errorRate)

filter.has(bytes) // returns boolean
filter.add(bytes) // `.has` will probably now return `true`

The only thing to consider is the reliability of the filter. False positives are possible but false negatives are not: if .has returns false, the item is definitely not present; if it returns true, the probability that the item is actually present is a function of the errorRate.

The implementation should be structured so that there may be the (very) occasional duplicate block, but there should never be a missed block.

achingbrain commented 2 weeks ago

Good to know about the Kubo behaviour too; we should align with that, though an allowDuplicateBlocks: boolean option would be good to have.

jtsmedley commented 2 weeks ago

I need the allowDuplicateBlocks: boolean option anyway, so I will go ahead and add it to this PR to round it out.

jtsmedley commented 1 week ago

In this PR I am sticking with a set, as I need the resulting CAR files to never contain duplicates, and I need that behavior to be consistent.

achingbrain commented 1 week ago

If you merge main, the current CI error will go away.

jtsmedley commented 1 week ago

Switched to passing in an external filter, allowing the consumer to choose how blocks are filtered without the library prescribing a specific implementation.

blocksFilter: Filter (from @libp2p/utils/filters)
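The consumer-supplied filter pattern can be sketched like this (a stand-in Filter shape modeled on the has/add interface of @libp2p/utils/filters; the exporter and option name below are illustrative, not the actual @helia/car API):

```javascript
// Sketch of the pluggable-filter pattern. The Filter shape (has/add)
// mirrors @libp2p/utils/filters; everything else here is made up
// for illustration, not the real @helia/car implementation.

// An exact, Set-backed filter a consumer might supply when they need
// guaranteed deduplication (no false positives, at the cost of memory)
function createSetFilter () {
  const seen = new Set()
  return {
    has: (key) => seen.has(key),
    add: (key) => { seen.add(key) }
  }
}

// Exporter that delegates the "have I written this block?" decision
// to whatever filter the consumer provides
function * exportBlocks (blocks, blocksFilter) {
  for (const block of blocks) {
    if (blocksFilter.has(block.cid)) {
      continue // filter says the block was (probably) already written
    }
    blocksFilter.add(block.cid)
    yield block
  }
}

const blocks = [
  { cid: 'bafy-a' },
  { cid: 'bafy-b' },
  { cid: 'bafy-a' } // duplicate
]

const out = [...exportBlocks(blocks, createSetFilter())]
console.log(out.map(b => b.cid)) // [ 'bafy-a', 'bafy-b' ]
```

A consumer could swap in a scalable cuckoo filter here to bound memory, accepting the small false-positive risk discussed above.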

jtsmedley commented 4 days ago

Are any other changes needed?