storacha / ipfs-car

🚘 Convert files to content-addressable archives and back

Support packing deterministic CAR files #76

Open vasco-santos opened 3 years ago

vasco-santos commented 3 years ago

Write the graph out in deterministic graph traversal order instead of in the order it parses the files

Current state

The current implementation of ipfs-car does not write the CAR file blocks in any specific order.

This means that, for the same file, we currently produce a different output than go-ipfs and js-ipfs, which do an ordered walk.

Motivation

Supporting deterministic outputs will enable ipfs-car to produce the same output CAR as the core ipfs implementation and move us towards supporting other use cases, such as interacting directly with Filecoin (and perhaps offline deals).

Implementation

Given we currently have two iterations (unixfs importer + blockstore iteration), we can support a deterministic output by getting the root and traversing the graph, as in https://github.com/ipld/js-datastore-car/blob/master/car.js#L198-L221
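
To illustrate the idea, here is a minimal sketch of an ordered walk from the root. It is not the linked js-datastore-car code: plain string ids stand in for CIDs, the in-memory Map stands in for the blockstore, and the name orderedWalk is hypothetical. It shows one deterministic order (depth-first, links followed in the order they appear, each block emitted once); whether that matches the go-ipfs output byte for byte is not verified here.

```javascript
// Hypothetical in-memory blockstore: string ids stand in for CIDs,
// and each block lists its child links in encoded order.
const blocks = new Map([
  ['root', { links: ['dirA', 'fileB'] }],
  ['dirA', { links: ['leaf1', 'leaf2'] }],
  ['fileB', { links: ['leaf2'] }], // leaf2 is shared by dirA and fileB
  ['leaf1', { links: [] }],
  ['leaf2', { links: [] }]
])

// Yield block ids in depth-first pre-order starting at the root,
// following links in the order they appear and emitting each block once.
function * orderedWalk (rootId, store) {
  const seen = new Set()
  const stack = [rootId]
  while (stack.length > 0) {
    const id = stack.pop()
    if (seen.has(id)) continue
    seen.add(id)
    const block = store.get(id)
    if (block === undefined) throw new Error(`missing block: ${id}`)
    yield id
    // Push links in reverse so the first link is popped (visited) first.
    for (let i = block.links.length - 1; i >= 0; i--) {
      stack.push(block.links[i])
    }
  }
}

const order = [...orderedWalk('root', blocks)]
console.log(order)
```

Because the order depends only on the root and the links inside each block, the same DAG yields the same block sequence no matter how the blockstore happened to be filled.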

We should make this optional and pluggable, given that we will need to add codecs and hashers, which would increase the dependency footprint for users who do not need deterministic CAR files.

Alternatively, we could support a different function that skips the two iterations and keeps everything in memory. This would be faster, and some users might be OK with the extra memory consumption. However, I would say that write performance when creating the CAR file is not the biggest concern, and we have been focusing on efficiency for reads more than for writes.

cc @rvagg @olizilla @mikeal @alanshaw

rvagg commented 3 years ago

Additional note on an ideal here: we should not hide yet another "ordered DAG walk" implementation in here. I think it probably belongs in js-multiformats, it's in the same family of concerns as the Block functionality already there. It's just complicated by the need to have multiple codecs available. Such a walk function could be provided with:

  1. a list of supported codecs so it can decode blocks with those
  2. a list of codecs that are ok to not decode (this has an easy default of raw, json, cbor but there are potentially more a user may want to supply)
  3. an indication of what to do when encountering a block that can't be decoded with the existing codecs - bail, or ignore?
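
As a rough sketch of how those three inputs might fit together (not an existing js-multiformats API): the names walkDag, decoders, skipCodecs and onUnknown are hypothetical, string ids stand in for CIDs, and 'toy-dag' is a made-up codec whose decoder returns the ids a block links to. Treating 'ignore' as "emit the block but do not follow its links" is also an assumption.

```javascript
// Hypothetical default for codecs that are fine to leave undecoded.
const DEFAULT_SKIP = ['raw', 'json', 'cbor']

// Walk a DAG depth-first from rootId, emitting each block id once.
// options.decoders: map of codec name -> function(bytes) returning link ids
// options.skipCodecs: codecs that are OK not to decode
// options.onUnknown: 'bail' (throw) or 'ignore' (emit, but do not descend)
function * walkDag (rootId, store, options = {}) {
  const decoders = options.decoders ?? {}
  const skip = new Set(options.skipCodecs ?? DEFAULT_SKIP)
  const onUnknown = options.onUnknown ?? 'bail'
  const seen = new Set()
  const stack = [rootId]
  while (stack.length > 0) {
    const id = stack.pop()
    if (seen.has(id)) continue
    seen.add(id)
    const block = store.get(id)
    if (block === undefined) throw new Error(`missing block: ${id}`)
    yield id
    if (skip.has(block.codec)) continue // treated as having no links to follow
    const decode = decoders[block.codec]
    if (decode === undefined) {
      if (onUnknown === 'bail') throw new Error(`no decoder for codec: ${block.codec}`)
      continue // 'ignore': keep the block, skip its links
    }
    const links = decode(block.bytes)
    // Push in reverse so links are visited in their encoded order.
    for (let i = links.length - 1; i >= 0; i--) stack.push(links[i])
  }
}

// Toy data: 'toy-dag' blocks carry their links as JSON; 'mystery' has no decoder.
const store = new Map([
  ['root', { codec: 'toy-dag', bytes: '{"links":["a","b"]}' }],
  ['a', { codec: 'raw', bytes: 'hello' }],
  ['b', { codec: 'mystery', bytes: '???' }]
])
const decoders = { 'toy-dag': (bytes) => JSON.parse(bytes).links }

const ignored = [...walkDag('root', store, { decoders, onUnknown: 'ignore' })]
console.log(ignored)

let bailed = false
try {
  Array.from(walkDag('root', store, { decoders })) // default onUnknown is 'bail'
} catch (err) {
  bailed = true
}
console.log(bailed)
```

Passing the decoders in, rather than importing them inside the walker, is what would keep the codec dependencies optional for callers who do not need them.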

https://github.com/ipfs/js-ipfs/blob/6a2c710e4b66e76184320769ff9789f1fbabe0d8/packages/ipfs-core/src/components/dag/export.js#L82-L107 has an implementation that's a little like this that we did for dag export. It would be good to implement something shared so we could even remove code from there.

mikeal commented 3 years ago

One requirement I’d like to surface here.

Users with large amounts of data are writing custom tooling to get their file data “into IPFS” so that they can then write out a CAR file suitable for Filecoin (which really needs to be deterministic).

There are obvious perf issues with moving this much data and suffering excessive copying in memory and on disc.

For these users:

dchoi27 commented 3 years ago

Given that you can retrieve the full DAG fine with a non-deterministic CAR file, this probably isn't the highest priority.

AugustoL commented 3 years ago

Hello, it would be great if the code maintainers or project managers could give this more priority. I'm working on a decentralized application and looking to migrate the content from IPFS to WEB3Storage, but I want to do it in a deterministic way.

olizilla commented 3 years ago

@AugustoL can you say more about what you need? For many use cases, the CAR itself won't need to be deterministically packed. You can import an identical DAG from it with ipfs dag import.