protocol / beyond-bitswap

Should we import datasets as CAR files? #23

Open hannahhoward opened 3 years ago

hannahhoward commented 3 years ago

Related to #18, I noticed that when we write tests that transfer a dataset (such as a folder or file) as opposed to randomly generated bytes, we import the data into UnixFS from the file system (in normal system format). If our intent is to truly test some of the datasets on https://awesome.ipfs.io/datasets/ as they exist on IPFS, the reliable way to bring them in is to export them as CAR files and import those into the tests. The reason is that a UnixFS import from system files is not guaranteed to produce the exact same DAG or root CID. Several variables affect how the DAG is built, such as chunking strategy, use of raw leaves, etc. The only reliable way to know you have the exact same DAG is to use CAR files. This might also make sense in terms of writing scripts to download datasets: as long as IPFS is, unfortunately, not as fast as HTTP on a fast hosted site, it's going to be much more efficient if we can import into the seed's blockstore from a CAR file on a CDN. Plus, that means we don't have to download ahead of time; we can probably just include it as part of the test, which makes things more reproducible on CI.
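To make the chunking point concrete, here is a toy sketch. It is not the real UnixFS importer: `toy_root` is a hypothetical helper that hashes fixed-size chunks with SHA-256 and then hashes the concatenated leaf digests, which is only a rough stand-in for how a DAG root depends on its leaves. The same bytes give a different root as soon as the chunk size changes.

```python
import hashlib


def toy_root(data: bytes, chunk_size: int) -> str:
    """Toy stand-in for a DAG root: hash each fixed-size chunk,
    then hash the concatenation of the leaf digests.
    (Hypothetical helper, NOT the real UnixFS/dag-pb layout.)"""
    leaves = [
        hashlib.sha256(data[i:i + chunk_size]).digest()
        for i in range(0, len(data), chunk_size)
    ]
    return hashlib.sha256(b"".join(leaves)).hexdigest()


# 1 MiB of deterministic bytes standing in for a dataset file.
data = bytes(range(256)) * 4096

root_256k = toy_root(data, 256 * 1024)   # four leaves
root_1m = toy_root(data, 1024 * 1024)    # one leaf

print("256 KiB chunks:", root_256k)
print("1 MiB chunks:  ", root_1m)
print("same root?", root_256k == root_1m)
```

The real importer has more knobs than this (raw leaves, DAG layout, CID version), so the effect there is at least as strong as in the toy model.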

Anyway, curious to get your thoughts @adlrocha -- also does this make sense? It may not be obvious if you haven't worked a lot with UnixFS files and DAG structures.

hannahhoward commented 3 years ago

Also, CAR import is easier to do without a full IPFS node.
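As a rough illustration of why no full node is needed, the sketch below walks a CARv1 stream using only the Python standard library. It assumes the CARv1 layout (a varint-prefixed header followed by varint-prefixed sections, each holding a CID and the block bytes) and skips the header without decoding it. The function names are hypothetical and this is not the code the testbed uses; putting the blocks into a blockstore is left out.

```python
import hashlib
import io


def read_varint(stream, allow_eof=False):
    """Read one unsigned LEB128 varint. Returns None on a clean EOF
    when allow_eof is set; raises on a truncated varint."""
    result, shift = 0, 0
    while True:
        byte = stream.read(1)
        if not byte:
            if allow_eof and shift == 0:
                return None
            raise ValueError("unexpected end of stream inside varint")
        result |= (byte[0] & 0x7F) << shift
        if byte[0] & 0x80 == 0:
            return result
        shift += 7


def split_section(section: bytes):
    """Split one CAR section into (cid_bytes, block_bytes)."""
    if section[:2] == b"\x12\x20":
        # CIDv0: a bare sha2-256 multihash, always 34 bytes.
        return section[:34], section[34:]
    buf = io.BytesIO(section)
    if read_varint(buf) != 1:
        raise ValueError("unsupported CID version")
    read_varint(buf)               # content codec (e.g. 0x55 raw, 0x70 dag-pb)
    read_varint(buf)               # multihash function code
    digest_len = read_varint(buf)  # multihash digest length
    end = buf.tell() + digest_len
    return section[:end], section[end:]


def iter_car_blocks(stream):
    """Yield (cid_bytes, block_bytes) for every block in a CARv1 stream.
    The DAG-CBOR header (roots, version) is skipped, not decoded."""
    header_len = read_varint(stream)
    stream.read(header_len)
    while True:
        length = read_varint(stream, allow_eof=True)
        if length is None:
            return
        section = stream.read(length)
        if len(section) != length:
            raise ValueError("truncated section")
        yield split_section(section)


# Synthetic one-block stream for demonstration. The header here is a
# placeholder (an empty CBOR map), not a valid CAR header.
payload = b"hello car"
cid = bytes([0x01, 0x55, 0x12, 0x20]) + hashlib.sha256(payload).digest()
section = cid + payload
car = bytes([1]) + b"\xa0" + bytes([len(section)]) + section

for cid_bytes, block in iter_car_blocks(io.BytesIO(car)):
    print(cid_bytes.hex(), len(block))
```

A real import would also decode the header to learn the root CIDs and would verify each block against the digest in its CID before storing it.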

adlrocha commented 3 years ago

Hey @hannahhoward. What you suggest makes total sense. However, ideally it would be great to support both schemes. We chose UnixFS because we wanted to give users the ability to play with the block size, chunking strategy, etc. in order to explore how these parameters affect the exchange performance. We also wanted to let researchers "easily bring their own datasets" to the testbed, and the easiest way we came up with was to just "drop your dataset into a folder".

On the other hand, to ensure experiments are 100% replicable, I completely agree that it makes more sense to use CAR files. How hard would it be to support both schemes at the same time (maybe by adding a CAR input_data parameter)?

> Also, CAR import is easier to do without a full IPFS node.

And also this! Which is a nice feature to explore low-level protocols.

In short, both options LGTM. If we end up exclusively supporting CAR files, we can always add scripts to generate CARs from your datasets with the desired chunking strategies, block sizes, etc.