ipld / go-car

A content addressable archive utility

Creation of CARv2 from directory instead of file #475

Status: Closed (tssa88 closed this issue 10 months ago)

tssa88 commented 10 months ago

Hi guys! This is more of a question:

What is the correct way to create a CARv2 file from a directory instead of a file?

Not sure if I missed the docs, but I couldn't find it.

Thanks in advance!

rvagg commented 10 months ago

You probably didn't miss it in the docs, that's something that needs work! What you're looking for is car create -f output.car path/to/pack/:

rvagg@fletcher /tmp$ mkdir demo
rvagg@fletcher /tmp$ echo "one" > demo/1
rvagg@fletcher /tmp$ echo "two" > demo/2
rvagg@fletcher /tmp$ echo "three" > demo/3
rvagg@fletcher /tmp$ mkdir demo/boop
rvagg@fletcher /tmp$ echo "bop" > demo/boop/bop
rvagg@fletcher /tmp$ car create -f demo.car ./demo/
rvagg@fletcher /tmp$ car ls demo.car
bafkreibmrmenuxhgaomod4m26ds5ztdujxzhjobgvpsyl2v2ndcskq2iay
bafkreibh3whnisud76knkv7z7ucbf3k2rs6knhvajernrdabdbfaomakli
bafkreihwsnuregceqh263vgdathcprnbvatyat6h6mu7ipjhhodcdbyhoy
bafkreiewormbrscomx46wfh7sqqiljugolo74g3cxmeu5bbuolq2g5eak4
bafybeiakxswmjdngjud4qgxuoysdcgkveazrh32yvy6xib6h2cyxyt7q5a
bafybeifcgnlpk6lpg6hgk2an6pobfibczctwwfqbgtdrq3hagrecqbqvsu
bafybeievdzq3d3xge5f6xlqfss6xqyjqwaixqcvezzrhkh4tohz3vpr5di
rvagg@fletcher /tmp$ mkdir extract
rvagg@fletcher /tmp$ cd extract
rvagg@fletcher /tmp/extract$ car extract -f ../demo.car
extracted 4 file(s)
rvagg@fletcher /tmp/extract$ find .
.
./demo
./demo/1
./demo/2
./demo/3
./demo/boop
./demo/boop/bop

By default it makes a CARv2 (although I think we should switch that to a v1); you can use --version to change the version it uses.

In general you should only need to make a CARv2 if you're using this locally as a blockstore. If you're planning on using this for archival or sharing, just make a CARv1. It's more compact, and the v2 additions will likely need to be stripped when the file is consumed by someone else anyway, because an index supplied by a third party generally shouldn't be trusted.
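To check which version a given file actually is, the first few bytes are enough. The following is a minimal sketch, not part of go-car: it assumes the fixed 11-byte CARv2 pragma, and for CARv1 it relies on the heuristic that a canonically encoded header ends with the "version": 1 entry. The names sniff_car_version and read_varint are hypothetical helpers introduced for illustration.

```python
# CARv2 files start with this fixed 11-byte pragma: a varint length (0x0a)
# followed by the CBOR map {"version": 2}.
CARV2_PRAGMA = bytes.fromhex("0aa16776657273696f6e02")

# CBOR encoding of the text key "version" followed by the integer 1.
CARV1_VERSION_FIELD = bytes.fromhex("6776657273696f6e01")


def read_varint(data, pos=0):
    """Decode an unsigned LEB128 varint; return (value, next_position)."""
    value = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7


def sniff_car_version(head):
    """Guess the CAR version from the first bytes of a file (hypothetical helper)."""
    if head.startswith(CARV2_PRAGMA):
        return 2
    header_len, start = read_varint(head)
    header = head[start:start + header_len]
    # Heuristic (assumption): a canonical CARv1 header ends with "version": 1.
    if len(header) == header_len and header.endswith(CARV1_VERSION_FIELD):
        return 1
    raise ValueError("does not look like a CARv1 or CARv2 header")
```

A typical call would read the start of the file, for example sniff_car_version(open("demo.car", "rb").read(1024)). For an authoritative answer, car inspect <file> is the better tool.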

tssa88 commented 10 months ago

@rvagg Thanks for such a complete answer! Noob question: What's the diff between using this CLI tool and the https://github.com/web3-storage/ipfs-car CLI tool? In addition, is there a way to specify CID version during CAR creation?

rvagg commented 10 months ago

@tssa88 hopefully not much difference at all. One is Go, the other is JS, but they should be doing roughly the same thing. ipfs-car is (probably) going to give you CARv1 files (good), and they may not be exactly the same as the ones produced here. That can happen because of differences in UnixFS packing defaults: details like the way large files are chunked, or the way large directories are packed.

The choice shouldn't matter too much.

It's possible that go-car is the better choice when packing particularly large directories, for performance and resource reasons, but you may want to compare times and output between them.

car inspect <file> with go-car might be interesting to run on the output of both; it gives you basic information about what's in the CAR.
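Why chunking defaults change the resulting CIDs can be shown with a toy example. This is only an illustration, assuming fixed-size chunking and plain SHA-256 digests. It is not the real UnixFS encoding, and the chunk sizes and function names are made up for the demonstration.

```python
import hashlib


def leaf_digests(data, chunk_size):
    """Split data into fixed-size chunks and hash each chunk with SHA-256."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    return [hashlib.sha256(chunk).hexdigest() for chunk in chunks]


def toy_root(data, chunk_size):
    """Hash the concatenated leaf digests as a stand-in for a root identifier."""
    joined = "".join(leaf_digests(data, chunk_size)).encode("ascii")
    return hashlib.sha256(joined).hexdigest()


payload = bytes(range(256)) * 16  # 4096 bytes of sample data

# Same bytes, two hypothetical chunk sizes.
small = leaf_digests(payload, 1024)
large = leaf_digests(payload, 2048)

print(len(small), len(large))  # 4 2
print(toy_root(payload, 1024) == toy_root(payload, 2048))  # False
```

The input bytes are identical in both runs, yet the leaves and the root differ. The same effect is why two packers with different defaults can produce different root CIDs for the same directory.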

tssa88 commented 10 months ago

@rvagg TY! And how about the CID version? Is there a way to specify it during CAR creation and set if it's a V0 or V1?

rvagg commented 10 months ago

Good question. I don't think either library currently offers a way to switch from the default, which should be CIDv1 these days. At least go-car defaults to CIDv1, and the settings are currently hard-coded @ https://github.com/ipfs/go-unixfsnode/blob/7db26c0c9ffb19a4aba4299e0e2e4425d5598633/data/builder/file.go#L60-L76

ipfs-car in theory has the ability to pass through some adjustments to the defaults as they are configurable @ https://github.com/ipld/js-unixfs/blob/e7b9ad5510c32e69dc9a6f60d49c6abe084048fe/src/file.js#L22-L23

Do you need CIDv0? Most of our tooling these days is defaulting to CIDv1, even Kubo. Making an option for it in go-car is do-able, but it's a matter of prioritisation--if there's a good use-case for it then maybe we could make it happen.

tssa88 commented 10 months ago

@rvagg Great! TY!

And regarding unpacking CARV1/V2 files. Are there any docs/examples on how to unpack them?

rvagg commented 10 months ago

car extract -f <car file> is the main way you'd do this, it assumes it's going to find file data. There's also a --path that you can use to be specific about what you want to extract if you don't want to extract everything. There's not much else to it, it's one of those utilities that could be better but is waiting for users with specific needs. Recently it grew the ability to accept stdin and even spit a single file to stdout so you can use it streaming (e.g. with https://github.com/filecoin-project/lassie - fetch content, stream to stdout to go-car, extract to stdout and stream to somewhere else, such as a video player). It'll work with v1 or v2 files without noticing the distinction.

tssa88 commented 10 months ago

@rvagg QQ.. For instance, I tried to extract this CAR file and got this error:

car extract -f /private/tmp/ipfs/bafkreicgytal6uyhwtt57rjksxotfzqqljriwcgfrq2xdiv6ykactslu3y.car
no files extracted

Doing the same using ipfs-car works seamlessly.

ipfs-car unpack /private/tmp/ipfs/bafkreicgytal6uyhwtt57rjksxotfzqqljriwcgfrq2xdiv6ykactslu3y.car --output /private/tmp/ipfs/test

Do you know why it happens? Any thoughts?

rvagg commented 10 months ago

So bafkrei... is a "raw" block: there's nothing in there that tells us anything about what it is other than that it's a block of raw bytes, so it's unclear what to do with it. That's why it's a good idea to "wrap" files in a directory when packing them (car create will do this, but also has a --no-wrap to skip it). We could make go-car extract these for you naively, but it's not something we've opted to do at the moment; instead we prefer explicit UnixFS content to be present. Remember that CARs don't always contain file data, and extract currently only makes sense for file data. We could perhaps make it restrictive, so that it only works if the CAR contains a single raw block and you provide an --output, which I imagine is what ipfs-car is doing.

go-car can still get it for you though: car get-block <file.car> <block cid> [output file] will fetch individual blocks and dump the data into a file, just supply your CAR path and the bafkrei... (raw) CID and it should do exactly what you want.

tssa88 commented 9 months ago

Thanks @rvagg. I can confirm that using car get-block <file.car> <block cid> [output file] does work. However, there is another scenario I could not find a solution for:

I'm getting this error when I try to run extract for another CAR file:

car extract -f /tmp/bafyreihhdxf6vr5vembqausvrsukarksvms3vob5jzj7kgwe76rurjp7he.car
2023/09/14 12:16:01 invalid key for map dagpb.PBNode: "name": no such field

The same file also errors when using ipfs-car: Unsupported UnixFS type object for path: bafyreihhdxf6vr5vembqausvrsukarksvms3vob5jzj7kgwe76rurjp7he

Any thoughts?

rvagg commented 9 months ago

Yeah, bafyreihhdxf6vr5vembqausvrsukarksvms3vob5jzj7kgwe76rurjp7he is a dag-cbor block, which can't contain unixfs data, so extract won't work for it. If you want to inspect it, you could use car debug, which converts everything in there to pretty-printed dag-json so it's human-readable. Beware that it's not actually very pretty and gets unwieldy with large amounts of data (although I find myself using it a lot, even with large data!). Try car debug /tmp/bafyreihhdxf6vr5vembqausvrsukarksvms3vob5jzj7kgwe76rurjp7he.car | less, have a look inside, and see if you can make sense of it.

It may be one of these cases where there's a dag-cbor block that is just there at the root to link multiple unixfs chunks together, so that bafyreihhdxf6vr5vembqausvrsukarksvms3vob5jzj7kgwe76rurjp7he block could just have links in it and those links are the pieces you want. If they are raw blocks, then you could car get-block that block directly.

Currently there's no way to car extract where the root of the CAR isn't what you want, but I do have a local branch where I added a --root to car extract so I could specify my own in the CAR and extract from there. So in theory this dag-cbor block wouldn't get in the way, you just figure out what your actual data root should be and extract from that. But you're getting into the weeds with this and maybe you're better off building tooling to work with the specific data you're dealing with (if this isn't a one-off). Or maybe figure out where the data is coming from and see if the producer has tooling to deal with it?

rvagg commented 9 months ago

btw https://cid.ipfs.tech/ is a great tool for inspecting CIDs and figuring out what they are; if you put in bafyreihhdxf6vr5vembqausvrsukarksvms3vob5jzj7kgwe76rurjp7he you'll see it's a dag-cbor block. Although you'd also have to know that dag-cbor can't contain unixfs data, which may not be common knowledge! dag-pb is the codec that's designed for containing unixfs data, and raw blocks are the ones you can car get-block straight out.
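The codec can also be read straight out of the CID string. Here is a minimal sketch, assuming a base32 CIDv1 (the strings starting with "b") and covering only the three codecs discussed in this thread. The function names and the SUGGESTIONS table are hypothetical and simply restate the advice given above; a dag-pb CID is not guaranteed to hold UnixFS data.

```python
import base64

# Multicodec codes for the three codecs that come up in this thread.
CODECS = {0x55: "raw", 0x70: "dag-pb", 0x71: "dag-cbor"}

# What the thread suggests trying for each codec (a summary, not go-car behaviour).
SUGGESTIONS = {
    "raw": "car get-block <file.car> <block cid> [output file]",
    "dag-pb": "car extract -f <car file>",
    "dag-cbor": "car debug <car file> | less",
}


def read_varint(data, pos):
    """Decode an unsigned LEB128 varint; return (value, next_position)."""
    value = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7


def cid_codec(cid):
    """Return the codec name of a base32 CIDv1 string."""
    if not cid.startswith("b"):
        raise ValueError("only base32 (prefix 'b') CIDv1 strings are handled")
    body = cid[1:].upper()
    body += "=" * (-len(body) % 8)  # restore the padding that multibase omits
    raw = base64.b32decode(body)
    version, pos = read_varint(raw, 0)
    if version != 1:
        raise ValueError("not a CIDv1")
    codec, _ = read_varint(raw, pos)
    return CODECS.get(codec, hex(codec))


for cid in (
    "bafkreicgytal6uyhwtt57rjksxotfzqqljriwcgfrq2xdiv6ykactslu3y",
    "bafyreihhdxf6vr5vembqausvrsukarksvms3vob5jzj7kgwe76rurjp7he",
):
    name = cid_codec(cid)
    print(name, "->", SUGGESTIONS.get(name, "no suggestion"))
```

In practice the prefix alone is a useful shortcut for sha2-256 CIDs: bafkrei... is raw, bafybei... is dag-pb, and bafyrei... is dag-cbor.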