ipfs / kubo

An IPFS implementation in Go
https://docs.ipfs.tech/how-to/command-line-quick-start/

ipfs files concat [ <local paths> | <cids> ] #9177

Open lidel opened 2 years ago

lidel commented 2 years ago

Documenting discussion with @ikreymer, @rangerMauve and @ribasushi

We are missing a high-level API for concatenating existing UnixFS files into bigger ones. Having it would improve deduplication when bigger archives in formats like WARC (https://webrecorder.net) consist largely of smaller files that are already on IPFS, because the existing CIDs/DAGs could be reused.

Use cases

Proposed design

Add a concat command to ipfs files that accepts two or more UnixFS-compatible DAGs and returns a CID representing the logical concatenation of all of them.

$ ipfs files concat [ /local/mfs/paths | /ipfs/cids ] 
bafy....

FAQ / Open questions

We need to agree on how to handle edge cases. Below are my initial ideas; feedback on ergonomics and potential implementation caveats is appreciated.

ribasushi commented 2 years ago

My take is: hard-error on directories, support only files and pipes. Just like /bin/cat

RangerMauve commented 2 years ago

I put together a test repo using js-unixfs to show how concat could work under the hood by building up nodes from several sub-nodes.

https://github.com/RangerMauve/js-ipfs-stitch-test/

Agreed that directories should be an error. I don't think we can cat a UnixFS tree with directories in it, so concatenating a directory in there seems like a separate use case.
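To make the "building up nodes from several sub-nodes" idea concrete, here is a minimal Go sketch. It is not the js-unixfs or Kubo code: FileNode, Concat and ErrIsDirectory are hypothetical names, and the CID strings are placeholders. It only models the part that matters for concat, a new parent whose links point at the existing roots and record how many file bytes each one covers (UnixFS file nodes carry per-link block sizes and a total file size), with directories rejected as suggested above.

```go
package main

import (
	"errors"
	"fmt"
)

// FileNode is a hypothetical, simplified stand-in for a UnixFS DAG root.
// Real UnixFS nodes are dag-pb blocks; only the fields needed to
// illustrate concatenation are modelled here.
type FileNode struct {
	CID        string      // placeholder identifier, not a real CID
	Size       uint64      // file bytes covered by this node
	IsDir      bool        // true if the node is a directory
	Links      []*FileNode // children in byte order (empty for a leaf)
	BlockSizes []uint64    // file bytes covered by each entry in Links
}

// ErrIsDirectory is returned when an input is a directory.
var ErrIsDirectory = errors.New("concat: directories are not supported")

// Concat builds a new parent node that links to the given roots in order.
// No file data is copied: the inputs are reused as-is, which is where the
// deduplication comes from.
func Concat(parts ...*FileNode) (*FileNode, error) {
	if len(parts) < 2 {
		return nil, errors.New("concat: need at least two inputs")
	}
	root := &FileNode{CID: "hypothetical-root"}
	for _, p := range parts {
		if p.IsDir {
			return nil, fmt.Errorf("%s: %w", p.CID, ErrIsDirectory)
		}
		root.Links = append(root.Links, p)
		root.BlockSizes = append(root.BlockSizes, p.Size)
		root.Size += p.Size
	}
	return root, nil
}

func main() {
	a := &FileNode{CID: "part-a", Size: 10}
	b := &FileNode{CID: "part-b", Size: 25}
	root, err := Concat(a, b)
	if err != nil {
		panic(err)
	}
	fmt.Println(root.Size, root.BlockSizes) // prints: 35 [10 25]
}
```

A real implementation would also have to compute the CID of the new parent and decide what to do when the inputs use different chunkers or DAG layouts; the sketch leaves both out.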

ikreymer commented 2 years ago

Another high-level API, which would be super useful, and essentially becomes easy to support, given the core ipfs files concat functionality, is a way to start with a single file and a list of splitpoints/offsets that you'd want to split on.

It could be a subcommand: ipfs files concat add <local path> <split points>, where <split points> is a file containing a JSON array, or one offset per line. The command would then read <local path>, add the byte ranges delimited by those offsets, and concat the results. E.g., given a 35M file and offsets [0, 10M, 25M], the command would add bytes 0-10M, 10M-25M, and 25M-35M of the file. Maybe it could support other add options too, like being able to choose trickle DAG?
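The offsets-to-ranges step in that example can be sketched in Go as follows. This is only an illustration of the arithmetic, not a proposed implementation: Range and SplitRanges are hypothetical names, "M" is assumed to mean MiB, ranges are assumed to be half-open, and the first offset is assumed to be 0 so that the whole file is covered.

```go
package main

import (
	"errors"
	"fmt"
)

// Range is a half-open byte range [Start, End) within the source file.
type Range struct {
	Start, End uint64
}

// SplitRanges turns a file size and ascending split offsets into the byte
// ranges that would each be added as a separate piece before concat.
// Assumption: offsets[0] must be 0, so the pieces cover the whole file.
func SplitRanges(size uint64, offsets []uint64) ([]Range, error) {
	if len(offsets) == 0 || offsets[0] != 0 {
		return nil, errors.New("split points must start at offset 0")
	}
	ranges := make([]Range, 0, len(offsets))
	for i, start := range offsets {
		end := size // the last piece runs to the end of the file
		if i+1 < len(offsets) {
			end = offsets[i+1]
		}
		if start >= end || end > size {
			return nil, fmt.Errorf("split point %d is out of order or past the end of the file", start)
		}
		ranges = append(ranges, Range{Start: start, End: end})
	}
	return ranges, nil
}

func main() {
	const M = 1 << 20 // assumption: "M" in the example means MiB
	ranges, err := SplitRanges(35*M, []uint64{0, 10 * M, 25 * M})
	if err != nil {
		panic(err)
	}
	for _, r := range ranges {
		fmt.Printf("%dM-%dM\n", r.Start/M, r.End/M)
	}
	// prints:
	// 0M-10M
	// 10M-25M
	// 25M-35M
}
```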

Maybe there are two subcommands: ipfs files concat add <local path> <split points>, and ipfs files concat merge [ <local paths> | <cids> ] for when the split pieces already exist as individual files or have already been added as CIDs.

This just adds a common first step that would often be needed before using ipfs files concat

ribasushi commented 2 years ago

@ikreymer too complex. You'd simply:

ipfs files concat yourfile:0:20 yourfile:21:40 yourfile:41:
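For illustration, arguments of the form path:start:end could be parsed along these lines. This is a sketch under assumptions, not an agreed syntax: Slice and ParseSlice are hypothetical names, the fields are split from the right so that paths containing colons still work, and whether the end offset is inclusive is left open because the thread does not specify it.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// Slice describes one "path:start:end" argument. HasEnd is false when the
// end is omitted (as in "yourfile:41:"), meaning "to the end of the file".
type Slice struct {
	Path   string
	Start  uint64
	End    uint64
	HasEnd bool
}

// ParseSlice parses "path:start:end" or "path:start:".
func ParseSlice(arg string) (Slice, error) {
	i := strings.LastIndex(arg, ":")
	if i < 0 {
		return Slice{}, errors.New("expected path:start:end")
	}
	endStr, rest := arg[i+1:], arg[:i]
	j := strings.LastIndex(rest, ":")
	if j <= 0 { // no second colon, or empty path
		return Slice{}, errors.New("expected path:start:end")
	}
	start, err := strconv.ParseUint(rest[j+1:], 10, 64)
	if err != nil {
		return Slice{}, fmt.Errorf("bad start offset: %w", err)
	}
	s := Slice{Path: rest[:j], Start: start}
	if endStr != "" {
		end, err := strconv.ParseUint(endStr, 10, 64)
		if err != nil {
			return Slice{}, fmt.Errorf("bad end offset: %w", err)
		}
		if end < start {
			return Slice{}, errors.New("end offset is before start offset")
		}
		s.End, s.HasEnd = end, true
	}
	return s, nil
}

func main() {
	for _, arg := range []string{"yourfile:0:20", "yourfile:21:40", "yourfile:41:"} {
		s, err := ParseSlice(arg)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%+v\n", s)
	}
}
```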

ikreymer commented 2 years ago

> @ikreymer too complex. You'd simply:
>
> ipfs files concat yourfile:0:20 yourfile:21:40 yourfile:41:

Yeah, I guess I could live with that. I was just thinking the separate split file makes for an easier user API, especially if this is to be supported in libraries as well as the CLI, and when dealing with maybe hundreds of split points.

ikreymer commented 2 years ago

I've implemented a small library in JS that includes concat as well as some related utilities that are useful for the web archiving use case: https://github.com/webrecorder/ipfs-composite-files

anjor commented 1 year ago

Wrote something in Go: https://github.com/anjor/unixfs-cat/blob/main/unixfs_cat.go

Happy to work more on it if it's useful/along the lines of the thinking here.