feature: gzip multi-member dependent chunker / importer, warc, tar

donothesitate commented 7 years ago

Version information:

go-ipfs version: 0.4.4

Type: Feature, Enhancement

Priority: P4

Area: Tools, Importer

Description:

As with WARC files, gzip files support multiple members, which makes it possible to stitch large files together from smaller ones by mere concatenation.
This makes it possible to compress the metadata and each record separately, concatenate them into a single file, and then do partial fetches and decompression, including via HTTP Range requests.
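To illustrate the property described above, here is a minimal Python sketch (not go-ipfs code). The records and helper names are made up for the example. It compresses each record as its own gzip member, concatenates them, and then decompresses either the whole file or a single member fetched by byte range.

```python
import gzip

def make_members(records):
    """Compress each record as its own, independent gzip member."""
    return [gzip.compress(r) for r in records]

def member_ranges(members):
    """Return the (start, end) byte range of each member in the concatenation."""
    ranges, pos = [], 0
    for m in members:
        ranges.append((pos, pos + len(m)))
        pos += len(m)
    return ranges

# Hypothetical records: one metadata blob followed by two data records.
records = [b"metadata\n", b"record one\n", b"record two\n"]
members = make_members(records)
combined = b"".join(members)      # plain concatenation is a valid gzip file
ranges = member_ranges(members)

# Decompressing the whole file yields the concatenated records.
whole = gzip.decompress(combined)

# A partial fetch (e.g. an HTTP Range request for bytes start..end-1)
# is enough to decompress one record on its own.
start, end = ranges[1]
second = gzip.decompress(combined[start:end])
```

The byte ranges have to be recorded when the file is built (or recovered by scanning it), because the gzip header does not store the compressed length of a member.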

By having the static chunker also split at gzip member boundaries, one can construct .tar.gz files, or .tar of .gz files, and all sorts of derived data sets, without duplication.

There are two ways to approach this:

a) The chunker works as usual, but additionally splits a block at each member boundary (the result maps 1:1 onto the usual chunking, except that one block per member is replaced by the two pieces it is split into).

b) The chunker works as usual, but when it encounters a gzip member boundary it ends the current block early and starts the new member in its own 256k data block (this shifts all later blocks and hence duplicates data; probably not the way to do it).
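A rough Python sketch of the two options, to make the difference concrete. This is not the go-ipfs chunker; all function names are invented for the example, and member boundaries are found by inflating each member in turn, which is one possible way to do it.

```python
import zlib

def gzip_member_boundaries(data):
    """Return the offsets at which every gzip member after the first starts."""
    boundaries, pos = [], 0
    while pos < len(data):
        d = zlib.decompressobj(31)   # wbits=31: expect gzip header and trailer
        d.decompress(data[pos:])     # inflate exactly one member
        if not d.eof:
            raise ValueError("truncated gzip member")
        # Whatever follows the member's trailer is left in unused_data.
        pos = len(data) - len(d.unused_data)
        if pos < len(data):
            boundaries.append(pos)
    return boundaries

def cuts_keep_grid(total, boundaries, size):
    """Option (a): keep the fixed-size grid, add extra cuts at boundaries."""
    grid = set(range(size, total, size))
    extra = {b for b in boundaries if 0 < b < total}
    return sorted(grid | extra)

def cuts_restart_grid(total, boundaries, size):
    """Option (b): restart the fixed-size grid at every boundary.

    Assumes `boundaries` is sorted in ascending order.
    """
    starts = [0] + [b for b in boundaries if 0 < b < total]
    ends = starts[1:] + [total]
    cuts = []
    for s, e in zip(starts, ends):
        if s > 0:
            cuts.append(s)
        cuts.extend(range(s + size, e, size))
    return cuts

def split(data, cuts):
    """Cut data into blocks at the given offsets."""
    edges = [0] + list(cuts) + [len(data)]
    return [data[a:b] for a, b in zip(edges, edges[1:])]

# Toy numbers: a 1000-byte file, one member boundary at 300, 256-byte blocks.
print(cuts_keep_grid(1000, [300], 256))     # [256, 300, 512, 768]
print(cuts_restart_grid(1000, [300], 256))  # [256, 300, 556, 812]
```

With option (a) the cuts at 512 and 768 are the same as for a plain fixed-size chunker, so only the block containing the boundary changes. With option (b) every cut after the boundary moves.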

This should work for all gzip files, tar files, and more.

whyrusleeping commented 7 years ago

Might be cool to start an ipfs/importers repo where we can collect ideas like this

bqv commented 3 years ago

This never went anywhere, did it

lidel commented 1 year ago

If anyone wants a userland solution for ZIP archives, @ikreymer did some related work in https://github.com/webrecorder/ipfs-composite-files (for WARC, but the approach works for regular ZIPs too).

That being said, I like the generalization proposed here, to make Kubo's ipfs add smarter. Kubo could detect gzip / ZIP archives / TAR streams and use custom chunkers for known formats.

For example, ZIPs start with the same magic bytes (0x50, 0x4b, 0x03, 0x04 ← https://en.wikipedia.org/wiki/List_of_file_signatures).
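A minimal sketch of what such detection could look like, in Python for brevity. The function name and return values are invented here and this is not an existing Kubo API. It checks the ZIP signature quoted above, the two-byte gzip signature (1f 8b), and the "ustar" marker that POSIX tar headers carry at offset 257.

```python
def sniff_format(head):
    """Guess a container format from the first 512 bytes of a file.

    Returns "gzip", "zip", "tar", or None if nothing matches.
    """
    if head[:2] == b"\x1f\x8b":
        return "gzip"
    if head[:4] == b"PK\x03\x04":       # 0x50 0x4b 0x03 0x04, local file header
        return "zip"
    if head[257:262] == b"ustar":       # magic field of a ustar/pax/GNU tar header
        return "tar"
    return None
```

This only covers the common cases. An empty ZIP archive, for example, starts with a different signature, and very old tar files carry no magic at all, so a real importer would need a fallback to the default chunker.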

ikreymer commented 1 year ago

For gzip, it would be 1f 8b 08, I think.

Yep, the library is designed to be fairly generic. The tests use WARC/WACZ/web archive data, but the commands are all generic and should work with any UnixFS directories and files, and the in-place ZIP should work with any ZIP file.

I do like the idea of detecting file types automatically, rather than having to provide pre-determined split points as we're doing here. For our use case, we would probably keep the pre-computed split offsets file, as we already have that, but I'm happy to support/work with more generic multi-member gzip splitting efforts.
