ipfs / notes

IPFS Collaborative Notebook for Research

Data compression #324

Open orivej opened 9 years ago

orivej commented 9 years ago

Consider the following design space. Some nodes on the network are used for storage: they store and serve compressed data to optimize network throughput and storage efficiency. Other nodes are used for computation on the data and are space constrained (e.g. they must minimize SSD usage), so they store only uncompressed files; but as long as their data is unmodified, they can also seed it to nearby nodes.

To support both compressed and uncompressed seeders, to present compressed data to applications as if it were uncompressed, or to store and serve already-seeded uncompressed data in a compressed form, it seems worthwhile to support data compression below the application level, in IPFS itself. This suggests supporting storage and transmission of IPFSObjects with an additional field specifying the method of decompression, whose hash is computed as if the data were not compressed and the field were not present. In a way, this resembles squashfs, which deflates data and then presents it as a normal read-only filesystem.

Another option is to introduce a separate IPFSCompressedObject, advertise it under both its own hash and the uncompressed hash, and, when another node requests it under the uncompressed hash, negotiate either to serve the compressed object or to decompress it prior to transmission.

What do you think about this problem? Can the solution be simplified?
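Roughly, the first option could look like this sketch (hypothetical names, nothing that exists in IPFS today): the payload may be stored compressed, but the hash is always taken over the decompressed bytes, so the compression field never changes the object's identity.

```go
package sketch

import (
	"bytes"
	"compress/gzip"
	"crypto/sha256"
	"fmt"
	"io"
)

// StoredObject is a hypothetical on-disk representation: the payload may be
// kept compressed, and Compression names the method needed to undo it.
type StoredObject struct {
	Data        []byte // payload as stored on this node (possibly compressed)
	Compression string // "" means stored uncompressed; here only "gzip" is known
}

// Plain returns the uncompressed payload, decompressing on demand.
func (o *StoredObject) Plain() ([]byte, error) {
	switch o.Compression {
	case "":
		return o.Data, nil
	case "gzip":
		r, err := gzip.NewReader(bytes.NewReader(o.Data))
		if err != nil {
			return nil, err
		}
		defer r.Close()
		return io.ReadAll(r)
	default:
		return nil, fmt.Errorf("unknown compression %q", o.Compression)
	}
}

// Hash is computed as if the data were not compressed and the field were not
// present, so compressed and uncompressed replicas share one identity.
func (o *StoredObject) Hash() ([32]byte, error) {
	plain, err := o.Plain()
	if err != nil {
		return [32]byte{}, err
	}
	return sha256.Sum256(plain), nil
}
```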

jbenet commented 9 years ago

@orivej yes, these thoughts are on the right track. I'll expand more on this in the next couple of weeks -- but consider also program as compression. imagine a simple language embedded in the data segment of an object. let the "final data" of an object be the output of running the (pure) program embedded in the data segment. (e.g. in a unixfs file, the output is the concatenation of the data segments of all children, in order). This allows us to represent final objects as programs, perhaps getting us closer to the kolmogorov complexity of the data.
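a toy sketch of that framing (made-up types, not the real unixfs code) -- the "program" here is just the implicit rule "concatenate the children in order", but the same evaluation hook is where a richer embedded language could run:

```go
package sketch

// Node is a toy stand-in for a DAG object: a data segment plus ordered links.
type Node struct {
	Data  []byte
	Links []*Node
}

// FinalData evaluates the implicit unixfs-style "program": a leaf yields its
// own data segment, an inner node yields the concatenation of its children.
func FinalData(n *Node) []byte {
	if len(n.Links) == 0 {
		return n.Data
	}
	var out []byte
	for _, child := range n.Links {
		out = append(out, FinalData(child)...)
	}
	return out
}
```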

ion1 commented 8 years ago

Given an object with links to other objects and a representation of a program that uses those objects as input and whose output is the data the object represents, we could also do other neat things.

We could have normalizers for various data formats (see https://wiki.debian.org/ReproducibleBuilds/ExperimentalToolchain#strip-nondeterminism) which also output a program that takes byte ranges from the normalized version, applies possible transformations (decompression, binary diff etc.) to them and concatenates the results to recreate the original file (see https://joeyh.name/code/pristine-tar/).

For example, when user A and user B add a PNG image with the same pixels but the headers in a different order, both users would end up with the same large object for a normalized version of the image and different small objects transforming the normalized version back to what they added.
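As a rough sketch of the recipe idea (invented names, not pristine-tar's actual format): both users store the same large normalized blob, and each keeps a small recipe of copy/literal steps that rebuilds their original byte-for-byte.

```go
package sketch

// Step is one instruction of a reconstruction recipe: either emit literal
// bytes (e.g. the reordered PNG headers) or copy a range of the normalized blob.
type Step struct {
	Literal []byte // emitted verbatim when non-nil
	Offset  int    // otherwise copy Length bytes starting at normalized[Offset]
	Length  int
}

// Reconstruct replays a recipe against the shared normalized blob to recover
// the exact original file.
func Reconstruct(normalized []byte, recipe []Step) []byte {
	var out []byte
	for _, s := range recipe {
		if s.Literal != nil {
			out = append(out, s.Literal...)
			continue
		}
		out = append(out, normalized[s.Offset:s.Offset+s.Length]...)
	}
	return out
}
```

A real scheme would also apply transformations such as decompression or a binary diff per step, but the shape is the same: one shared object plus a tiny per-original program.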

The language for the program must be pure (no nondeterminism or arbitrary IO) and strongly normalizing (no arbitrary recursion, no halting problem) and it must have static guarantees about memory consumption etc. for it to be safe.

jbenet commented 8 years ago

Yep! see also

davidar commented 8 years ago

@ion1 :+1:

> Given an object with links to other objects and a representation of a program that uses those objects as input and whose output is the data the object represents, we could also do other neat things.

> The language for the program must be pure (no nondeterminism or arbitrary IO) and strongly normalizing (no arbitrary recursion, no halting problem) and it must have static guarantees about memory consumption etc. for it to be safe.

@Gabriel439 could Morte/Annah be used for this?

> and i talk about this in this talk https://youtu.be/h73bd9b5pPA?t=4689 ~ 1:18:00-1:27:00

@jbenet I like the way you think :)

ion1 commented 8 years ago

@jbenet

> and i talk about this in this talk https://youtu.be/h73bd9b5pPA?t=4689 ~ 1:18:00-1:27:00

Very nice, but I don't think we can make it quite Turing complete. As long as the language has a construct for arbitrary recursion, someone can give me an object with an arbitrary hash and I'll wait up to the heat death of the universe trying to check whether its output matches the hash.

However, a program in such a Turing-incomplete language can always output a program in an arbitrary language. As long as I can safely “ipfs get” an object outputting an unsafe program, I can choose to execute the unsafe program and have it launch the missiles in a non-halting loop.

davidar commented 8 years ago

@ion1 Morte guarantees termination by disallowing recursion, but provides some tricks for translating recursive programs into non-recursive ones so that it's not too much of a limitation in practice.

Of course, you could also just place a hard time limit on executing computable objects.
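Something like this sketch (illustrative only, not an existing IPFS API), where evaluation of a computable object is raced against a deadline:

```go
package sketch

import (
	"context"
	"errors"
	"time"
)

// EvalWithTimeout runs an object's program but refuses to wait past limit.
// Note that the goroutine itself is not killed; a real interpreter would need
// cooperative cancellation or fuel/step counting.
func EvalWithTimeout(ctx context.Context, run func() ([]byte, error), limit time.Duration) ([]byte, error) {
	ctx, cancel := context.WithTimeout(ctx, limit)
	defer cancel()

	type result struct {
		data []byte
		err  error
	}
	ch := make(chan result, 1)
	go func() {
		data, err := run()
		ch <- result{data, err}
	}()

	select {
	case r := <-ch:
		return r.data, r.err
	case <-ctx.Done():
		return nil, errors.New("computable object exceeded its time limit")
	}
}
```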

ion1 commented 8 years ago

I would really like the property that having downloaded a block and its dependencies, I am guaranteed to get the data out.

cryptix commented 8 years ago

> As long as the language has a construct for arbitrary recursion, someone can give me an object with an arbitrary hash ...

Maybe I'm missing something, but a valid IPFS object should have no cycles. A fetcher that retrieves all links first should see that mismatch quickly and error out.

ion1 commented 8 years ago

Given an object that promises to generate the data for the hash Qmaaaaaaaaaa… programmatically, how can you tell you got the correct object without running the program?

Gabriella439 commented 8 years ago

Yes, this is the use case for morte and annah, providing a high quality intermediate language for distributing and storing code fragments:

morte is basically production ready. annah is still very experimental. The desugaring of recursive types is complete but decisions like how to encode strings and numbers are still in flux so it's not really ready for prime time just yet.

Using annah is not strictly necessary; it's purely syntactic sugar for morte code. I can step you through how to encode all your relevant datatypes and functions by hand, without annah, for your specific use case if you can explain exactly what you want the code to be able to do.

Relevant resources:

ion1 commented 8 years ago

:+1:

Have you thought about a binary representation for Morte code?

Gabriella439 commented 8 years ago

Yes. Right now the binary representation is pretty naive, and there is a lot of room for compressing it further (by using bit-level encodings and symbol tables).

You can also compress things further by baking the types of external primitives into the context that you use to type-check your expressions.

ion1 commented 8 years ago

I don’t know how familiar you are with IPFS, but it would be a great fit, since IPFS represents everything as a distributed content-addressed DAG where objects can have links to other objects (by their hash) as well as a data payload. A Morte program in an IPFS object could list its direct dependencies as links, and IPFS would handle getting all the transitive dependencies for you. You cannot even construct a cycle.

jbenet commented 8 years ago

@ion1

> Very nice, but I don't think we can make it quite Turing complete. As long as the language has a construct for arbitrary recursion, someone can give me an object with an arbitrary hash and I'll wait up to the heat death of the universe trying to check whether its output matches the hash.

Hence bounded computation. turns out there's very good heuristics for this. javascript is turing complete and unbounded computation there is not really a problem.

> Given an object that promises to generate the data for the hash Qmaaaaaaaaaa… programmatically, how can you tell you got the correct object without running the program?

no, the program's hash would be Qmaaaaaaaaaa.... the data representation would be a different hash.


@Gabriel439 this is great, thanks for the links. will look deeper this weekend. i think there's a great marriage for ideas here that we should explore. and we still have some flexibility to fit better models. (thanks @davidar for bringing it up!)

Gabriella439 commented 8 years ago

Yeah, I actually built Morte in such a way that you can swap out Morte's default dependency resolver with your own. Morte's default resolver also has a similar no-cycles restriction.

However, I highly suspect that the compressibility of Morte expressions will make or break its suitability for this purpose because right now the data type encodings have very high constant factors over more specialized encodings. I'm going to try running some basic compression algorithms over the serialized expressions and if that does not work then I will try more specialized encodings. In the worst case scenario I could just fork the core language for IPFS to provide built-in support for efficient data types.

davidar commented 8 years ago

> Hence bounded computation. turns out there's very good heuristics for this. javascript is turing complete and unbounded computation there is not really a problem.

@jbenet One could argue that JS is also bounded due to the "unresponsive script" warnings that require user intervention. On the flip side of the coin, if you include periodic user intervention as a way to allow unbounded computation, then otherwise bounded languages can be Turing-complete.

Gabriella439 commented 8 years ago

I think it's important to point out that the main benefit of total programming is not to actually restrict the amount of time it takes to run. After all, a finite computation could still take longer than the age of the universe to run, and I can very easily write such a computation in a small amount of code, even within a total programming language.

The real purpose of a total programming language is to eliminate programming errors. Just requiring that the computation eventually finishes rules out all sorts of silly little mistakes that you would not otherwise have thought of in a non-total language. The zen behind total programming is to eliminate every failure from your program so that it's completely unbreakable.

So in that sense, having an "unresponsive script" dialog doesn't provide the same benefit as total programming because you're still being permissive about programming errors and all you're doing is transforming them into failed programs. Failed programs are better than programs that hang, but they still failed and therefore they were still useless! An even better solution to the problem is to not fail in the first place, which is the goal behind total programming.

Admin-DataRoads commented 6 years ago

I have actually been thinking about this problem of deterministic transforms as a potential advanced mode of IPFS deduplication and compression.

For example, when the same file is compressed by LZW vs. a dictionary method, the result is two different hashes linked back to the same uncompressed file hash. Further, let's say that the dictionary compression method utilized a pre-shared dictionary à la SDCH, or a separate VCDIFF delta file. Then technically you're transforming two files (as inputs), taken as a set, back to an original uncompressed file (as output) in a deterministic and somewhat reversible fashion: compressed + [dictionary|delta] = uncompressed.

Tracking these different file hashes as related by some deterministic transform function (via IPNS, IPLD, or another metadata system) then allows you to send or store different pieces opportunistically based on current server and client state, and to further deduplicate against previously stored pieces on each end. Each peer pair can negotiate both the exchange and the storage blocks based on the most efficient wire and storage formats available between them. This is similar to how Mercurial or Subversion opportunistically sends compressed xdeltas over the wire instead of full files whenever both the sender and receiver already store a different version of the same file or branch.
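As a hedged sketch, such a transform record might look something like this (illustrative field names, not an existing IPLD schema):

```go
package sketch

// TransformRecord relates content hashes through a deterministic transform:
// applying Function to the Inputs reproduces Output exactly.
type TransformRecord struct {
	Function string   // e.g. "vcdiff-apply" or "lzw-decompress" (illustrative names)
	Inputs   []string // hashes of the compressed blob plus any shared dictionary/delta
	Output   string   // hash of the uncompressed file the inputs reproduce
}
```

With such records in a metadata layer, a peer pair could decide per transfer whether to ship the compressed blob alone (when the receiver already holds the dictionary), the dictionary alone, or the plain output as a last resort.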

An odd thing about VCDIFF-style shared dictionaries is that they can have their own delta-compressed associations with other dictionaries. One example is a general HTTP dictionary that is a superset of many more specific HTTP file types (e.g. UTF-8, JSON, CSS, XML, HTML, XHTML, SVG, etc.), which in turn are each a union of many subsets (e.g. different JSON schemas), leading to a DAG of VCDIFF or merge transforms connecting dictionary entries from specific to more general encoding types. This dictionary-of-dictionaries property leads me to think that maybe part of the multiformats and CID effort should be to define a multidict encoding type, separate from but related to multicodec, wherein different standardized pre-shared compression dictionaries (e.g. SDCH) can be coordinated between IPFS clusters to avoid inefficient resends of dictionaries and uncompressed files.

Depending on these dictionaries' encodings, an analysis of all linked uncompressed file hashes in the metadata could also serve as an index from individual dictionary table entries into the files that contain one or more instances of them, even utilizing Levenshtein distance as both another delta-compression connection and a search metric. In other words, the dictionary-entry graph could roughly be inverted to form a common-phrase search graph (knowledge compression, so to speak).

Please let me know if any of this is unclear, or if this discussion tangent needs to be moved elsewhere.

fwip commented 6 years ago

For my use case, I'd love to have filters when storing and retrieving files, like git clean / git smudge.

In the bioinformatics field, there are a lot of standard file formats that nearly every tool can read/write without issue. However, these are rarely the most compact representations of that data. There exist specialized tools to losslessly compress this data (taking advantage of the structure inherent or common in the file-format, beyond what general compressors can provide). I would like to run a "storage node" that transparently compresses input files/blocks for storage, and serves the 'regular' versions back over the wire.

IPFS could run the filters on the data: compressing, decompressing, and verifying the hash, before storing the information. I don't have a strong preference for how the filter is invoked, whether it's manual, based on some heuristics, or an exhaustive search of user-defined filters.
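A rough sketch of what I mean, assuming user-supplied filter functions (nothing here is an existing IPFS interface): the block's identity is always the hash of the plain bytes, the filtered form is a private storage detail of the node, and a round-trip check rejects filters that aren't lossless.

```go
package sketch

import (
	"bytes"
	"crypto/sha256"
	"errors"
)

// Filter pairs a "clean" step applied before writing to disk with a "smudge"
// step applied before serving the block back over the wire.
type Filter struct {
	Clean  func(plain []byte) ([]byte, error)
	Smudge func(stored []byte) ([]byte, error)
}

// Store filters a block for local storage while keying it by the hash of the
// plain bytes, and verifies that the filter round-trips losslessly.
func Store(f Filter, plain []byte) (key [32]byte, stored []byte, err error) {
	key = sha256.Sum256(plain)
	stored, err = f.Clean(plain)
	if err != nil {
		return key, nil, err
	}
	back, err := f.Smudge(stored)
	if err != nil || !bytes.Equal(back, plain) {
		return key, nil, errors.New("filter does not round-trip losslessly")
	}
	return key, stored, nil
}
```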

RubenKelevra commented 4 years ago

It would be nice if I could add a tar archive and IPFS would compress the files with, for example, xz. The file could then be sent either uncompressed or compressed, according to the capabilities of the other node.

This would save a lot of disk space, for example for repositories of Linux distributions, where many files are similar or are just small updates with a small amount of changes.

MatthewSteeples commented 4 years ago

@RubenKelevra similar to what I suggested in #392

kevincox commented 4 years ago

Another use case to consider is IPFS HTTP gateways. If we could upload and store content compressed, the gateway could serve it directly to browsers with no re-compression (if the browser supports the compression that was used).

A simple way to do this would be to have the nodes storing the data advertise "alternate formats". The client can then request one of the alternate formats instead of the original. The contract would be that no matter how the data is stored on a given node the node must be able to serve the data "plain". This way new formats can be added without affecting compatibility with older nodes.

For example, if my node stores data gzip-compressed, I can advertise that I support the deflate alternate format. If the client requests deflate, I simply serve it. If the client requests plain, I can decompress on the fly.
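A small sketch of the negotiation (types invented for illustration): the node advertises the encodings it can serve as-is, the client picks one it understands, and plain remains the universal fallback.

```go
package sketch

// Advertisement lists the alternate encodings a node can serve for a block
// without recompressing. "plain" is implied and always available.
type Advertisement struct {
	Block   string   // content hash of the plain data
	Formats []string // e.g. ["gzip"]
}

// ChooseFormat picks the first encoding both sides support, falling back to
// plain (which the storing node produces by decompressing on the fly).
func ChooseFormat(adv Advertisement, accepted []string) string {
	for _, want := range accepted {
		for _, have := range adv.Formats {
			if want == have {
				return want
			}
		}
	}
	return "plain"
}
```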

RubenKelevra commented 4 years ago

@kevincox gzip isn't a great choice for compressing data.

The only modern compression algorithm a typical browser would support is brotli, which isn't a good general-purpose CPU/compression trade-off either, since it's primarily designed for HTML/CSS/JS compression.

Support for brotli would still be nice, to be able to compress web pages stored and served by IPFS.

The best choice for general-purpose compression these days is zstd, which has the added benefit of supporting custom dictionaries.

Maybe this could be incorporated into unixfs-v2: attach a dictionary to a folder, to be able to compress/decompress all files in the folder with a custom dictionary.

This would increase the compression ratio a lot.
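As a hedged sketch of the per-folder dictionary idea, assuming the github.com/klauspost/compress/zstd package and its dictionary options (the dictionary itself could live as one extra block next to the folder):

```go
package sketch

import "github.com/klauspost/compress/zstd"

// compressWithDict compresses one file from the folder using the folder's
// shared zstd dictionary.
func compressWithDict(plain, dict []byte) ([]byte, error) {
	enc, err := zstd.NewWriter(nil, zstd.WithEncoderDict(dict))
	if err != nil {
		return nil, err
	}
	defer enc.Close()
	return enc.EncodeAll(plain, nil), nil
}

// decompressWithDict reverses it with the same dictionary.
func decompressWithDict(compressed, dict []byte) ([]byte, error) {
	dec, err := zstd.NewReader(nil, zstd.WithDecoderDicts(dict))
	if err != nil {
		return nil, err
	}
	defer dec.Close()
	return dec.DecodeAll(compressed, nil)
}
```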

kevincox commented 4 years ago

I mostly used gzip as an example. The exact algorithm can be decided after the protocol is settled, since we want to support multiple compression formats (especially over time). That being said, I would want to keep the number small at the start (just 1 or 2) so that the chance of being able to share the compressed data is high.

I think you are right about brotli: it does have a notable advantage over most deflate compressors (including gzip) and is supported by most modern browsers. I think it makes sense to aim slightly into the future.

The dictionary problem is interesting. It would be nice to support dictionaries; however, it doesn't seem optimal to tie compressed objects to the folder that contains them. It would be nice if the file itself could reference an appropriate dictionary. This also means that different files in the same directory could use different dictionaries. (For example, my HTML and my JSON files could have specific dictionaries.)

RubenKelevra commented 4 years ago

@kevincox I thought my idea through to the end.

take a look: https://github.com/ipld/specs/issues/76#issuecomment-606852452

(I think this ticket here can be closed, as we're superseding it in the future with implementation-specific ones.)

ribasushi commented 4 years ago

> with implementation-specific ones

Note that while intuitively it seems like the right answer, it is not clear whether this is indeed the case. A lot of tests/numbers need to be produced to determine if the extreme complexity is justified.

For instance, my current work explicitly does not consider any dataset-specific tricks.