TODO: concept page with discussion layout validity

rvagg commented 4 years ago

Arising from an email discussion sparked by @ribasushi - we need somewhere that contains a discussion of multi-block layout validity. Issues include root or parent blocks in a graph making claims about child blocks (sizes, offsets, etc.) that can't be verified without walking the graph--especially acute when you need to walk a large portion of the graph to verify it. e.g. a sharded collection with a size in the root that may not be accurate, or a sharded byte layout like in UnixFSv1 and UnixFSv2 that makes claims in root and intermediate blocks about lengths and/or offsets in leaf blocks that may not be accurate.

There are potential security risks, ranging from simple DoS (taking claims at face value and doing more work than necessary because of them) to data smuggling, padding, and others. It also weakens the 1:1 relationship between data and its cryptographic "proof" (hash), which carries its own set of risks and costs.

Discussion should include an encouragement to build in validation mechanisms to tooling and even transports where possible. At a minimum it should be something we consider when building - build for adversarial environments where data is being produced by tools other than your own and may even be produced with malicious intent. The introduction of awareness of these costs through APIs is also important (e.g. hamt.SizeHint() or vector.MayOrMayNotBeActualSize() vs blob.CalculateAccurateSizeWithFullGraphWalk() (I'm exaggerating to make the point)).

@ribasushi @warpfork @Stebalien @mikeal @vmx please use this as a place to dump random thoughts on the topic; we can use this as a resource for drafting some more formal text on the subject.

Stebalien commented 4 years ago

In general, we can make validating properties like this efficient by making a clever choice about where we declare the error. However, it is important to design our data structures with validation in mind.

Read through https://github.com/ipfs/go-ipfs/pull/4680 for a bunch of context.

For example:

- Size: We can check this when we get to the end of the file. If we haven't read the expected number of bytes, error.
- Block sizes: When traversing, validate that:
  - For all indirect blocks, the sum of the "blocksizes" equals the filesize.
  - The file-size/data-size equals the blocksize specified by the parent.

We can then return a read error when we get a mismatch.
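The two traversal checks above can be sketched in Go roughly as follows. This is an illustration under simplifying assumptions, not the go-ipfs implementation: `Node`, its fields, `stream`, and `ReadAll` are hypothetical names, and children are plain pointers rather than CID links.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
)

// Node is a hypothetical stand-in for a UnixFS-style file node.
type Node struct {
	Data       []byte   // payload, leaves only
	Links      []*Node  // children, indirect blocks only
	BlockSizes []uint64 // this node's claims about each child's byte length
	FileSize   uint64   // this node's claim about its own total byte length
}

// stream writes the bytes under n to w in order. expected is what the parent
// claimed for n; every check is local to the block being visited.
func stream(n *Node, expected uint64, w io.Writer) (uint64, error) {
	if n.FileSize != expected {
		return 0, fmt.Errorf("node claims %d bytes, parent claimed %d", n.FileSize, expected)
	}
	if len(n.Links) == 0 {
		if uint64(len(n.Data)) != n.FileSize {
			return 0, fmt.Errorf("leaf holds %d bytes, claims %d", len(n.Data), n.FileSize)
		}
		m, err := w.Write(n.Data)
		return uint64(m), err
	}
	if len(n.BlockSizes) != len(n.Links) {
		return 0, fmt.Errorf("%d block sizes for %d links", len(n.BlockSizes), len(n.Links))
	}
	var sum uint64
	for _, s := range n.BlockSizes {
		if sum+s < sum { // guard against adversarial overflow
			return 0, fmt.Errorf("block sizes overflow")
		}
		sum += s
	}
	if sum != n.FileSize {
		return 0, fmt.Errorf("block sizes sum to %d, file size is %d", sum, n.FileSize)
	}
	var written uint64
	for i, child := range n.Links {
		m, err := stream(child, n.BlockSizes[i], w)
		written += m
		if err != nil {
			return written, err // read error at the point of mismatch
		}
	}
	return written, nil
}

// ReadAll returns the bytes read before any mismatch, plus the error if any.
// The root has no parent, so its own claim is the starting expectation.
func ReadAll(root *Node) ([]byte, error) {
	var buf bytes.Buffer
	_, err := stream(root, root.FileSize, &buf)
	return buf.Bytes(), err
}
```

Because each check only needs the current block and its parent's claim, a reader that stops early never pays for validating the rest of the file, and a reader that continues hits the error at the same block in every implementation.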

> Discussion should include an encouragement to build in validation mechanisms to tooling and even transports where possible

I wouldn't do this. All implementations should report the error in the same place. If we encourage implementations to validate in the transport, some will and some won't. Then, if someone tries to read a valid portion of a file, validating that portion as they go, they'll succeed in some implementations and not others.

Given that requiring full-file validation is not feasible (precludes streaming), I'd actually require late/local validation.
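As a rough illustration of late/local validation on a stream, the Go sketch below wraps an `io.Reader` and only reports a size mismatch at the moment the stream itself contradicts the claim. The names `sizeCheckedReader`, `NewSizeChecked`, and `readClaimed` are hypothetical and exist only for this example.

```go
package main

import (
	"fmt"
	"io"
	"strings"
)

// sizeCheckedReader passes bytes through unchanged and fails only when the
// underlying stream turns out longer or shorter than the claimed size.
type sizeCheckedReader struct {
	r         io.Reader
	claimed   int64
	remaining int64
	err       error // sticky validation error
}

// NewSizeChecked wraps r with a claim that it yields exactly claimed bytes.
func NewSizeChecked(r io.Reader, claimed int64) io.Reader {
	return &sizeCheckedReader{r: r, claimed: claimed, remaining: claimed}
}

func (s *sizeCheckedReader) Read(p []byte) (int, error) {
	if s.err != nil {
		return 0, s.err
	}
	n, err := s.r.Read(p)
	if int64(n) > s.remaining {
		// Deliver only the bytes covered by the claim, then fail.
		n = int(s.remaining)
		s.remaining = 0
		s.err = fmt.Errorf("stream exceeds the claimed size of %d bytes", s.claimed)
		return n, s.err
	}
	s.remaining -= int64(n)
	if err == io.EOF && s.remaining > 0 {
		s.err = fmt.Errorf("stream ended %d bytes short of the claimed size: %w",
			s.remaining, io.ErrUnexpectedEOF)
		return n, s.err
	}
	return n, err
}

// readClaimed reads data to the end while checking it against claimed.
func readClaimed(data string, claimed int64) ([]byte, error) {
	return io.ReadAll(NewSizeChecked(strings.NewReader(data), claimed))
}
```

A consumer that reads only the first few bytes never sees an error, while one that reads to the end gets the bytes that were valid plus an error at the end, which keeps streaming possible without trusting the claim.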

warpfork commented 4 years ago

I wanna drop a quick note here about a word that @rvagg used in passing during one of our weekly calls, and it stuck in my head and seems like it might be good vocabulary to standardize on:

indicators

As in:

Indicators: data in one block which indicates beliefs about data in other blocks that may or may not be true (and cannot be verified until that block is loaded and hash-checked).

warpfork commented 4 years ago

I suspect that "indicator" data will be a recurring theme, both in our own designs and in other user applications, so making some standard, reusable, and easily-linkable set of recommendations (as well as warnings about limitations) might be very useful.

ipld / specs

TODO: concept page with discussion layout validity #233