ipfs / js-ipfs-unixfs

JavaScript implementation of IPFS' unixfs (a Unix FileSystem representation on top of a MerkleDAG)
Other
87 stars 34 forks source link

Do not throw when Tsize does not equal real DAG size #335

Closed alanshaw closed 8 months ago

alanshaw commented 1 year ago

https://github.com/ipfs/js-ipfs-unixfs/pull/300 started throwing when the exported DAG size did not match the value encoded in the root node.

I've been informed that "there are multiple implementations out there that produce "whatever I feel like" Tsize, if any" and the WIP spec states that this value is optional...as well as:

There is no failure mode known for this field, so your implementation should be able to decode nodes where this field is wrong (not the value you expect), partially or completely missing. https://github.com/ipfs/specs/blob/6c250e676ac2e3d19693e03404c698d56345aad7/UNIXFS.md#tsize--dagsize

Jorropo commented 1 year ago

Note, the spec might not be correct when it says that this is optional, I've since learned that some implementations previous to the spec assume this is not optional. However it is correct when it says that this field is useless and inconsistent.

Jorropo commented 1 year ago

There is no failure mode known for this field, so your implementation should be able to decode nodes where this field is wrong (not the value you expect), partially or completely missing. This also allows smarter encoder to give a more accurate picture (for example don't count duplicate blocks, ...).

rvagg commented 1 year ago

Nice timing, I was only outlining the potential problems with baking in assumptions about this in https://github.com/ipfs/specs/pull/402#pullrequestreview-1434052873, in that case it's about "byte range" requests which are all going to make assumptions about Tsize being according to spec. The ability to produce dodgy values makes this slightly problematic for "trustlessness".

Not advocating either way for this particular issue; the strictness side of me appreciates being forceful with bad data and I've done this in plenty of places (like dag-pb as we were discussing a couple of weeks ago on Slack). But if we have evidence of a nontrivial amount of data that's failing this then we might be stuck with needing to loosen our expectations, or at least providing modes for lax parsing where the user understands the implications and can opt in to alternative behaviour. We have lots of precedents for that (e.g. base64 padded & unpadded handling in dag-json cause of an early mistake in one implementation that got widespread use).

aschmahmann commented 1 year ago

@rvagg as mentioned in the other issue (https://github.com/ipfs/specs/pull/402#discussion_r1201514331) TSize is not about byte ranges and no reasonable implementation would use dag-pb TSize for UnixFS file byte ranges when they have fields for this in the UnixFS data.

Some implementations have ended up with bugs here due to misunderstandings though (e.g. https://github.com/ipfs/go-unixfsnode/pull/34). While it'd be nice for implementations to test against existing fixtures and implementations to build confidence in the implementations and their compatibility, having a clean UnixFS spec to go with fixtures would be much better. However, that's not quite what this issue is about so that'll have to be tackled elsewhere.

achingbrain commented 1 year ago

Tsize isn't consulted in https://github.com/ipfs/js-ipfs-unixfs/pull/300 - the expected size is inferred from the UnixFS file byte ranges stored in the UnixFS metadata.

If we don't throw with under/over reads, the user may get more or less data than they expect which is probably a bad thing. This normally only manifests itself when someone has tried to create their own DAG and got the layout wrong.

Tsize is written by these modules but not read at all really, because it's unreliable and a bit redundant.

achingbrain commented 8 months ago

Closing because we do not throw when Tsize is invalid, only when the UnixFS layout data is invalid which leads to the file data appearing to be corrupted when read.