ipfs / kubo

An IPFS implementation in Go
https://docs.ipfs.tech/how-to/command-line-quick-start/
Other
16.05k stars 3.01k forks source link

`add --cid-version=1 --raw-leaves=false` isn't hash coherent with `add --cid-version=0` #8974

Closed Jorropo closed 1 year ago

Jorropo commented 2 years ago

Checklist

Installation method

built from source

Version

7892cc91f9ed17f5a6e0348334ed09c8bdb3194f

Config

N/A

Description

When using --cid-version=1 ipfs will use CIDv1 CIDs in the root blocks of the file, creating different hashes if the file gets chunked into multiple leaves.

Reproduction

dd if=/dev/urandom count=1 bs=1024 iflag=fullblock of=f
[[ $(ipfs cid base32 $(ipfs add -n --chunker=size-512 -Q --cid-version=0 f)) != $(ipfs add -n --chunker=size-512 -Q --cid-version=1 --raw-leaves=false f) ]] && echo "Failed, did not matched!" || echo "Success matched!"
Jorropo commented 2 years ago

Personally that a WONTFIX (except maybe some docs), having to use --cid-version=0 and then ipfs cid base32 if you care about backward compat doesn't look too bad. I belive we just have to be carefull that this behaviour doesn't happen with #4143.

ribasushi commented 2 years ago

@Jorropo this isn't a bug - the version of a cid is part of the payload. A block with links in cidv0 is different byte-content ( thus different CID ) as opposed to a block with the very same links in cidv1.

Jorropo commented 2 years ago

@ribasushi I know, that why the:

Personally that a WONTFIX

It's just that go-ipfs attempt to have some hash coherency with future versions. (adding the same data multiple times with the same options in future versions should generally give you the same CID back) We have to choose whether this qualify or not. And whether doing ipfs cid base32 $(ipfs add -n --cid-version=0) is a valid alternative. As well as document this alternative.

ribasushi commented 2 years ago

You are probably thinking of --upgrade-cidv0-in-output=true. Note that this was raised ( and rejected ) in https://github.com/ipfs/go-ipfs/issues/4143#issuecomment-811050725

Jorropo commented 2 years ago

You are probably thinking of --upgrade-cidv0-in-output=true.

I'm essentially proposing the same thing but doing it explicitly with ipfs cid base32 (which already exists and works), I think the fix for this issue is just gonna be some more lines of docs somewhere.

schomatis commented 2 years ago

Always happy to get more docs but is this coherency expected by the user somewhere? (And what exactly is the meaning of it?) It seems reliant on very specific and internal knowledge on how we construct these (raw leaves, chunker, etc.).

Jorropo commented 2 years ago

is this coherency expected by the user somewhere?

I have talked with a few people that assume it. We also have tried to somewhat maintain it.

And what exactly is the meaning of it?

Adding the same file multiple times yields the same CID (or at least a compatible one, so really it's only the same multihash).

It seems reliant on very specific and internal knowledge on how we construct these (raw leaves, chunker, etc.).

Yeah

schomatis commented 2 years ago

Pushing some more on your definition to try to better understand it and see how we can fit it in the docs:

Adding the same file multiple times yields the same CID

You're changing the arguments, is the user that changes these still expect the same CID?

a compatible one

Then what does compatible mean? Same multihash? Is the user aware of it, and if it is, it expects it to go unchanged when it changes the CID or any other argument?

If anything this seems more of an issue with the UnixFS layer and explaining to the user that their file/directory/etc representation in IPFS is opaque and they shouldn't rely on the CID/multihash. We attempt to change this as little as possible but there are (AFAIK) no guarantees. For example, we now automatically change the directory implementation (from basic to HAMT) based (roughly) on its number of entries, when they reach a (variably-adjusted) threshold. There is no single "file" entity in IPFS as a continuous string of bits as (say) in the user's hard drive. I'd push for more documentation in that direction.

schomatis commented 2 years ago

(Changing to a docs issues unless there is in fact a guarantee of CID/multihash stability in UnixFS.)

aschmahmann commented 2 years ago

I have talked with a few people that assume it.

IMO there is in no way whatsoever a guarantee that ipfs-vX add foobar generates the same DAG as ipfs-vX+1 add foobar.

From ipfs add --help

https://github.com/ipfs/go-ipfs/blob/a72753bade90c4a48c29aba6c0dc81c44785e9d2/core/commands/add.go#L110-L113

If that wording is insufficient to get the point across we can modify that as well.

We also have tried to somewhat maintain it.

Correct, we don't like changing behavior because no matter what you tell people they'll start relying on things anyway unless it evolves too quickly for people to incorrectly assume it'll never change. This is part of why the defaults for ipfs add have not been changed to a superior DAG structure that uses raw leaves (e.g. ipfs add --cid-version=1 --raw-leaves=true (yes, I know --cid-version=1 implies raw leaves).

When the change to CIDv1 by default happens there will likely need to be a set of comms reminding people that defaults can change and how to generate CIDv0s if they really need to.

@Jorropo unless the idea is you want to add more text to ipfs add this should be closed in favor of an issue in https://github.com/ipfs/ipfs-docs/issues

Jorropo commented 2 years ago

If anything this seems more of an issue with the UnixFS layer and explaining to the user that their file/directory/etc representation in IPFS is opaque and they shouldn't rely on the CID/multihash. We attempt to change this as little as possible but there are (AFAIK) no guarantees. For example, we now automatically change the directory implementation (from basic to HAMT) based (roughly) on its number of entries, when they reach a (variably-adjusted) threshold. There is no single "file" entity in IPFS as a continuous string of bits as (say) in the user's hard drive. I'd push for more documentation in that direction.

I agree with all of this.

I think the issue is getting side tracked.

The issue here is that it is surprising that the CID version change the hash. It seems like a non trivial amount of people (me including before I implemented some raw CID code) think that CID version is just an encoding trick, but the binary data wouldn't change, just how it gets made into text.

unless the idea is you want to add more text to ipfs add

I want to add a

This also affects CID's binary representation inside blocks and could change your hashes.

similar thing to --cid-version's help text.

schomatis commented 2 years ago

I think the implicit assumption that is causing some confusion here is we're talking about a single block (strings of bits) of data.

In general changing CID version doesn't change the hash of the data (because we keep the same hashing algorithm, at least in v1).

But a file is (usually) not a single block of data but a DAG of them, and a DAG is not just the concatenation of the chunks of the original file but something more opaque that includes other metadata to structure it. (That metadata is in fact leaked in the CID when the user might not be actually interested in it.)

The bold text is not obvious to a new user (I had the same confusion for a lot of time) and some more docs could be added on that.

Jorropo commented 1 year ago

I had discussions since then, we ideally want to sunset CIDv0, this will never happen but we are still gonna try.

We want to default to encoding CIDv1 inside the binary blocks even when a CIDv0 could be used because it makes it anoying to write various implementations and have weird edge cases in some of them.