textileio / go-threads

Server-less p2p database built on libp2p
MIT License
451 stars, 65 forks

Your Paper has a wrong definition of the CID in ipfs #557

Closed frank-dspeed closed 2 years ago

frank-dspeed commented 2 years ago

Your paper claims that a CID is based on the file content, and that is true. It also claims that the same content creates the same hash, and that is only partially true, as the CID also includes the IPFS node's DHT information.

A file only gets the same hash when it has the same content and is created on the same IPFS node. Otherwise it is non-deterministic: it does not get the same hash, so it is not a content hash.

I am working on solutions for the same problems, which is why I noticed that mistake in your paper. I am currently wondering whether IPFS would be possible with truly predictable, deterministic hashing.

At the top of page 4 of https://docsend.com/view/gu3ywqi you claim that the same content on different peers produces the same hash. This is the misleading part, I guess.

It should say that the content hash can retrieve the same content no matter which peer it comes from.

Again, creating an IPFS hash from the same content on different nodes ends up with two different CIDs.

merlinran commented 2 years ago

No, a CID is purely based on the file content. You can generate a CID for the content without anything IPFS-related.
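As a minimal sketch of that point, the snippet below builds a CIDv1 string using only the Python standard library, with no IPFS node involved. It assumes the simplest configuration: the `raw` codec (0x55), sha2-256 (0x12), and base32 multibase encoding (prefix `b`). The function name `raw_cid_v1` is made up for this illustration. It only covers content that fits in a single block; larger files are chunked into a DAG, so their root CID is computed differently.

```python
import base64
import hashlib


def raw_cid_v1(content: bytes) -> str:
    """Return a CIDv1 string (raw codec, sha2-256, base32) for `content`.

    Hypothetical helper for illustration; not part of any IPFS library.
    """
    digest = hashlib.sha256(content).digest()
    # multihash = <hash code 0x12 (sha2-256)> <digest length 0x20 (32)> <digest>
    multihash = bytes([0x12, 0x20]) + digest
    # CIDv1 = <version 0x01> <codec 0x55 (raw)> <multihash>
    cid_bytes = bytes([0x01, 0x55]) + multihash
    # multibase prefix "b" means lowercase base32 without padding
    encoded = base64.b32encode(cid_bytes).decode("ascii").lower().rstrip("=")
    return "b" + encoded


if __name__ == "__main__":
    # The result depends only on the bytes passed in, not on where it runs.
    print(raw_cid_v1(b"hello world"))
```

Nothing in the function refers to a node, a peer ID, or the DHT: the only input is the content itself.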

jsign commented 2 years ago

As an extra question, regarding:

".. the CID also Includes the IPFS node DHT Information."

Can you provide the reference where you read that? That claim isn't true. It feels to me there might be some confusion.

frank-dspeed commented 2 years ago

@jsign you can verify that by creating files on different nodes, but if you want the full details:

Using --raw-leaves (implied by --nocopy, iirc) or --inline should also change the CID (but it might depend on the file content).

My conclusion is that we can never assume that everyone is running the same version with the same settings, so we have no deduplication.

jsign commented 2 years ago

@frank-dspeed, thanks for the details. For sure we can talk about the details, that's where the fun is. :)

First, it is true that under different DAG creation configurations you'd get a different CID (e.g., a changed hashing algorithm, DAG layout, raw leaves, etc.), but that's unrelated to your original question, because the definition of DHT you are using isn't correct. This is why I said I sensed some confusion here; I suspected you meant to refer to another concept.

I think that in your original question, the mentions of "between different IPFS nodes" or "DHT" are irrelevant or wrong, since what matters is the DAG creation configuration. To be more verbose, here are some claims:

"my conclusion is that we can never think that all are running the same version with the same settings so we have no deduplication"

In general, most people in the space use ipfs add, which has had the same default values since roughly the beginning, mostly to avoid the very problem you're mentioning. Anyone changing the DAG creation configuration should probably know what they're doing and understand that it will change the CID of the data relative to other people who just run ipfs add.

If you want to be 100% strict and say that we should clarify this in our paper by adding "under the assumption of always using the same DAG building configuration", I think that is a fair point. It isn't usually spelled out every time someone talks about leveraging content hashing, since content addressing always implies a stable, baked-in address creation scheme. If you have f(data) = address, I think it is fair to say nobody should expect f to be changed in the middle of an argument.
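To make the f(data) = address framing concrete, here is a rough sketch in which the configuration is an explicit parameter of f. The names `address` and `toy_root` are invented for this example. `address` builds a CIDv1 for a single raw block with a selectable hash function; `toy_root` is a deliberately simplified stand-in for a DAG layout (a hash over chunk digests) and is not the actual UnixFS format that ipfs add produces.

```python
import base64
import hashlib

# Multihash codes for two hash functions, paired with their implementations.
HASHES = {
    "sha2-256": (0x12, hashlib.sha256),
    "sha2-512": (0x13, hashlib.sha512),
}


def address(data: bytes, hash_name: str = "sha2-256") -> str:
    """f(data) under a given configuration: CIDv1, raw codec, chosen hash."""
    code, hash_fn = HASHES[hash_name]
    digest = hash_fn(data).digest()
    # <version 0x01> <codec 0x55 (raw)> <hash code> <digest length> <digest>
    cid_bytes = bytes([0x01, 0x55, code, len(digest)]) + digest
    return "b" + base64.b32encode(cid_bytes).decode("ascii").lower().rstrip("=")


def toy_root(data: bytes, chunk_size: int) -> str:
    """Toy layout: hash of the concatenated chunk digests (NOT UnixFS)."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    leaves = b"".join(hashlib.sha256(chunk).digest() for chunk in chunks)
    return hashlib.sha256(leaves).hexdigest()


if __name__ == "__main__":
    data = b"same content everywhere"
    # Same data, same configuration: identical address, wherever it is computed.
    print(address(data) == address(data))
    # Same data, different hash function or chunk size: a different address.
    print(address(data, "sha2-256") == address(data, "sha2-512"))
    print(toy_root(data, 4) == toy_root(data, 8))
```

The address changes only when the configuration changes, which is the assumption the proposed wording for the paper would make explicit.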

frank-dspeed commented 2 years ago

@jsign you're correct, please add that part. You should not underestimate the number of people who read the paper without the prerequisite knowledge.

I think we can assume that someone who uses this software is not, in general, familiar with the deep implications of content addressing.

jsign commented 2 years ago

@frank-dspeed, thanks for your feedback!