storacha / w3up

⁂ w3up protocol implementation
https://github.com/storacha-network/specs

CID mismatch with large files #1518

Open tomohiro-n opened 4 months ago

tomohiro-n commented 4 months ago

I've noticed that the CID we pre-calculate for a file and the one reported after it's uploaded to your service can be different. I was then able to reproduce the exact same mismatch (the expected value was what we pre-calculated; the actual was the one after upload) with one of your test cases.

Most likely, it depends on the file size. As far as we've checked, the mismatch appears when the size is greater than roughly 1.9 MB.
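(For reference, here is a minimal sketch of one way to pre-calculate a root CID locally; it assumes `@web3-storage/upload-client` exposes an `encodeFile` helper in its unixfs module, which may differ by version.)

```js
// Sketch: pre-calculating a root CID with the client's own UnixFS encoder.
// ASSUMPTION: `encodeFile` is exported from @web3-storage/upload-client/unixfs;
// check your installed version's exports.
import { encodeFile } from '@web3-storage/upload-client/unixfs'
import fs from 'node:fs'

const bytes = fs.readFileSync('/path/to/large-file')
const { cid } = await encodeFile(new Blob([bytes]))
console.log('pre-calculated root CID:', cid.toString())
```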

The `uploads a file to the service` test in the `upload-client` package fails after the following change:

```diff
-    const bytes = await randomBytes(128)
+    const bytes = fs.readFileSync('/path/to/large-file')
```

I've confirmed that the test passes with a 400 KB file.

tomohiro-n commented 4 months ago

More simply, changing the line to `const bytes = await randomBytes(2_000_000)` is enough to make the test fail.

StefanoDeVuono commented 2 months ago

Hi! I'm not too familiar with your code base, but I took a look at this and found a couple of things:

TL;DR: Problem: the actual code and the test code create different-sized data streams, which then become CARs with different CIDs.

Fix (PR 2532): use the UnixFS module's `createFileEncoderStream` in the test helper, so the actual code and the test code create the same-sized data streams, resulting in the same CAR CIDs for large files.


Details: the actual code calls `CarWriter.create` while the test helper calls `toCAR`. For small byte sizes, like 128, the expected CID and the actual CID are the same, as desired. Moreover, the number of bytes in the expected CAR instance and in the actual CAR instance are the same:

[screenshot: expected and actual CAR sizes match for the small payload]

If we use more bytes, the expected CAR instance (2,000,098 bytes) is smaller than the actual one (2,000,283 bytes):

[screenshot: expected CAR size 2,000,098 bytes vs actual 2,000,283 bytes]
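For intuition, here is a sketch of what an unchunked encoding looks like: the whole payload hashed as a single block. (The raw codec below is an assumption for illustration, not the repo's actual helper.)

```js
// Sketch (NOT the repo's actual toCAR helper): hash the payload as a single
// raw block. An unchunked encoding like this adds no per-chunk UnixFS
// headers, consistent with the smaller "expected" CAR size above.
import { CID } from 'multiformats/cid'
import { sha256 } from 'multiformats/hashes/sha2'
import * as raw from 'multiformats/codecs/raw'

const bytes = new Uint8Array(2_000_000)
const digest = await sha256.digest(bytes)
const cid = CID.createV1(raw.code, digest)
console.log('single-block CID:', cid.toString())
```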

The test helper's `toCAR` method does not chunk the bytes, while the UnixFS module underlying the actual code, with a max chunk size of 1024 * 1024, splits the 2_000_000 bytes into three chunks. So, in the real code, extra header data is added for each chunk.
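You can see that chunking in action with a quick sketch (again assuming `createFileEncoderStream` is importable as below; check your version's exports):

```js
// Sketch: count the blocks the UnixFS encoder emits for 2_000_000 bytes.
// With a ~1 MiB max chunk size the payload is split into multiple blocks,
// each carrying its own header bytes; that is where the extra CAR bytes
// in the "actual" instance come from.
import { createFileEncoderStream } from '@web3-storage/upload-client/unixfs'

const blob = new Blob([new Uint8Array(2_000_000)])
const blocks = []
await createFileEncoderStream(blob).pipeTo(
  new WritableStream({ write: (block) => { blocks.push(block) } })
)
console.log('blocks:', blocks.length) // data chunks + root node
console.log('total bytes:', blocks.reduce((n, b) => n + b.bytes.length, 0))
```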

By using the UnixFS module's `createFileEncoderStream` method to make a chunked stream before making a CAR object, the same headers get added to each chunk in both paths and the same CAR gets created (see PR 2532). The tests then pass at both 128 bytes and 2_000_000 bytes. A sketch of the shape of the change follows.
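Roughly, the helper would look like this (a sketch only; PR 2532 has the real diff):

```js
// Sketch of a test helper that encodes through the same UnixFS stream as the
// real code before packing a CAR, so both paths chunk identically.
import { CarWriter } from '@ipld/car'
import { createFileEncoderStream } from '@web3-storage/upload-client/unixfs'

async function toCAR (bytes) {
  // Encode the payload through the UnixFS encoder (same as the upload path).
  const blocks = []
  await createFileEncoderStream(new Blob([bytes])).pipeTo(
    new WritableStream({ write: (block) => { blocks.push(block) } })
  )
  const root = blocks.at(-1).cid // the root block is emitted last

  // Pack every block into a CAR whose header carries the root CID.
  const { writer, out } = CarWriter.create([root])
  const chunks = []
  const drained = (async () => {
    for await (const chunk of out) chunks.push(chunk)
  })()
  for (const block of blocks) await writer.put(block)
  await writer.close()
  await drained
  const carBytes = new Uint8Array(await new Blob(chunks).arrayBuffer())
  return { roots: [root], bytes: carBytes }
}
```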

Hopefully, this is helpful!