pugs / byfs

Beyond File Sharing - Cloud Scale Immutable, Layered, Versioned Dataset Sharing without Copying

Why not use content-addressable storage instead of version? #1

Open eriknordmark opened 1 month ago

eriknordmark commented 1 month ago

GitHub and Docker/OCI have done something interesting in how they name read-only snapshots of content: each one is named by the cryptographic checksum of its content. Such a name isn't user-friendly, and it doesn't even show which versions are older or newer, but it provides significant benefits. It comes with the ability to integrity-check the content (if you fetch all of it), and in principle it doesn't matter where you fetch the content from (an origin server, a cache, or a peer computer on the network), since you can verify that it is the correct thing.
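A minimal sketch of that naming scheme, assuming SHA-256 and an OCI-style `sha256:` prefix. The helper names `content_name` and `verify` are made up for illustration and are not part of byfs, git, or any OCI tool:

```python
import hashlib


def content_name(data: bytes) -> str:
    """Name a blob by the SHA-256 of its content (hypothetical helper)."""
    return "sha256:" + hashlib.sha256(data).hexdigest()


def verify(name: str, data: bytes) -> bool:
    """Check that fetched bytes match the name they were requested by."""
    return content_name(data) == name


snapshot = b"hello, dataset v1"
name = content_name(snapshot)

# The same check applies whether the bytes came from an origin server,
# a cache, or an untrusted peer.
from_origin = snapshot
from_untrusted_peer = b"hello, dataset v1 (tampered)"

ok_origin = verify(name, from_origin)
ok_peer = verify(name, from_untrusted_peer)
```

The verification only works once all of the named content has been fetched, which is the caveat in the parenthesis above.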

Then on top of that there can be semantic versions, which can be monotonically increasing etc., but also other tags that make sense to users. From a security perspective, the semver/tag to sha-256 resolution becomes very sensitive: if I can convince you that the OCI container nginx:1.0.0 corresponds to a different sha, and you download and run that, you have no idea what software you installed and ran. But that resolution is completely decoupled from the storage and transport of the actual bits.
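To make the split concrete, here is a small sketch with two separate tables: an immutable digest-to-bytes store and a mutable tag-to-digest map. Everything here (`blobs`, `tags`, `put`, `fetch`) is a hypothetical in-memory stand-in, not a real registry API:

```python
import hashlib


def digest(data: bytes) -> str:
    return "sha256:" + hashlib.sha256(data).hexdigest()


# Immutable, content-addressed storage: digest -> bytes.
blobs = {}


def put(data: bytes) -> str:
    d = digest(data)
    blobs[d] = data
    return d


def fetch(tag: str, tag_map: dict):
    """Resolve a tag to a digest, then fetch and integrity-check the bytes."""
    d = tag_map[tag]
    data = blobs[d]
    if digest(data) != d:
        raise ValueError("content does not match its digest")
    return d, data


good = put(b"nginx 1.0.0 image bytes")
evil = put(b"something else entirely")

# Mutable, security-sensitive mapping: tag -> digest.
tags = {"nginx:1.0.0": good}
# An attacker who controls tag resolution points the same tag elsewhere.
spoofed_tags = {"nginx:1.0.0": evil}

d_good, _ = fetch("nginx:1.0.0", tags)
d_spoofed, _ = fetch("nginx:1.0.0", spoofed_tags)

# The integrity check passes in both cases; only a digest obtained
# through a trusted channel reveals the substitution.
pinned = good
spoof_detected = d_spoofed != pinned
```

The point of the sketch is that the storage layer stays honest in both cases. The attack lives entirely in the tag map, so that is the part that needs signing or some other trusted channel.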

It might be interesting to explore what this would mean in a storage system where you do not download whole objects (like OCI containers). A thought experiment might be to take a fixed block size system (or a variable size one like ZFS) and see what happens if the blocks were referenced by a sha-384 or sha-512 in the future.
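One way to run that thought experiment on paper, as a sketch only: fixed-size blocks, each referenced by its SHA-512, with a version named by the hash of its block list. The 4-byte `BLOCK_SIZE`, the dict-backed `store`, and the `commit`/`read_block` helpers are assumptions for illustration and say nothing about how byfs or ZFS actually lay out data:

```python
import hashlib

BLOCK_SIZE = 4  # unrealistically small, for illustration only


def block_id(block: bytes) -> str:
    return hashlib.sha512(block).hexdigest()


def commit(data: bytes, store: dict):
    """Split data into fixed-size blocks, store each under its hash,
    and return (root, block ids). The root names the whole version."""
    ids = []
    for off in range(0, len(data), BLOCK_SIZE):
        block = data[off:off + BLOCK_SIZE]
        bid = block_id(block)
        store[bid] = block  # identical blocks collapse to one entry
        ids.append(bid)
    root = hashlib.sha512("\n".join(ids).encode()).hexdigest()
    return root, ids


def read_block(ids: list, index: int, store: dict) -> bytes:
    """Fetch and verify one block without touching the rest of the object."""
    block = store[ids[index]]
    if block_id(block) != ids[index]:
        raise ValueError("block does not match its hash")
    return block


store = {}
root_v1, ids_v1 = commit(b"AAAABBBBCCCC", store)
root_v2, ids_v2 = commit(b"AAAAXXXXCCCC", store)  # one block changed

unique_blocks = len(store)
changed_block = read_block(ids_v2, 1, store)
```

Two things fall out of this toy version: a reader can verify a single block against the block list without downloading the whole object, and two versions that differ in one block share the storage for the others.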

pugs commented 1 month ago

Could be done. The only downside would be the need to scan everything at commit, which would impact performance. Thanks for thinking about this! I'm getting very little feedback overall.

In reply to Erik Nordmark's comment of Fri, Sep 13, 2024, 8:46 AM.

eriknordmark commented 1 month ago

> Could be done. The only downside would be the need to scan everything at commit, which would impact performance.

Might not be that much overhead: ZFS already computes a checksum for every block when it is written (fletcher4 by default; SHA-256 is an option, and is what dedup uses by default). The key is what a block is, and the relationship between blocks and objects.
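A rough sketch of why hashing at write time could avoid a full scan at commit. The `Volume` class and its counter are hypothetical, and this is not a description of how ZFS or byfs behaves:

```python
import hashlib


class Volume:
    """Hypothetical block store that hashes each block as it is written."""

    def __init__(self, nblocks: int):
        self.ids = [""] * nblocks
        self.blocks_hashed = 0  # counts block-content hash operations

    def write_block(self, index: int, data: bytes) -> None:
        self.ids[index] = hashlib.sha256(data).hexdigest()
        self.blocks_hashed += 1

    def commit(self) -> str:
        # Hashes only the list of block ids, never the block contents.
        manifest = "\n".join(self.ids).encode()
        return hashlib.sha256(manifest).hexdigest()


vol = Volume(1000)
for i in range(1000):
    vol.write_block(i, b"initial")
v1 = vol.commit()

hashed_before = vol.blocks_hashed
vol.write_block(42, b"changed")
v2 = vol.commit()
hashed_for_v2 = vol.blocks_hashed - hashed_before
```

In this toy model the second commit costs one block hash plus one pass over the id list. A flat id list still grows with the volume, which is where a tree of hashes and the block-versus-object question would come in.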