mamba-org / rattler

Rust crates to work with the Conda ecosystem.
BSD 3-Clause "New" or "Revised" License
226 stars 42 forks source link

feat: Only save md5 in lock file if no sha256 is present #764

Closed Hofer-Julian closed 3 days ago

Hofer-Julian commented 3 days ago

By saving only one of those two hashes, we reduce the lock file size a bit. This should be especially noticeable in git repos, since hashes compress poorly.

This PR is a contribution to improving https://github.com/prefix-dev/pixi/issues/1509

tdejager commented 3 days ago

Could you give some statistics on file size or compressed file reductions for: https://github.com/rerun-io/rerun (e.g)?

tdejager commented 3 days ago

@Hofer-Julian Maybe something like this to test the compression?

import zlib

# Step 1: Read the file contents
input_file_path = 'input.txt'
with open(input_file_path, 'rb') as f:
    file_data = f.read()

# Step 2: Compress the contents
compressed_data = zlib.compress(file_data)

# Step 3: Save the compressed data
output_file_path = 'output.txt.zlib'
with open(output_file_path, 'wb') as f:
    f.write(compressed_data)

print(f"Compressed file saved as {output_file_path}")
Hofer-Julian commented 3 days ago

Could you give some statistics on file size or compressed file reductions for: https://github.com/rerun-io/rerun (e.g)?

Here you go:

nu ❯ ls --du | sort-by size | reverse
╭───┬────────────────────────────┬──────┬───────────┬───────────────╮
│ # │            name            │ type │   size    │   modified    │
├───┼────────────────────────────┼──────┼───────────┼───────────────┤
│ 0 │ conda-lock_with_md5        │ file │   1.7 MiB │ 8 minutes ago │
│ 1 │ conda-lock_without_md5     │ file │   1.6 MiB │ 8 minutes ago │
│ 2 │ conda-lock_with_md5.zip    │ file │ 241.0 KiB │ 2 minutes ago │
│ 3 │ conda-lock_without_md5.zip │ file │ 199.0 KiB │ 2 minutes ago │
╰───┴────────────────────────────┴──────┴───────────┴───────────────╯

The fraction between with and without md5 not zipped is 0.94. The fraction between with and without md5 zipped is 0.82.

That indicates that the impact is indeed higher for compressed situations.

tdejager commented 3 days ago

That's pretty significant for the compressed case :)

baszalmstra commented 2 days ago

@tdejager @Hofer-Julian The reason we store all the metadata including the md5 and the sha256 is because a matchspec can depend on these values. E.g you can have torch[md5=123456789]. That is also the reason we store the license etc. I know this is an edge case but by removing this we can no longer statitically verify the lock-file correctly. We also break incremental updates.

tdejager commented 2 days ago

Do we have examples of this being used?

baszalmstra commented 2 days ago

Not from the top of my head no. But if we remove this field we can also remove a bunch of others: e.g. size, file_name, timestamp, etc.

But I would be opposed to that because it would potentially break a future incremental update of a lock-file. We have been advocating for storing the entire repodata record in the lock-file for exactly this reason.

baszalmstra commented 2 days ago

See also the design goals and text about conparison with conda-lock: https://docs.rs/rattler_lock/latest/rattler_lock/#relation-to-conda-lock

tdejager commented 1 day ago

Yes you are right, we might need to come up with alternative solutions though, we could revert this.

Because the large lock files are a problem :)