pypi-data / data

Public datasets with per-file information about packages uploaded to PyPI.
MIT License

What is the "hash" in the metadata dataset exactly, are you sure it is SHA256? #39

Open venthur opened 8 months ago

venthur commented 8 months ago

According to the docs, the metadata dataset about every file uploaded to PyPI, i.e. the parquet files listed in https://github.com/pypi-data/data/raw/main/links/dataset.txt, contains a SHA256 hash. However, it is not described how the hash is calculated.

When trying to verify that you calculate the SHA256 over the respective file itself, I encountered some issues:

Can you explain which hash you are using, and whether you are hashing the contents of the file linked to via the path?

Thank you very much for the awesome dataset!

orf commented 1 month ago

Hey! Thanks for the kind words @venthur. Sorry about the late reply: I had these messages filtered out.

It's been a while since I looked at this project, but you're right: it's not a SHA256 hash. I dug into where I generate this, and for some reason I chose to use Oid::hash_object here, from libgit2.

That was... an unfortunate decision, and I can't see why I chose to do it that way. Parts of this project were pretty experimental, and I was pretty much learning Rust at the time.

So it's going to be a SHA1 hash of blob ${length}\0${content}. Which is actually so inconvenient.
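
For anyone who wants to check a file against the dataset in the meantime, here is a minimal sketch of that scheme in Python. It assumes the stored value is the hex digest of a standard Git blob object ID, which is what the description above implies but which has not been verified against the dataset itself. The function name git_blob_sha1 is made up for illustration.

```python
import hashlib


def git_blob_sha1(content: bytes) -> str:
    """Return the Git blob object ID (hex SHA1) for the given file content.

    Git hashes the header "blob <length>\\0" followed by the raw bytes,
    where <length> is the decimal byte length of the content.
    """
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


if __name__ == "__main__":
    # Well-known Git object IDs, useful as a sanity check:
    print(git_blob_sha1(b""))               # the empty blob
    print(git_blob_sha1(b"hello world\n"))  # same as `git hash-object` on that content
```

The result should match what `git hash-object <file>` prints for the same bytes, so that command is another way to cross-check a single file.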

I'll try and think of a way to rectify this.