venthur closed this 1 week ago
Hey! Thanks for this - I’ll take a look. Not sure why this is happening.
Unfortunately the files are not stable, and you’ll need to re-download each one.
This isn’t an inherent limitation, though. I’ll have a think about how I can make them stable.
Just an update on this @venthur - I spent today fixing all this. The datasets are now significantly smaller and fewer in number, and the file names contain stable hashes that let you detect whether you need to re-download them or not.
See here for an example: https://github.com/pypi-data/data/releases/tag/2024-10-20-16-48
The dataset links should be updated once this build finishes.
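For anyone wanting to use the hashed names to skip redundant downloads, here is a rough sketch of the idea. It assumes that a file whose name already exists locally is unchanged (because the hash is part of the name) and that the links file lists one URL per line. The function name `urls_to_download` and the local directory layout are made up for illustration and are not part of the project.

```python
from pathlib import Path
from urllib.parse import urlparse


def urls_to_download(urls, local_dir):
    """Return the URLs whose file name is not already present in local_dir.

    Assumption: the file name embeds a stable content hash, so a matching
    name on disk means the file does not need to be fetched again.
    """
    local_dir = Path(local_dir)
    # Collect the names we already have; a missing directory means nothing is cached.
    existing = {p.name for p in local_dir.iterdir()} if local_dir.is_dir() else set()

    missing = []
    for url in urls:
        # The last path component of the URL is taken as the file name.
        name = Path(urlparse(url).path).name
        if name and name not in existing:
            missing.append(url)
    return missing
```

You would feed it the lines of the dataset links file and then download only what it returns. Files that exist locally but no longer appear in the links file are not handled here; cleaning those up would be a separate step.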
Hi Tom,
since the fix last week, some parquet files are missing from https://raw.githubusercontent.com/pypi-data/data/main/links/dataset.txt, namely everything from 4-9. I thought this was an artifact of your fix yesterday, but today it is the same:
Out of curiosity: Are the parquet files stable over time? I.e. do I have to re-download the whole dataset all the time or are the first parquet files stable and won't change anymore?