pypi-data / data

Public datasets with per-file information about packages uploaded to PyPI.
MIT License

several parquet files are missing from dataset.txt #78

Closed by venthur 1 week ago

venthur commented 2 months ago

Hi Tom,

Since the fix last week, some parquet files are missing from https://raw.githubusercontent.com/pypi-data/data/main/links/dataset.txt, namely index-4 through index-9. I thought this was an artifact of your fix yesterday, but today it is the same. The file still lists only the following:

https://github.com/pypi-data/data/releases/download/2024-09-03-03-05/index-0.parquet
https://github.com/pypi-data/data/releases/download/2024-09-03-03-05/index-1.parquet
https://github.com/pypi-data/data/releases/download/2024-09-03-03-05/index-10.parquet
https://github.com/pypi-data/data/releases/download/2024-09-03-03-05/index-11.parquet
https://github.com/pypi-data/data/releases/download/2024-09-03-03-05/index-12.parquet
https://github.com/pypi-data/data/releases/download/2024-09-03-03-05/index-13.parquet
https://github.com/pypi-data/data/releases/download/2024-09-03-03-05/index-14.parquet
https://github.com/pypi-data/data/releases/download/2024-09-03-03-05/index-15.parquet
https://github.com/pypi-data/data/releases/download/2024-09-03-03-05/index-16.parquet
https://github.com/pypi-data/data/releases/download/2024-09-03-03-05/index-17.parquet
https://github.com/pypi-data/data/releases/download/2024-09-03-03-05/index-18.parquet
https://github.com/pypi-data/data/releases/download/2024-09-03-03-05/index-19.parquet
https://github.com/pypi-data/data/releases/download/2024-09-03-03-05/index-2.parquet
https://github.com/pypi-data/data/releases/download/2024-09-03-03-05/index-20.parquet
https://github.com/pypi-data/data/releases/download/2024-09-03-03-05/index-21.parquet
https://github.com/pypi-data/data/releases/download/2024-09-03-03-05/index-22.parquet
https://github.com/pypi-data/data/releases/download/2024-09-03-03-05/index-23.parquet
https://github.com/pypi-data/data/releases/download/2024-09-03-03-05/index-24.parquet
https://github.com/pypi-data/data/releases/download/2024-09-03-03-05/index-25.parquet
https://github.com/pypi-data/data/releases/download/2024-09-03-03-05/index-26.parquet
https://github.com/pypi-data/data/releases/download/2024-09-03-03-05/index-27.parquet
https://github.com/pypi-data/data/releases/download/2024-09-03-03-05/index-28.parquet
https://github.com/pypi-data/data/releases/download/2024-09-03-03-05/index-29.parquet
https://github.com/pypi-data/data/releases/download/2024-09-03-03-05/index-3.parquet
https://github.com/pypi-data/data/releases/download/2024-09-03-03-05/index-30.parquet
https://github.com/pypi-data/data/releases/download/2024-09-03-03-05/index-31.parquet
https://github.com/pypi-data/data/releases/download/2024-09-03-03-05/index-32.parquet
https://github.com/pypi-data/data/releases/download/2024-09-03-03-05/index-33.parquet
https://github.com/pypi-data/data/releases/download/2024-09-03-03-05/index-34.parquet
https://github.com/pypi-data/data/releases/download/2024-09-03-03-05/index-35.parquet
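
A gap like this can be spotted mechanically. The following is a minimal sketch, not part of the project's tooling: it assumes the file names follow the `index-N.parquet` pattern shown above and that the indices are meant to run contiguously from 0 up to the highest one listed. The function name `missing_indices` is hypothetical.

```python
import re

# Matches the numeric index in names such as ".../index-12.parquet".
INDEX_PATTERN = re.compile(r"index-(\d+)\.parquet$")


def missing_indices(urls):
    """Return the index numbers absent from a list of index-N.parquet URLs.

    Assumes indices should be contiguous from 0 to the largest one present.
    """
    found = set()
    for url in urls:
        match = INDEX_PATTERN.search(url.strip())
        if match:
            found.add(int(match.group(1)))
    if not found:
        return []
    return [i for i in range(max(found) + 1) if i not in found]
```

Passing in the lines of `dataset.txt` (for example, the downloaded text split with `splitlines()`) returns the absent indices in ascending order. A trailing gap cannot be detected this way, since the highest listed index is taken as the upper bound.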

Out of curiosity: Are the parquet files stable over time? I.e. do I have to re-download the whole dataset all the time or are the first parquet files stable and won't change anymore?

orf commented 1 month ago

Hey! Thanks for this - I’ll take a look. Not sure why this is happening.

Unfortunately the files are not stable, and you’ll need to re-download each one.

This isn’t an inherent limitation, though; I’ll have a think about how I can make them stable.

orf commented 1 week ago

Just an update on this, @venthur: I spent today fixing all of this. The datasets are now significantly smaller and less numerous, and the file names contain stable hashes that let you detect whether you need to re-download them.

See here for an example: https://github.com/pypi-data/data/releases/tag/2024-10-20-16-48

The dataset links should be updated once this build finishes.
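
With hashes in the file names, a consumer could decide what to fetch by comparing names alone. The sketch below rests on assumptions not confirmed in this thread: that a file whose contents change is published under a new name, and that local copies are stored under the same file names as in the URLs. The exact naming scheme is not shown here, so the code never parses the hash; `plan_sync` is a hypothetical helper.

```python
from pathlib import Path, PurePosixPath
from urllib.parse import urlparse


def plan_sync(urls, local_dir):
    """Compare published URLs with local files by name.

    Returns (to_fetch, stale): URLs whose file name is not present locally,
    and local .parquet names that are no longer published.
    """
    # Map each published file name to its URL.
    wanted = {}
    for url in urls:
        url = url.strip()
        if url:
            wanted[PurePosixPath(urlparse(url).path).name] = url

    # Names of the parquet files already downloaded.
    have = {p.name for p in Path(local_dir).glob("*.parquet")}

    to_fetch = [url for name, url in wanted.items() if name not in have]
    stale = sorted(have - set(wanted))
    return to_fetch, stale
```

Only the URLs in `to_fetch` would need downloading, and the names in `stale` could be deleted. This works without reading file contents only if the names really are content-derived, which is the assumption stated above.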