pypi / warehouse

The Python Package Index
https://pypi.org
Apache License 2.0

BigQuery `bigquery-public-data.pypi.distribution_metadata` missing data #16008

Open mensfeld opened 6 months ago

mensfeld commented 6 months ago

Running this query:

SELECT 
  name,
  version,
  summary,
  description
FROM 
  `bigquery-public-data.pypi.distribution_metadata`
WHERE 
  name = 'virtualenv'
ORDER BY 
  version;

omits several newer versions, released in April and May, that are listed here: https://pypi.org/project/virtualenv/#history. The same happens for some other packages.

Describe the bug

Information for all versions should be available in BigQuery.

Expected behavior

I would expect them to be available in BigQuery (allowing for eventual consistency, of course).

To Reproduce

Run in BigQuery:

SELECT 
  name,
  version,
  summary,
  description
FROM 
  `bigquery-public-data.pypi.distribution_metadata`
WHERE 
  name = 'virtualenv'
ORDER BY 
  version;

and see versions are missing.
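To pin down exactly which versions are absent, the query result can be diffed against what PyPI itself lists. The following is a minimal sketch, not part of the original report: it assumes the version list comes from PyPI's JSON API (`https://pypi.org/pypi/<project>/json`), the function names are hypothetical, and the sample values are illustrative rather than real query output.

```python
import json
from urllib.request import urlopen


def pypi_versions(project: str) -> set:
    """Fetch every release version PyPI lists for a project (needs network access)."""
    # In the JSON API response, the "releases" key maps version strings to files.
    with urlopen(f"https://pypi.org/pypi/{project}/json") as resp:
        return set(json.load(resp)["releases"])


def missing_from_bigquery(on_pypi: set, in_bigquery: set) -> list:
    """Return the versions PyPI lists that the BigQuery result lacks."""
    # Sorted as plain strings, the same way ORDER BY version sorts in the query.
    return sorted(on_pypi - in_bigquery)


# Illustrative values only, not real query output.
on_pypi = {"20.25.0", "20.25.1", "20.25.2", "20.25.3"}
in_bigquery = {"20.25.0", "20.25.1"}
print(missing_from_bigquery(on_pypi, in_bigquery))
```

In practice, `in_bigquery` would be the set of `version` values returned by the query above, and `on_pypi` would come from `pypi_versions("virtualenv")`.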

ewdurbin commented 6 months ago

The task that ensures consistency was disabled due to poor performance in... 2021 🙃

https://github.com/pypi/warehouse/pull/10256

But was never subsequently re-enabled that I can tell, as the contributor never returned to address the issue.

For triage, I have manually run this task. Can you confirm whether you're now seeing consistent data?

mensfeld commented 6 months ago

@ewdurbin was all the data synced? That is, should all the historical gaps be filled now?

When I query virtualenv I'm still missing 20.25.2+ versions (anything newer).

Is there any other endpoint to get the recent releases data?

mensfeld commented 5 months ago

@ewdurbin I'm still not seeing the newer releases of virtualenv in the BigQuery dataset :(

ewdurbin commented 5 months ago

Hmmm, unclear what the issue is. @di are you familiar with why the sync wouldn't capture past releases?

di commented 5 months ago

That's not the job that inserts new metadata; that job just syncs missing metadata if insertion fails for some reason.

Insertion of new metadata happens on upload: https://github.com/pypi/warehouse/blob/main/warehouse/forklift/legacy.py#L1222-L1223

The timeline here is suspiciously close to when we did some migrations on these schemas. My guess is that update_bigquery_release_files is failing and we're unaware.

ewdurbin commented 5 months ago

So sync_bigquery_release_files is not the bulk equivalent of update_bigquery_release_files?

di commented 5 months ago

It is, but it shouldn't be necessary anymore: metadata should be reliably inserted on upload (but it appears it no longer is).

ewdurbin commented 5 months ago

Hm, okay. I ran sync_bigquery_release_files in an attempt to triage, and it seems it didn't bulk-load the missing info. Seems this needs some more investigation.

di commented 5 months ago

It's probably failing for the same reason the individual job is failing, I would venture to guess!
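The guess above can be illustrated with a small sketch of how a per-upload insert and a bulk backfill end up failing together when they share one step. This is a hypothetical model only: none of the names, fields, or the schema below come from warehouse, and the real failure cause has not been established in this thread.

```python
# Hypothetical sketch: these names and fields are assumptions, not warehouse code.
EXPECTED_COLUMNS = {"name", "version", "summary"}  # assumed table schema


def build_row(release: dict) -> dict:
    """Shared step: shape a release into a table row, rejecting unknown fields."""
    extra = set(release) - EXPECTED_COLUMNS
    if extra:
        raise ValueError(f"fields not in table schema: {sorted(extra)}")
    return dict(release)


def insert_on_upload(release: dict, table: list) -> bool:
    """Per-upload path: insert one row; a failure is swallowed and reported as False."""
    try:
        table.append(build_row(release))
        return True
    except ValueError:
        return False


def bulk_sync(releases: list, table: list) -> int:
    """Backfill path: retry every release that is absent from the table."""
    present = {(row["name"], row["version"]) for row in table}
    missing = [r for r in releases if (r["name"], r["version"]) not in present]
    # The backfill reuses the per-upload insert, so it inherits its failures.
    return sum(insert_on_upload(r, table) for r in missing)


table = []
# A release carrying a field the assumed table schema does not know about.
release = {"name": "virtualenv", "version": "20.25.2", "summary": "", "new_field": 1}
print(insert_on_upload(release, table))  # per-upload insert fails quietly
print(bulk_sync([release], table))       # backfill inserts nothing either
```

Under these assumptions, running the bulk job cannot fill the gaps until the shared step is fixed, which would match the observation that a manual sync did not load the missing rows.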