jantman / pypi-download-stats

ABANDONED - Calculate detailed download stats and generate HTML and badges for PyPI packages
GNU Affero General Public License v3.0
24 stars 10 forks source link

Major cost reductions: Check the clustered tables #22

Open fhoffa opened 5 years ago

fhoffa commented 5 years ago

Feature Request

Feature Description

Searching for a year of data can represent costs of >208GB of scanned data.

With the new clustered tables, this cost can be reduced by ~95%

See https://towardsdatascience.com/python-pypi-stats-in-bigquery-reclustered-d80e583e1bfe.

To use these tables, queries should change from:

SELECT TIMESTAMP_TRUNC(timestamp, WEEK) week
  , REGEXP_EXTRACT(details.python, r'^\d*\.\d*') python
  , COUNT(*) downloads
FROM `the-psf.pypi.downloads2017*`
WHERE file.project='pyspark'
GROUP BY week, python
HAVING python != '3.6' AND week<'2017-12-30'
ORDER BY week

9.0 sec elapsed, 200.88 GB processed # find improvements below

to

SELECT TIMESTAMP_TRUNC(timestamp, WEEK) week
  , REGEXP_EXTRACT(details.python, r'^\d*\.\d*') python
  , COUNT(*) downloads
FROM `fh-bigquery.pypi.pypi_2017`
WHERE project='pyspark'
AND timestamp>'2000-01-01' # nag
GROUP BY week, python
HAVING python != '3.6' AND week<'2017-12-30'
ORDER BY week
5.4 sec elapsed, 9.65 GB processed # winning
jantman commented 5 years ago

Felipe,

Cool, thanks! I honestly can't say when I'm going to get a chance to make that change since I haven't touched this project in a while, and it's not something that I've really noticed as being an issue.

Also, it seems that https://pypistats.org/ provides a free API to do most of what this project does, so there's some question in my mind of whether this project is even still needed.

I'll leave this open though and try to get to it at some point. Thanks for giving me a heads-up.