hugovk / pypi-tools

Command-line Python scripts to do things with PyPI
https://hugovk.github.io/pypi-tools

source_finder.py and top_repos.py #10

Closed hugovk closed 4 years ago

hugovk commented 4 years ago

source_finder.py

Given a PyPI package, source_finder.py looks for the source repository in its metadata.

$ python source_finder.py six
https://github.com/benjaminp/six
$ python source_finder.py urllib3
None
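
As a rough illustration of the lookup, here is a minimal sketch that scans PyPI JSON metadata for a repository-looking URL. It is not the code in source_finder.py: the function name `find_source_repo`, the list of hosts in `REPO_PATTERN`, and the order in which fields are checked are all assumptions. Only the `info.project_urls` and `info.home_page` fields come from the PyPI JSON API itself.

```python
import re

# Hosts treated as "source repositories" here. This list is an assumption
# made for illustration, not something taken from source_finder.py.
REPO_PATTERN = re.compile(
    r"^https?://(github\.com|gitlab\.com|bitbucket\.org)/[^/]+/[^/]+",
    re.IGNORECASE,
)


def find_source_repo(metadata):
    """Return the first repository-looking URL in PyPI JSON metadata, or None."""
    info = metadata.get("info", {})
    candidates = []
    # project_urls maps labels such as "Homepage" to URLs and may be null.
    candidates.extend((info.get("project_urls") or {}).values())
    # home_page is the older single-URL field.
    candidates.append(info.get("home_page"))
    for url in candidates:
        if url and REPO_PATTERN.match(url):
            return url.rstrip("/")
    return None


# Hand-written sample metadata, not real API responses.
with_repo = {"info": {"project_urls": {"Homepage": "https://github.com/benjaminp/six"}}}
without_repo = {"info": {"project_urls": None, "home_page": "https://example.org/docs/"}}

print(find_source_repo(with_repo))
print(find_source_repo(without_repo))
```

A package whose metadata only links to a documentation site falls through to `None`, which is the same kind of result as the urllib3 lookup above.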

It caches the JSON metadata downloaded from PyPI in a temporary directory; use the --verbose option to see where. Each cache filename starts with the current year and month, and cache files from earlier months are deleted once the month changes.

$ python source_finder.py s3transfer --verbose
API URL: https://pypi.org/pypi/s3transfer/json
Cache file: /Users/hugo/Library/Caches/source-finder/2019-10-https-pypi-org-pypi-s3transfer-json.json
Cache file exists
project_urls    Homepage        https://github.com/boto/s3transfer
Success!
project_urls    Homepage        https://github.com/boto/s3transfer
Success!
https://github.com/boto/s3transfer
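
The cache filename in the output above looks like a year-month prefix followed by a slug of the API URL. The following sketch reproduces that naming and shows one way to spot files left over from earlier months. The helper names `cache_filename` and `is_stale` are hypothetical, and the slug rule is inferred from the single filename shown, so treat this as an assumption rather than the script's actual logic.

```python
import datetime as dt
import re


def cache_filename(url, today=None):
    """Build a month-stamped cache filename from a URL."""
    today = today or dt.date.today()
    # Replace every run of non-alphanumeric characters with one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", url.lower()).strip("-")
    return f"{today:%Y-%m}-{slug}.json"


def is_stale(filename, today=None):
    """Return True if the filename does not carry the current month's prefix."""
    today = today or dt.date.today()
    return not filename.startswith(f"{today:%Y-%m}-")


name = cache_filename("https://pypi.org/pypi/s3transfer/json", dt.date(2019, 10, 14))
print(name)
print(is_stale(name, dt.date(2019, 11, 1)))
```

Putting the month in the filename means a cleanup pass only needs to compare prefixes; it never has to read file timestamps.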

top_repos.py

top_repos.py looks for the source repo of each of the top 5,000 most-downloaded packages, using a JSON file from Top PyPI Packages, and saves the results to data/top-repos.json.

First, fetch a fresh copy of the top packages:

$ wget https://hugovk.github.io/top-pypi-packages/top-pypi-packages-30-days.min.json -O data/top-pypi-packages.json

--2019-10-14 18:12:45--  https://hugovk.github.io/top-pypi-packages/top-pypi-packages-30-days.min.json
Resolving hugovk.github.io (hugovk.github.io)... 185.199.110.153, 185.199.108.153, 185.199.111.153, ...
Connecting to hugovk.github.io (hugovk.github.io)|185.199.110.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 250885 (245K) [application/json]
Saving to: ‘data/top-pypi-packages.json’

data/top-pypi-packages.json      100%[========================================================>] 245.00K  --.-KB/s    in 0.02s

2019-10-14 18:12:45 (14.7 MB/s) - ‘data/top-pypi-packages.json’ saved [250885/250885]

Check the first 10 packages:

$ python top_repos.py -n 10
Load data/top_repos.json...
Load top-pypi-packages.json...
Already done: 0
Find new repos...
1 urllib3
2 six       https://github.com/benjaminp/six
3 requests
4 botocore  https://github.com/boto/botocore
5 python-dateutil
6 certifi
7 s3transfer        https://github.com/boto/s3transfer
8 pip
9 idna      https://github.com/kjd/idna
10 docutils
Old repos: 0
New repos: 4
Not found: 6
Save data/top_repos.json...
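
The run above suggests an incremental update: load earlier results, skip packages already resolved, look up the rest, and report counts. A minimal sketch of that loop follows. The function `update_repos`, its parameters, and the choice to retry packages that were not found on a later run are assumptions for illustration; the lookup is passed in as a plain callable so the example runs without network access.

```python
def update_repos(package_names, known, lookup, limit=None):
    """Resolve repos for package_names, reusing entries already in known.

    known maps package name -> repo URL from a previous run and is updated
    in place. lookup is any callable returning a URL or None.
    """
    counts = {"old": 0, "new": 0, "not_found": 0}
    for name in package_names[:limit]:
        if name in known:
            counts["old"] += 1  # resolved on an earlier run, skip the lookup
            continue
        url = lookup(name)
        if url:
            known[name] = url
            counts["new"] += 1
        else:
            counts["not_found"] += 1
    return counts


# Package names and URLs copied from the console output above.
names = ["urllib3", "six", "requests", "botocore", "python-dateutil",
         "certifi", "s3transfer", "pip", "idna", "docutils"]
found = {
    "six": "https://github.com/benjaminp/six",
    "botocore": "https://github.com/boto/botocore",
    "s3transfer": "https://github.com/boto/s3transfer",
    "idna": "https://github.com/kjd/idna",
}

known = {}
first = update_repos(names, known, found.get, limit=10)
second = update_repos(names, known, found.get, limit=10)
print(first)
print(second)
```

In this sketch a second pass counts the four resolved packages as old and looks up the other six again.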

When running again:

Currently, it finds 3,951 repos for the top 5,000 packages.

I'm not planning on automating this, but I can run it from time to time to keep the data up to date.