hugovk / top-pypi-packages

A regular dump of the most-downloaded packages from PyPI
https://hugovk.github.io/top-pypi-packages

Is it possible to add repo name to top-pypi-packages.json? #1

Closed: cclauss closed this issue 5 years ago

cclauss commented 6 years ago
  {
    "download_count": 282748018,
    "project": "simplejson",
    "repo": "https://github.com/simplejson/simplejson"
  },
cclauss commented 6 years ago

If the project's "home page" is on github.com or github.io, we can probably make educated guesses.

If not, we can create a YAML file that maps each project to its repo. We would need to keep that file up to date as new projects appear.

It also brings up the issue of how to support non-GitHub-based projects like GitLab, etc.
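As a rough illustration of the "educated guess" idea, here is a minimal sketch that turns a github.com or github.io home page into a repo URL. The function name `guess_github_repo` is hypothetical, and the github.io rule (the subdomain is the owner, the first path segment is the repo) is an assumption that will not hold for every project.

```python
from urllib.parse import urlparse


def guess_github_repo(home_page):
    """Return a guessed https://github.com/owner/repo URL, or None."""
    if not home_page:
        return None
    parsed = urlparse(home_page)
    host = (parsed.hostname or "").lower()
    # Non-empty path segments, e.g. "/owner/repo/" -> ["owner", "repo"]
    parts = [p for p in parsed.path.split("/") if p]

    if host in ("github.com", "www.github.com") and len(parts) >= 2:
        owner, repo = parts[0], parts[1]
    elif host.endswith(".github.io") and parts:
        # Assumption: https://owner.github.io/repo is served from owner/repo
        owner, repo = host[: -len(".github.io")], parts[0]
    else:
        return None

    if repo.endswith(".git"):
        repo = repo[: -len(".git")]
    return f"https://github.com/{owner}/{repo}"
```

Anything that returns `None` here would fall through to the hand-maintained mapping file.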

hugovk commented 6 years ago

There could be a post-processing step from top-pypi-packages.json.

open top-pypi-packages.json
for each package in top-pypi-packages:
  if no repo for package:
    fetch JSON from PyPI e.g. https://pypi.python.org/pypi/simplejson/json
    if "github" or "gitlab" or something in url:
      mangle the link and store this as repo
    elif "github" or "gitlab" or something in description or long_description:
      extract and mangle the link and store this as repo
save top-pypi-packages.json
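
The pseudocode above could be sketched in Python roughly as follows. This is only an illustration under a few assumptions: the helper names (`find_repo`, `fetch_info`, `add_repos`) are hypothetical, the regular expression only recognises github.com and gitlab.com links, and the PyPI JSON is assumed to expose `home_page` and `description` under its `info` key. The rows are passed in as a list of dicts shaped like the entry at the top of this issue, so loading and saving the file is left to the caller.

```python
import json
import re
import urllib.request

# Matches links such as https://github.com/owner/repo (assumption: two path segments)
REPO_RE = re.compile(
    r"https?://(?:www\.)?(?:github\.com|gitlab\.com)/[\w.-]+/[\w.-]+"
)


def find_repo(info):
    """Look for a repo link in the home page first, then in the description."""
    for field in ("home_page", "description"):
        match = REPO_RE.search(info.get(field) or "")
        if match:
            # Drop a trailing full stop picked up from running text
            return match.group(0).rstrip(".")
    return None


def fetch_info(project):
    """Fetch the "info" section of a project's JSON from PyPI."""
    url = f"https://pypi.org/pypi/{project}/json"
    with urllib.request.urlopen(url) as response:
        return json.load(response)["info"]


def add_repos(rows, fetch=fetch_info):
    """Fill in a "repo" key for every row that does not have one yet."""
    for row in rows:
        if not row.get("repo"):
            repo = find_repo(fetch(row["project"]))
            if repo:
                row["repo"] = repo
    return rows
```

Passing `fetch` as a parameter keeps the link extraction testable without network access, for example `add_repos(rows, fetch=lambda name: {"home_page": "https://github.com/simplejson/simplejson"})`.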

Perhaps the mapping of project -> repo would be better in a second JSON file? That way, as projects drop off the bottom of the list and later rejoin it, their repos won't be lost and won't need re-adding. Manual corrections or additions won't be lost either.
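
A minimal sketch of how such a second file might be maintained, assuming it is a flat JSON object of project name to repo URL. The function name `update_repo_map` and the file layout are hypothetical. Existing entries are never overwritten, so hand-made corrections survive later runs, and projects that have dropped out of the top list are kept.

```python
import json
from pathlib import Path


def update_repo_map(path, found):
    """Merge newly found repos into the mapping file without overwriting entries."""
    path = Path(path)
    repos = json.loads(path.read_text()) if path.exists() else {}
    for project, repo in found.items():
        # Existing (possibly hand-edited) entries win over fresh guesses
        repos.setdefault(project, repo)
    path.write_text(json.dumps(repos, indent=2, sort_keys=True) + "\n")
    return repos
```

Sorting the keys keeps diffs of the mapping file small between runs.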

hugovk commented 5 years ago

I've written a couple of scripts to make a separate JSON file of repos, have a look at:

Currently, it finds 3,951 repos for the top 5,000 packages. I'm not planning on automating this, but can run it from time to time to update it.