sethmlarson / pypi-data

Data about packages and maintainers on PyPI
Apache License 2.0

Package URLs table: not adding package_urls #13

Closed hugovk closed 2 years ago

hugovk commented 2 years ago

Looking at https://pypi.org/project/urllib3/ it has four project URLs:

[image: screenshot of the project links on the PyPI page]

https://pypi.org/pypi/urllib3/json has:

"bugtrack_url": null,
...
"docs_url": null,
"download_url": "",
...
"home_page": "https://urllib3.readthedocs.io/",
...
"project_urls": {
  "Code": "https://github.com/urllib3/urllib3",
  "Documentation": "https://urllib3.readthedocs.io/",
  "Homepage": "https://urllib3.readthedocs.io/",
  "Issue tracker": "https://github.com/urllib3/urllib3/issues"
},

But the database only contains https://urllib3.readthedocs.io/:

$ sqlite3 'pypi.db' 'SELECT package_name, url FROM package_urls WHERE package_name = "urllib3" GROUP BY package_name;'
urllib3|https://urllib3.readthedocs.io/

Looks like the problem is here:

https://github.com/sethmlarson/pypi-data/blob/5056dede268daecee40a4b54fa14beb45bfc546c/main.py#L397-L410

We're adding resp["info"].get("home_page") to project_urls, and this is the one that is in the database.

But when looping over the project_urls mapping from the API response like this:

        for project_url in resp["info"].get("project_urls") or ():
            project_urls.append(project_url)

We're adding the "Code", "Documentation", "Homepage" and "Issue tracker" keys to the list instead of the URL values, because iterating over a dict yields its keys. The keys are then discarded when they fail to parse as URLs.
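
A minimal sketch of one possible fix, iterating over the dict's values rather than its keys. The helper name `collect_project_urls` and the de-duplication step are assumptions for illustration, not code from main.py:

```python
def collect_project_urls(info):
    """Return the home page plus every project URL value, without duplicates.

    Hypothetical helper: `info` stands in for resp["info"] from the PyPI JSON API.
    """
    urls = []
    home_page = info.get("home_page")
    if home_page:
        urls.append(home_page)
    # Iterating a dict directly yields its keys ("Code", "Documentation", ...),
    # so iterate over .values() to get the URLs themselves.
    for url in (info.get("project_urls") or {}).values():
        # Skipping duplicates is an assumption; "Homepage" often repeats home_page.
        if url and url not in urls:
            urls.append(url)
    return urls


# Sample data copied from the urllib3 JSON excerpt above.
info = {
    "home_page": "https://urllib3.readthedocs.io/",
    "project_urls": {
        "Code": "https://github.com/urllib3/urllib3",
        "Documentation": "https://urllib3.readthedocs.io/",
        "Homepage": "https://urllib3.readthedocs.io/",
        "Issue tracker": "https://github.com/urllib3/urllib3/issues",
    },
}
print(collect_project_urls(info))
```

With the urllib3 data this collects three distinct URLs (docs, code, issue tracker) instead of only the home page. The `or {}` keeps the loop safe when "project_urls" is null in the API response.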

sethmlarson commented 2 years ago

Thanks for opening this and https://github.com/sethmlarson/pypi-data/issues/14. Would you be willing to contribute a PR fixing both of these?

hugovk commented 2 years ago

Yep, will have a look!