pypi-data / data

Public datasets with per-file information about packages uploaded to PyPI.
MIT License

Capture more PyPI-specific and dependency metadata about packages #12

Open sethmlarson opened 1 year ago

sethmlarson commented 1 year ago

Hello @orf, I absolutely love https://py-code.org! Thank you for creating this service.

I manually maintain my own dataset about Python packages available on PyPI (but more around dependency metadata and PyPI-specific information like maintainers). Do you have any interest in supporting these use-cases? Would happily stop maintaining my own dataset and point to py-code if this information is made available (your dataset is much more automated and has a nice frontend :sparkles:)

Let me know what you think, and thanks again!

orf commented 1 year ago

Hey! I absolutely do, something like this is the next phase of the "pypi-data cinematic universe". I have some of this raw data already captured from PyPI, but it seems you have enriched it a bit.

Right now we have a few disconnected pieces that we can jam together to do cool things:

  1. We have the raw pypi JSON data on releases
  2. We have all the code
  3. We have metadata on the contents of pypi archives

With this you can:

  1. Find the unique git OIDs of all some-interesting-file-name.py files, or others matching a specific pattern
  2. Fetch and parse the contents of those files to extract some interesting metrics, producing a mapping of {git_oid: stats}
  3. Turn the mapping of {git_oid: stats} to {(project_name, project_version): stats} using the git_oid and the datasets in this repo
  4. Turn {(project_name, project_version): stats} into anything, by joining the (project_name, project_version) on another dataset (like yours)
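Steps 2–4 above are essentially a chain of dictionary joins. Here's a minimal Python sketch of that chain; the OIDs, stats, and datasets below are all hypothetical placeholders, not real data from this repo:

```python
# Step 2 output (hypothetical): stats extracted per unique git OID
oid_stats = {
    "abc123": {"classes": 3},
    "def456": {"classes": 1},
}

# Step 3: map git OIDs to releases using a (hypothetical) lookup built
# from the datasets in this repo
oid_to_release = {
    "abc123": ("requests", "2.31.0"),
    "def456": ("flask", "3.0.0"),
}
release_stats = {oid_to_release[oid]: stats for oid, stats in oid_stats.items()}

# Step 4: join on another dataset keyed by (project_name, project_version)
other_dataset = {("requests", "2.31.0"): {"downloads": 1_000_000}}
joined = {
    key: {**stats, **other_dataset[key]}
    for key, stats in release_stats.items()
    if key in other_dataset
}
print(joined)
# {('requests', '2.31.0'): {'classes': 3, 'downloads': 1000000}}
```

In practice step 4 would be a join in SQL/DuckDB/pandas rather than dict merging, but the shape is the same: `(project_name, project_version)` is the shared key.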

So with this we could parse all .py files, count the number of classes, and plot "classes written over time, segmented by PyPI trove classifier/other pypi metadata/number of downloads/maintainer/whatever".
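The "count the number of classes" part of step 2 is a few lines with the stdlib `ast` module; a sketch (the example source is made up):

```python
import ast

def count_classes(source: str) -> int:
    """Count class definitions (including nested ones) in Python source."""
    tree = ast.parse(source)
    return sum(isinstance(node, ast.ClassDef) for node in ast.walk(tree))

example = """
class A:
    class Inner:
        pass

class B:
    pass
"""
print(count_classes(example))  # 3
```

Any other per-file metric (function counts, decorator usage, etc.) slots into the same `{git_oid: stats}` shape by swapping out the visitor logic.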

The problem is that this is all disconnected and a bit shit. I want this to be relatively seamless because I'm sick of doing it manually 😂.

I'm working on a CLI tool to handle steps 1, 2, and 3 for users, but step 4 is pretty interesting.

Perhaps we could take the pypi-json-data dataset, enrich it a bit and provide it in some format that can be used as part of this workflow?

That data could also be explorable via py-code.org; I've been thinking of adding some info from pypi-json-data to the site, though I'm not sure what format it should be in.