pypi-data / data

Public datasets with per-file information about packages uploaded to PyPI.
MIT License

Capture more PyPI-specific and dependency metadata about packages #12

Open sethmlarson opened 1 year ago

sethmlarson commented 1 year ago

Hello @orf, I absolutely love https://py-code.org! Thank you for creating this service.

I manually maintain my own dataset about Python packages available on PyPI (but more around dependency metadata and PyPI-specific information like maintainers). Do you have any interest in supporting these use-cases? Would happily stop maintaining my own dataset and point to py-code if this information is made available (your dataset is much more automated and has a nice frontend :sparkles:)

Let me know what you think, and thanks again!

orf commented 1 year ago

Hey! I absolutely do, something like this is the next phase of the "pypi-data cinematic universe". I have some of this raw data already captured from PyPI, but it seems you have enriched it a bit.

Right now we have a few disconnected pieces that we can jam together to do cool things:

  1. We have the raw pypi JSON data on releases
  2. We have all the code
  3. We have metadata on the contents of pypi archives

With this you can:

  1. Find the unique git OIDs of all some-interesting-file-name.py files, or others matching a specific pattern
  2. Fetch and parse the contents of those files to extract some interesting metrics, producing a mapping of {git_oid: stats}
  3. Turn the mapping of {git_oid: stats} to {(project_name, project_version): stats} using the git_oid and the datasets in this repo
  4. Turn {(project_name, project_version): stats} into anything, by joining the (project_name, project_version) on another dataset (like yours)
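Steps 2–4 above are essentially a chain of dictionary joins. Here's a minimal Python sketch of that chain; the OIDs, stats, and datasets below are all hypothetical placeholders, not real data from this repo:

```python
# Step 2 output (hypothetical): stats extracted per unique git OID
oid_stats = {
    "abc123": {"classes": 3},
    "def456": {"classes": 1},
}

# Step 3: map git OIDs to releases using a (hypothetical) lookup built
# from the datasets in this repo
oid_to_release = {
    "abc123": ("requests", "2.31.0"),
    "def456": ("flask", "3.0.0"),
}
release_stats = {oid_to_release[oid]: stats for oid, stats in oid_stats.items()}

# Step 4: join on another dataset keyed by (project_name, project_version)
other_dataset = {("requests", "2.31.0"): {"downloads": 1_000_000}}
joined = {
    key: {**stats, **other_dataset[key]}
    for key, stats in release_stats.items()
    if key in other_dataset
}
print(joined)
# {('requests', '2.31.0'): {'classes': 3, 'downloads': 1000000}}
```

In practice step 4 would be a join in SQL/DuckDB/pandas rather than dict merging, but the shape is the same: `(project_name, project_version)` is the shared key.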

So with this we could parse all .py files, count the number of classes, and plot "classes written over time, segmented by PyPI trove classifier/other pypi metadata/number of downloads/maintainer/whatever".
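The "count the number of classes" part of step 2 is a few lines with the stdlib `ast` module; a sketch (the example source is made up):

```python
import ast

def count_classes(source: str) -> int:
    """Count class definitions (including nested ones) in Python source."""
    tree = ast.parse(source)
    return sum(isinstance(node, ast.ClassDef) for node in ast.walk(tree))

example = """
class A:
    class Inner:
        pass

class B:
    pass
"""
print(count_classes(example))  # 3
```

Any other per-file metric (function counts, decorator usage, etc.) slots into the same `{git_oid: stats}` shape by swapping out the visitor logic.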

The problem is that this is all disconnected and a bit shit. I want this to be relatively seamless because I'm sick of doing it manually 😂.

I'm working on a CLI tool to handle steps 1, 2, and 3 for users, but step 4 is pretty interesting.

Perhaps we could take the pypi-json-data dataset, enrich it a bit and provide it in some format that can be used as part of this workflow?

That data could also be explorable via py-code.org; I've been thinking of adding some info from pypi-json-data to the site, though I'm not sure what format it should be in.