Fix how Warehouse stores metadata (per-file, not per-release)

Task list:

[ ] Add a FileMetadata model and an optional 1-1 relationship to the File model. The fields of this model should correspond to the available Core Metadata fields.
[ ] Add a metadata_source field to FileMetadata that lets us determine if the metadata is "provided" (i.e. from the POST) or "extracted" (i.e. from the artifact itself).
[ ] Modify the upload endpoint to start creating FileMetadata objects on new uploads, discerning how we got the metadata (wheels should always be "extracted", source distributions will be "provided")
[ ] Write a task to backfill FileMetadata files for wheels that have already been uploaded with data from our metadata backfill files (#15526 will be helpful here)
[ ] Create a way to select the "metadata source" for a release (the oldest file for that release)
[ ] Drop all the various metadata fields on the Release model and update the UI to get them from the "metadata source" instead.

Original issue

Describe the bug Warehouse's API gives the user sometimes inaccurate dependency metadata for a project's release, because:

we only store the dependency information per Release, but it can vary per file.
first file uploaded creates the Release object. which is also problematic -- if the user uploads a source distribution first, no dependency information is encoded for the Release

Expected behavior As I understand it, we should change how we store and provide dependency information, recording and storing it per file instead of per release. I presume this means that the requires_dist field within the release endpoint would move from the "info" value to the individual "releases" values.

To Reproduce Sorry, I don't have one to hand.

Additional context Quoting a conversation in IRC today between @dstufft and @techalchemy (quotes are condensed and edited for grammar):

@dstufft said of the current Warehouse JSON API, "I don't think it's usable in it's current form for resolving dependencies". Regarding the metadata available, which clients would otherwise need to download packages to acquire, "the data is wrong is the main thing ... for dep info .... because warehouse (and originally PyPI's data model) is wrong. We only store the dependency information per Release, but it can vary per file."

@techalchemy asked: "so which file do you pick for parsing dependencies? the first wheel that gets uploaded? Or the last one?" @dstufft: "first file uploaded creates the Release object. which is also problematic, if you upload a sdist first no dependency information is encoded.... At one point only twine worked to upload dependency information. If you uploaded with setuptools it didn't get sent no matter what."

Donald also noted, on parseability of that info, "We [Warehouse] do not currently parse anything inside of a wheel, in part because we never did, in part because upload already takes forever and the more stuff we do the longer it takes. I think our timeout on upload is multiple minutes, because that's how long it takes sometimes." (That's a reason for #7730 but we should not block on that.)

"We might want to tweak the JSON API a bit just to make it suitable for the primary use case I think people want it for, and when I say tweak, I basically mean add a field or two to a dict inside of alist"

@cooperlees is now working on https://github.com/pypa/packaging-problems/issues/367 which is partially blocked by this bug in Warehouse.

I've been thinking about this again, so I looked at exactly what we're storing per Project, Release, and File in terms of metadata.

Currently our metadata looks like:

Project: Name
Release: Version
Release: Author
Release: Author Email
Release: Maintainer
Release: Maintainer Email
Release: Home Page
Release: Download-URL
Release: Project-URL
Release: License
Release: Summary
Release: Description
Release: Keywords
Release: Platform
Release: Requires-Python
Release: Yanked / Yanked Reason
Release: Classifiers
Release: Uploader / Uploaded-Via
Release: Requires / Provides / Obsoletes
Release: Requires-Dist / Provides-Dist / Obsoletes-Dist
Release: Requires-External
File: Python Version
File: Requires-Python (Denormalized from Release via PostgreSQL)
File: Package-Type
File: Comment
File: Filename
File: Signature
File: MD5 / SHA256 / Blake2-256 Digests
File: Upload-Time
File: Uploaded-Via (Duplicated during Upload via Warehouse)

Since all of this metadata comes from uploading a file, technically every single one of these besides Name and Version can vary file to file, but practically speaking much of it is exactly the same for each file within a release.

Additionally, from an implementation point of view, it's a lot easier in the Web UI, and also just general data model and conceptual overhead when most of these values are the same for each release.

Looking though this, the main thing I see that we should change is:

Requires / Provides / Obsoletes should move from Release to File
Requires-Dist / Provides-Dist / Obsoletes-Dist should move from Release to File
Requires-External should move from Release to File
Uploader / Uploaded-Via should exist on both Release and File

You could make an argument that Requires-Python and Yanked / Yanked Reason should move from Release to File.

I think that for Requires-Python, we display that in the Web UI, so it makes sense to want that to be consistent, but more importantly, the intent of that key is that you can make a new release that doesn't support some version(s) of Python without a new release breaking existing users. For that, I think it makes sense to keep it consistent, since every file should be the same version of the code.

For yanking, I think that we've made a conscious choice to implement yanking in terms of a yanking a whole release, not an individual file, which I think is perfectly fine.

So I believe the original issue was roughly accurate, we primarily just need to store dependency information on File, and the rest can pretty much remain on Release. I assumed that would be the case, but I wanted to take a few minutes to dig into it to verify it.

WARNING: I've done a fair amount of thinking on this, and in an effort to get that information out of my head I'm just going to brain dump on this comment.

There are a few inter playing concerns here:

The existing metadata stored in the Warehouse DB does not come from the files that are uploaded to Warehouse, but rather as part of the upload we expect the uploader to POST the data alongside the file, and we store that instead.
- This means that an uploader is responsible for providing correct information to Warehouse, with nothing actually checking that, so it's a potential place for incorrect clients to cause different installers to do things differently.
- One of the goals for this metadata is to present it to installers, but currently most installers rely on the metadata in the file, if we just suddenly switch to storing the POST metadata per file those installers will switch from relying on the file metadata to the database metadata.
Not all of the artifacts have metadata that is parseable from the file, particularly sdists.
- This is true even for new style sdists which have a standard for static metadata, because that standard has an escape hatch that lets a sdist mark a particular bit of metadata as dynamic, even in the static metadata.
Warehouse doesn't want to provide a UI that treats each file as it's own distinct entity, but wants to provide a UI that treats the release as the distinct entity, that holds a number of files.
- This doesn't mean that we won't want to move some of the metadata that we treat as release wide, to be file specific in the Warehouse UI.
The primary driving force for this data is to present it to installers, so we need to make sure how we solve this, will work for them.

The actual mechanics of storing metadata per file is straight forward, either add columns to the existing File model, or create a FileMetdata model that has a foreign key to the File.

Most of these problems come down to how do we source this metadata (for existing files, for new files in the interim, and long term), but there's also a question on how we handle sdists which may have dynamic metadata.

Personally, I think the gold standard for what we want to arrive at here is that we have a source of per file metadata, that we guarantee matches the content of the files. If we can't get that information for a file, then we don't attempt to "best guess it", it either comes from the file or we don't provide it. An important aspect here is we'll need to determine the difference between unset and set to nothing.

In the interim to getting to that point, we can have some "best guess" information, if we know that we can later fill it in, but we shouldn't provide "best guess" information, and then delete it completely.

I think the fastest path forward then is we end up with a FileMetadata model (or columns on File), that we only fill in for Wheels currently from the POST, because we know that wheels have static metadata so we can always backfill it. However we don't know that for sdists, so we don't try to guess based on the POST data.

Once we evolve the upload API so that we can introspect artifacts and get metadata from them, then we can switch to using the from-the-artifact data, including in cases where we have that information inside of an sdist.

We can also backfill this "best guess" data at anytime by iterating over the wheels (and actually sdists) that are already uploaded and fill in t hat data with "validated" data. We could do this after we have the upload introspection, or even before if we record that we've done it already.

Then there's the question of handling the metadata that we want to treat as "release" metadata in the Warehouse UI and existing APIs, but which obviously come from one of the files (like summary, long description, URLs, whatever). For this we can do a number of things, either just duplicate the data at the Release level and leave the existing columns there, or have some file end up being promoted as the "metadata source" for that Release (could be done using some logic like first release, or always the sdist, or whatever, or even just an FK and do first uploaded).

So tl;dr I suspect our best path forward here:

Add a FileMetdata or more columns on File, figure out how to mark unset vs nothing, figure out how to mark if it comes from within an artifact or via the POST data.
Start storing that data, only for wheels via the POST data, marked appropriately.
Backfill the above with data from Release only for Wheels.
Optional: Celery task that looks for files that have not been extracted from the artifact, and does that, marking appropriately.
- This might want to validate the metadata strictly before doing extraction?
Either keep storing all the various Release columns, or make a FK or something to still let referencing one of the FileMetadata instances.

Then at some point in the future, when we can introspect files on upload, we get validated data on upload.

It's been a while and a few things have changed since https://github.com/pypi/warehouse/issues/8090#issuecomment-1412819768 (namely that we've started extracting metadata from wheels, and have backfilled metadata files for all wheels).

I've updated the first comment (https://github.com/pypi/warehouse/issues/8090#issue-636568899) with a task list that I think accurately describes the work we need to do here.

pypi / warehouse

Fix how Warehouse stores metadata (per-file, not per-release) #8090