Open brainwane opened 4 years ago
@cooperlees is now working on https://github.com/pypa/packaging-problems/issues/367 which is partially blocked by this bug in Warehouse.
I've been thinking about this again, so I looked at exactly what we're storing per `Project`, `Release`, and `File` in terms of metadata.
Currently our metadata looks like:
Since all of this metadata comes from uploading a file, technically every single one of these besides `Name` and `Version` can vary file to file, but practically speaking much of it is exactly the same for each file within a release.
Additionally, from an implementation point of view, things are a lot easier in the Web UI, and in terms of general data-model and conceptual overhead, when most of these values are the same for each release.
Looking through this, the main thing I see that we should change is that the dependency-related fields should move from `Release` to `File` (with some fields arguably belonging to both `Release` and `File`).

You could make an argument that `Requires-Python` and Yanked / Yanked Reason should move from `Release` to `File` as well.
I think that for Requires-Python, we display that in the Web UI, so it makes sense to want that to be consistent, but more importantly, the intent of that key is that you can make a new release that doesn't support some version(s) of Python without a new release breaking existing users. For that, I think it makes sense to keep it consistent, since every file should be the same version of the code.
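The point about `Requires-Python` can be made concrete with a small sketch. This is hypothetical code, not pip's or Warehouse's actual implementation: it only handles `">=X.Y"` specifiers (to stay dependency-free) and shows why a file whose `requires_python` excludes the running interpreter must simply be skipped during resolution, so a new release that drops a Python version doesn't break existing users.

```python
# Hypothetical sketch: how an installer uses per-file requires_python to
# skip releases that don't support the running interpreter. Only ">=X.Y"
# specifiers are handled, to keep the sketch dependency-free.

def supports(requires_python, interpreter):
    """Return True if `interpreter` (e.g. (3, 7)) satisfies a ">=X.Y" spec."""
    if not requires_python:  # unset metadata: assume compatible
        return True
    assert requires_python.startswith(">=")
    major, minor = requires_python[2:].split(".")
    return interpreter >= (int(major), int(minor))

def usable_files(files, interpreter):
    """Filter a release's files down to those the interpreter may install."""
    return [f for f in files if supports(f.get("requires_python"), interpreter)]

files = [
    {"filename": "pkg-2.0-py3-none-any.whl", "requires_python": ">=3.8"},
    {"filename": "pkg-2.0.tar.gz", "requires_python": ">=3.8"},
]
# A Python 3.7 user resolving "pkg" quietly skips the 2.0 release rather
# than installing it and breaking -- but only if every file in the release
# agrees on the value.
```

If the files within one release disagreed on `requires_python`, a resolver could pick different artifacts for the same release on different interpreters, which is exactly the inconsistency the comment argues against.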
For yanking, I think that we've made a conscious choice to implement yanking in terms of a yanking a whole release, not an individual file, which I think is perfectly fine.
So I believe the original issue was roughly accurate: we primarily just need to store dependency information on `File`, and the rest can pretty much remain on `Release`. I assumed that would be the case, but I wanted to take a few minutes to dig into it to verify it.
WARNING: I've done a fair amount of thinking on this, and in an effort to get that information out of my head I'm just going to brain dump on this comment.
There are a few interplaying concerns here:

- We don't currently read metadata out of the files themselves; uploaders `POST` the data alongside the file, and we store that instead.
- Once we expose metadata per file, installers will switch from relying on the in-file metadata to the database metadata.

The actual mechanics of storing metadata per file are straightforward: either add columns to the existing `File` model, or create a `FileMetadata` model that has a foreign key to `File`.
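The second option can be sketched roughly as follows. Warehouse's real models are SQLAlchemy, and these field names are my assumptions, not the actual schema; plain dataclasses are used here just to show the shape of an optional 1-1 relationship.

```python
# Shape sketch only -- not Warehouse's actual SQLAlchemy models.
from dataclasses import dataclass
from typing import Optional

@dataclass
class File:
    filename: str
    # Optional 1-1: stays None until we have metadata for this file,
    # standing in for a nullable foreign key to FileMetadata.
    metadata: Optional["FileMetadata"] = None

@dataclass
class FileMetadata:
    # Fields would mirror Core Metadata; only two are shown here.
    requires_dist: Optional[list] = None  # None == never set
    requires_python: Optional[str] = None

wheel = File("pkg-1.0-py3-none-any.whl")
wheel.metadata = FileMetadata(requires_dist=["requests>=2.0"])

sdist = File("pkg-1.0.tar.gz")  # no trustworthy metadata yet, so None
```

The separate-model variant keeps the wide, mostly-nullable metadata columns out of the hot `File` table, at the cost of a join; adding columns directly to `File` is the simpler alternative the comment also allows.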
Most of these problems come down to how we source this metadata (for existing files, for new files in the interim, and long term), but there's also a question of how we handle sdists, which may have dynamic metadata.
Personally, I think the gold standard for what we want to arrive at here is a source of per-file metadata that we guarantee matches the content of the files. If we can't get that information for a file, then we don't attempt to "best guess" it: it either comes from the file or we don't provide it. An important aspect here is that we'll need to distinguish between unset and set to nothing.
In the interim, we can have some "best guess" information if we know that we can later fill it in, but we shouldn't provide "best guess" information and then delete it completely.
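The unset-vs-set-to-nothing distinction maps naturally onto `None` versus an empty value. A minimal sketch (names are illustrative, not Warehouse's):

```python
# Sketch of the distinction the comment calls for: None means "we never
# got this field, fall back to the file itself", while an empty list means
# "the file explicitly declares no dependencies". Conflating the two would
# make a dependency-free package indistinguishable from missing data.

def describe(requires_dist):
    if requires_dist is None:
        return "unknown: metadata not available, consult the file"
    if not requires_dist:
        return "known: this file has no dependencies"
    return f"known: {len(requires_dist)} dependencies"

print(describe(None))
print(describe([]))
print(describe(["packaging>=23.0"]))
```

In a database this is the difference between a NULL column and an empty JSON array / zero rows in a child table, which is why the schema has to be designed for it up front.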
I think the fastest path forward, then, is that we end up with a `FileMetadata` model (or columns on `File`) that, for now, we fill in only for wheels, from the `POST` data, because we know that wheels have static metadata so we can always backfill it. However, we don't know that for sdists, so we don't try to guess based on the `POST` data.
Once we evolve the upload API so that we can introspect artifacts and get metadata from them, then we can switch to using the from-the-artifact data, including in cases where we have that information inside of an sdist.
We can also backfill this "best guess" data at any time by iterating over the wheels (and actually sdists) that are already uploaded and filling in that data with "validated" data. We could do this after we have the upload introspection, or even before if we record that we've done it already.
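That backfill could look roughly like this. Everything here is hypothetical: `extract_metadata` stands in for real wheel introspection, and the `metadata_source` marker is the "record that we've done it already" from the comment, which also makes the pass idempotent.

```python
# Hypothetical backfill pass over already-uploaded files: read the static
# metadata out of each wheel and record it as validated/extracted. Sdists
# are skipped because their metadata may be dynamic.

def backfill(files, extract_metadata):
    for f in files:
        if not f["filename"].endswith(".whl"):
            continue  # sdists: can't trust POST-provided data, skip for now
        if f.get("metadata_source") == "extracted":
            continue  # already backfilled -- safe to re-run the pass
        f["metadata"] = extract_metadata(f["filename"])
        f["metadata_source"] = "extracted"

files = [
    {"filename": "pkg-1.0-py3-none-any.whl"},
    {"filename": "pkg-1.0.tar.gz"},
]
backfill(files, lambda name: {"requires_dist": []})
```

Running the pass twice is a no-op for already-processed wheels, which is what lets it be done "even before" upload-time introspection exists.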
Then there's the question of handling the metadata that we want to treat as "release" metadata in the Warehouse UI and existing APIs, but which obviously comes from one of the files (like summary, long description, URLs, whatever). For this we can do a number of things: either just duplicate the data at the `Release` level and leave the existing columns there, or have some file end up being promoted as the "metadata source" for that `Release` (could be done using some logic like first release, or always the sdist, or whatever, or even just an FK and use the first uploaded file).
So, tl;dr, I suspect our best path forward here is:

- Add a `FileMetadata` model (or more columns on `File`); figure out how to mark unset vs. set to nothing, and how to mark whether the data comes from within an artifact or via the `POST` data.
- Fill it in from the `POST` data, marked appropriately, only for wheels.
- For the release-level metadata, either keep the existing `Release` columns, or make an FK or something to still let a `Release` reference one of the `FileMetadata` instances.

Then at some point in the future, when we can introspect files on upload, we get validated data on upload.
It's been a while and a few things have changed since https://github.com/pypi/warehouse/issues/8090#issuecomment-1412819768 (namely that we've started extracting metadata from wheels, and have backfilled metadata files for all wheels).
I've updated the first comment (https://github.com/pypi/warehouse/issues/8090#issue-636568899) with a task list that I think accurately describes the work we need to do here.
Task list:

- Add a `FileMetadata` model and an optional 1-1 relationship to the `File` model. The fields of this model should correspond to the available Core Metadata fields.
- Add a `metadata_source` field to `FileMetadata` that lets us determine if the metadata is "provided" (i.e. from the `POST`) or "extracted" (i.e. from the artifact itself).
- Create `FileMetadata` objects on new uploads, discerning how we got the metadata (wheels should always be "extracted", source distributions will be "provided").
- Backfill `FileMetadata` for wheels that have already been uploaded, with data from our metadata backfill files (#15526 will be helpful here).
- Remove these fields from the `Release` model and update the UI to get them from the "metadata source" instead.

Original issue
Describe the bug

Warehouse's API sometimes gives the user inaccurate dependency metadata for a project's release, because:

Expected behavior

As I understand it, we should change how we store and provide dependency information, recording and storing it per file instead of per release. I presume this means that the `requires_dist` field within the release endpoint would move from the "info" value to the individual "releases" values.

To Reproduce

Sorry, I don't have one to hand.
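The proposed API change can be illustrated with two response shapes. These dicts are illustrative only, not the actual PyPI JSON API payloads: the point is simply that `requires_dist` moves from the release-wide "info" object to each per-file entry, since it can differ per file.

```python
# Illustrative sketch of the proposed JSON API change (not real PyPI output).

current_shape = {
    "info": {"requires_dist": ["requests>=2.0"]},  # one value for the release
    "urls": [{"filename": "pkg-1.0-py3-none-any.whl"}],
}

proposed_shape = {
    "info": {},
    "urls": [
        {"filename": "pkg-1.0-py3-none-any.whl",
         "requires_dist": ["requests>=2.0"]},
        {"filename": "pkg-1.0.tar.gz",
         "requires_dist": None},  # unknown for this sdist, not "no deps"
    ],
}

# A resolver then reads dependencies per file instead of per release:
deps = {f["filename"]: f["requires_dist"] for f in proposed_shape["urls"]}
```

Note the sdist entry uses `None` rather than an empty list, carrying the unset-vs-set-to-nothing distinction from the discussion above through to API consumers.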
Additional context Quoting a conversation in IRC today between @dstufft and @techalchemy (quotes are condensed and edited for grammar):
@dstufft said of the current Warehouse JSON API, "I don't think it's usable in its current form for resolving dependencies". Regarding the metadata available, which clients would otherwise need to download packages to acquire, "the data is wrong is the main thing ... for dep info ... because Warehouse's (and originally PyPI's) data model is wrong. We only store the dependency information per Release, but it can vary per file."
@techalchemy asked: "so which file do you pick for parsing dependencies? the first wheel that gets uploaded? Or the last one?" @dstufft: "first file uploaded creates the Release object. which is also problematic, if you upload a sdist first no dependency information is encoded.... At one point only twine worked to upload dependency information. If you uploaded with setuptools it didn't get sent no matter what."
Donald also noted, on parseability of that info, "We [Warehouse] do not currently parse anything inside of a wheel, in part because we never did, in part because upload already takes forever and the more stuff we do the longer it takes. I think our timeout on upload is multiple minutes, because that's how long it takes sometimes." (That's a reason for #7730 but we should not block on that.)
"We might want to tweak the JSON API a bit just to make it suitable for the primary use case I think people want it for, and when I say tweak, I basically mean add a field or two to a dict inside of a list."