brettcannon opened 2 years ago
> To me, the whole idea of `metadata.json` seems... worse than what we have now?

That's fine, but that's way out in the future in terms of a potential discussion. As of right now we are just trying to figure out how to get what's in `METADATA` into an object so it can be worked with. The idea of transitioning to `metadata.json` is totally a moonshot dream of mine and not something being seriously discussed ATM.
Over in #569, discussions with @dstufft about `packaging.metadata` and trying to support both the raw and validated data use cases came up while discussing the current `Metadata` class in `main`. I thought about it a bit and I think I have a design that would work for everyone while also making me happy with how to maintain this long-term and be flexible enough to not break people if we add stricter value checks for a core metadata field later on. An untested version with code to (partially) process `pyproject.toml` can be found at:

https://gist.github.com/brettcannon/731ddd584bad01a5ee678d332a932041
Essentially I made `Metadata` have `raw_` attributes and `canonical_` attributes. Every core metadata field has a corresponding `raw_` attribute that stores raw/unprocessed data (e.g. `raw_name`). Some fields have a `canonical_` attribute which is a property that lazily does the appropriate processing of the data (e.g. `canonical_name`). This should cover everyone's use cases:

- Reading `METADATA` cheaply, by having everything go into the `raw_` attributes.
- Using `canonical_` attributes as needed means you only risk failure for data you explicitly care about being valid.
- Writing via the `canonical_` attributes means strict production of metadata.

Another perk of the separate `raw_`/`canonical_` attributes is if we change a field's requirements later on it won't break anyone who wasn't expecting it, as they will not have used the `canonical_` attribute to begin with. This has already come up thanks to PEP 685 and how we tweaked `Provides-Extra`; pre-existing code would have been using `raw_provides_extras` but those ready to adopt the new requirements could start using `canonical_provides_extras` when they are ready to.
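To make the shape of that concrete, here is a minimal, hypothetical sketch of the `raw_`/`canonical_` pattern; the real gist covers every field and is more careful about caching and invalidation, so treat this as illustration only:

```python
from packaging.utils import canonicalize_name
from packaging.version import Version


class Metadata:
    """Illustrative only: raw_* holds unprocessed strings, canonical_* parses lazily."""

    def __init__(self, *, raw_name: str = "", raw_version: str = "") -> None:
        self.raw_name = raw_name
        self.raw_version = raw_version

    @property
    def canonical_name(self) -> str:
        # Normalization happens (and can only fail) when the canonical value is asked for.
        return canonicalize_name(self.raw_name)

    @property
    def canonical_version(self) -> Version:
        return Version(self.raw_version)
```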
One other take on my comment above is to have a `RawMetadata` that just structures all the strings in a nice dataclass-like object and has methods for consuming core metadata. Then we can have a `Metadata` class whose constructor is:

```python
class Metadata:
    def __init__(self, raw_metadata: RawMetadata, /) -> None:
        self.raw = raw_metadata
```

And then all production methods go on `Metadata`. We would lose the ability to assign to the raw data object underneath and have it clear out the canonical data (see all of the `raw_` properties in my gist), but maybe that's a good thing since you shouldn't be manipulating raw data when there's a normalized version to validate with instead?
What I'm trying to do is service the use case of someone building metadata from scratch and wanting to use the normalized objects as appropriate. I'm also assuming that constructing e.g. `Version` is expensive enough to want to cache appropriately, but to also support changing the raw version string as well. But maybe the assumption should be that you will only set the version once, either raw or by constructing a `Version`, and so worrying about synchronizing between the two forms beyond the initial assignment is unnecessary? Maybe it's better to assume you will gather all of the metadata upfront and then construct the metadata objects just once? Then we just have e.g. `Metadata.version()` to read from `raw.version` and the method caches the resulting `Version` object?
That’s basically what my branch does, except:

- You can delay validation in `Metadata` until access. I think that ends up working so that people don’t have to touch raw metadata unless they need to, but it’s still there under the covers if needed.

> The default constructor takes individual metadata values.

I thought about that for specific fields that have normalized value counterparts, but I was too lazy to think it through. 😁

> There are extra constructors for creating from raw metadata or from JSON / RFC 822.

I didn't want to bother specifying all potential constructors; once again, lazy. 😁
If accepting normalized values post-creation isn't important, then I think I figured out a simple caching mechanism for using functions to access normalized values (e.g. a `version(raw: RawMetadata, /) -> Version` function).

https://gist.github.com/brettcannon/67dc464e1c838ca9dc5aa368168dae90

It's rather simple, keeps access lazy, and doesn't force validating unless one requests it. But as I said, it doesn't facilitate taking e.g. a `Version` instance and then adding it to a `RawMetadata` instance.
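The caching in that gist is roughly of this shape (a simplified, hypothetical sketch, not the gist verbatim):

```python
from packaging.version import Version


def version(raw, /, *, cache={}):  # one cache dict per accessor function, created at definition time
    """Lazily parse raw["version"] into a Version, caching the parsed result."""
    raw_version = raw["version"]
    if raw_version not in cache:
        cache[raw_version] = Version(raw_version)
    return cache[raw_version]
```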
That leaks memory doesn't it? The `cache = {}` never gets cleared and lives forever per function.
In my branch I had `Metadata` and `RawMetadata`, the idea being that you generally didn't want to interact with or think about `RawMetadata`, but if you needed access to the "raw" values, it was there. It cached the values, but it did so by caching them on the `Metadata` instance, so once that fell out of scope, so did the cache for that.
I think it's important that the primary API defaults to validating, even for things we don't know will need to be validated in the future.

That means that something like the `raw_*` and `canonical_*` approach would need to have a `canonical_*` attribute for every single metadata field, even if the canonical value is just a basic string for right now. Otherwise adding new validations requires code changes across the entire ecosystem, rather than just happening by default.

I also don't like that the `raw_` and `canonical_` split mixes the raw and validated data in the same object; it feels messy, but that's a personal thing.

This idea also means that treating the `RawMetadata` as the primary API, and having a series of functions you use to access the metadata validated, is more likely going to see people inadvertently not validate metadata, because it's easier to just not call those functions than it is to call them.
When I was designing the API in my branch, I had a few design principles in mind.

This led me to a design that looks like this:
For handling Raw Metadata (intended to have minimal validation, basically just serialization/deserialization):
```python
class RawMetadata(TypedDict, total=False):
    name: str
    version: str
    # etc

def parse_email(data: Union[bytes, str]) -> Tuple[RawMetadata, Dict[Any, Any]]:
    ...

def emit_email(raw: RawMetadata) -> bytes:
    ...

def parse_json(data: Union[bytes, str]) -> Tuple[RawMetadata, Dict[Any, Any]]:
    ...

def emit_json(raw: RawMetadata) -> bytes:
    ...
```
The intention here is that most people will not directly interact with this class or these functions; they exist entirely to handle serialization and deserialization to a common "raw" data type. You can get it to emit invalid data by passing invalid data via `RawMetadata`, and it will happily read invalid data (though it does type check it).

The extra `Dict[Any, Any]` is used for "leftover" data: unknown fields, fields that didn't type check, etc.
Thus at this level you can see if we were able to fully parse a given metadata file by doing something like:

```python
raw, leftover = parse_email(data)
if leftover:
    raise ValueError("could not parse all of the data")
```
This leftover bit is in support of (5): most parsers for metadata just silently throw away extraneous data, including duplicate keys, which is just asking for confused-deputy-style attacks to happen.

On top of this is then layered the primary API that most people are expected to work with, which looks like:
```python
class Metadata:
    name: str
    version: str
    # etc

    def __init__(self, *, name: Optional[str] = None, version: Optional[str] = None, ...) -> None:
        ...

    @classmethod
    def from_raw(cls, raw: RawMetadata, *, validate: bool = True) -> Metadata:
        ...

    @classmethod
    def from_email(cls, data: bytes | str, *, validate: bool = True) -> Metadata:
        raw, unparsed = parse_email(data)
        # Regardless of the validate attribute, we don't let unparsed data
        # pass silently; if someone wants to drop unparsed data on the floor
        # they can call parse_email themselves and pass it into from_raw.
        if unparsed:
            raise ValueError(
                f"Could not parse, extra keys: {', '.join(unparsed.keys())}"
            )
        return cls.from_raw(raw, validate=validate)

    def emit_email(self) -> bytes:
        ...

    @classmethod
    def from_json(cls, data: bytes | str, *, validate: bool = True) -> Metadata:
        ...

    def emit_json(self) -> bytes:
        ...
```
Now, the above is the API, not exactly how it's implemented, but I want to talk about the API here, not the underlying implementation.

One of the core ideas behind the `Metadata` class is that it never emits invalid metadata, and you can never access an invalid metadata field through it. In fact, by default it ensures the entire metadata structure is valid up front, including things that need to span multiple fields.

It does, however, allow you to opt into a somewhat less strict validation where items get lazily validated upon access rather than up front, meaning that an invalid version doesn't prevent you from accessing an otherwise valid name.
So breaking down the API here:
The `Metadata.__init__` exists for people to programmatically construct a `Metadata` instance without reading from an existing metadata file. It accepts all of the possible metadata fields as keyword arguments, and for any of them provided, it sets (and validates) that provided value. It does not deal with `RawMetadata`, because `RawMetadata` is a low-level API that doesn't deal with validation, and if you're using `Metadata` you want validation.

The `Metadata.from_raw` exists as the main holder of logic for taking a `RawMetadata` and turning it into a `Metadata`, while each of `Metadata.from_email` and `Metadata.from_json` just acts as a wrapper around it that abstracts away the need for the end user to think about `RawMetadata`.
I think this separation is important here. One of the ideas behind this design is that by having `RawMetadata` act as an interim layer, we can implement/maintain serialization in isolation from validation and the general programmatic APIs. It means that future formats (YAML, SQLite, whatever) can be written without making changes to `Metadata`, and they don't even have to live inside of `packaging` itself, so it makes proposing new formats somewhat easier.
The small helper wrappers around `Metadata.from_raw` also make it easier for people to do the right thing with regards to the leftover data. Since the underlying `parse_email` and `parse_json` functions don't error out on leftover data, a properly strict metadata parser has to do that check itself, which is something end users can easily forget to do. This makes it so that the easy path is also the strictest path.
The biggest set of hidden functionality in the above API is accessing the metadata itself. The API from the outside looks something like this:

```python
class Metadata:
    name: str
    version: str
    # etc
```
But I don't make those simple variables for a few reasons; the main one is that a `Metadata` instance must always be valid, so we need to validate a value before we assign it.

In Python that's pretty obviously something using the descriptor protocol, typically a property. You could implement the above like:
```python
class Metadata:
    _raw: RawMetadata

    @property
    def name(self) -> str:
        _validate_name(self._raw["name"])  # Raises if invalid
        return self._raw["name"]

    @name.setter
    def name(self, value: str):
        _validate_name(value)  # Raises if invalid
        self._raw["name"] = value

    @name.deleter
    def name(self):
        del self._raw["name"]

    # etc
```
You could then implement the non-lazy validation by just iterating over the attributes of `Metadata` and accessing each one of them in the `from_raw` method.
However, I thought that was super verbose, so I wrote an internal helper called `lazy_validation` which abstracts away the above so that it looks like:
```python
class _ValidatedMetadata(TypedDict, total=False):
    metadata_version: str
    name: str
    # etc

class Metadata:
    _raw: RawMetadata
    _validated: _ValidatedMetadata

    name = lazy_validator(
        as_str,  # Ensures that the value is in fact a str
        validators=[
            Required(),  # Errors if the value isn't provided
            RegexValidator(r"(?i)^([A-Z0-9]|[A-Z0-9][A-Z0-9._-]*[A-Z0-9])$"),  # Errors if the value doesn't match the provided regex
        ],
    )
```
That way the `Metadata` class doesn't have to get bogged down in the details of managing the cache; that's all abstracted out.
So to summarize, the design in my branch has a number of, in my opinion, really strong design points:

- When using `Metadata`, it's not possible to emit invalid metadata, nor to access invalid metadata, including as the meaning of "valid" changes in the future.
- Serialization is kept separate from validation (people don't have to use `json.loads` or `email.parser` themselves), while also making it easier because validation and serialization logic cannot get intertwined.

I guess my biggest question here is: is there something in the above API or implementation that you are uncomfortable with, or that you feel could be done better?
The alternatives proposed thus far feel like they compromise on the design goals I had when I originally wrote that PoC without getting something in return besides being different, but I may be missing a trade off that you're trying to make here!
Also as an aside: we talked about it before, but I do not think we should try cramming `pyproject.toml` into this. While some of the metadata in `pyproject.toml` will eventually end up in the core metadata, `pyproject.toml` is not a core metadata file nor could it be. See the previous discussion.
I like the idea of explicitly returning "leftover" data, as that ensures that consumers make an explicit decision on what to do (which might of course be to simply ignore it).
My use case involves extracting metadata in bulk from PyPI projects for loading into a database for queries. For that, I have to be able to handle raw values - key examples of the sort of query I'm interested in are "how many projects have invalid versions?", or "what do non-UTF8 descriptions that exist in the wild look like?" The "leftover" data would allow me to query for "what projects include invalid/nonstandard metadata fields?"
I like Brett's approach of only validating "on demand" - it's important to me to be able to do things like analyze (valid) versions without caring if the dependency data is valid, for example.
I'm perfectly fine with an API that ensures that user-constructed metadata objects only contain valid values, but I'm concerned that in the process of enforcing that, we don't lose the ability to read and manipulate potentially-invalid "raw" data. I don't have any real-world examples here, largely because this is the sort of thing I'd do as part of ad-hoc analysis, but I'm thinking of cases where I might read 1,000,000 raw metadata values (email format read from a database, for example), set the description value to `None` because I'm not interested in it and it is potentially large, and then write the resulting values out as JSON to a file for later processing. That requires the ability to round-trip invalid dependency data while modifying the in-memory `Metadata` objects.
Overall, I like the direction the discussion is going in, and I agree that the most important use cases should be:
But I think we should be mindful of other use cases - `packaging` is intended to be the canonical library for implementing packaging fundamentals, and IMO that means being usable in all situations when people are working with packaging data.
I also agree that reading `pyproject.toml` and converting it to metadata is not the job of this module. I wrote a `pyproject.toml` reader for `pkg_metadata`, and it really didn't fit well with the rest of the API. A library to read and validate PEP 621 metadata from `pyproject.toml` is a perfectly reasonable idea, but it feels better to me as a separate module (which could be part of `packaging`, or could be a standalone project).
FWIW, you get as much access to invalid data as you want:

- Use `RawMetadata` directly, which doesn't do any validation at all, it just deserializes[^1].
- Use the `parse_email()` function to get a `RawMetadata` and discard the leftover data yourself, then pass that into `Metadata.from_raw(raw, validate=False)`.
- Use `Metadata.from_email(data, validate=False)`.
- Use `Metadata.from_email(data)`.

The only case that really isn't handled, afaict, is data that can't be deserialized into `RawMetadata` at all, which would only happen in extreme cases (a JSON document that doesn't parse as JSON, etc.) where I don't think there is really a good option anyway.

The thing here is that in my branch, validating on demand is opt-in; by default when you're using the `Metadata` class, you're going to get fully valid data. We could switch the default of that around, but validating everything upfront felt like a safer default to me.

[^1]: This includes things like Keywords deserializing into a list of strings, or project URLs into a mapping. If those fail, the data would be in leftover instead.
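For illustration, the four levels above might look roughly like this in code (using the names from the proposed branch, so treat it as a sketch rather than a released API):

```python
from packaging.metadata import Metadata, parse_email  # names per the proposed branch

data = b"Metadata-Version: 2.1\nName: Example_Project\nVersion: 1.0\n"

# 1. No validation at all: just the raw, deserialized dict (plus anything leftover).
raw, leftover = parse_email(data)

# 2. Decide what to do with leftover data yourself, then defer field validation to access time.
lazy = Metadata.from_raw(raw, validate=False)

# 3. Same laziness, but the helper still refuses to silently drop leftover data.
also_lazy = Metadata.from_email(data, validate=False)

# 4. The default: everything validated up front.
strict = Metadata.from_email(data)
```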
> That leaks memory doesn't it? The `cache = {}` never gets cleared and lives forever per function.

Depends on your definition of "leak" I guess. 😉
> You could then implement the non-lazy validation by just iterating over the attributes of `Metadata` and accessing each one of them in the `from_raw` method.
>
> However, I thought that was super verbose

True, but it's a one-time cost since we are not adding new core metadata fields often.
> However, I thought that was super verbose, so I wrote an internal helper called `lazy_validation` which abstracts away the above

I don't want to re-implement pydantic, so I would want to keep the custom code for creating a custom descriptor under control, as I can see us getting carried away.
> I guess my biggest question here is: is there something in the above API or implementation that you are uncomfortable with, or that you feel could be done better?

I've been trying to avoid having two separate definitions of the core metadata, but it may not be helped in the end. You have a `TypedDict` and a class for representing raw and valid data, respectively, and I was seeing if there was a way to avoid it. I do understand wanting to avoid an API where one could be confused about whether they are working with raw or validated data. Perhaps having two very distinct APIs is the only way to really get away with that.
> we talked about it before, but I do not think we should try cramming `pyproject.toml` into this. While some of the metadata in `pyproject.toml` will eventually end up in the core metadata, `pyproject.toml` is not a core metadata file nor could it be.

It was honestly just the easiest to implement. Take it more as a demo than a plan.
> The thing here is that in my branch

Is your branch ready for review?
> This includes things like Keywords deserializing into a list of strings, or project URLs into a mapping. If those fail, the data would be in leftover instead.

I would argue that's a validation step, and so not even raw metadata should be making that call.
> I don't want to re-implement pydantic

I don't have a super strong opinion on whether the `lazy_validation` helper is "too much" or whether we'd be better off manually implementing that per property. I obviously think that the trade-off is worth it, because I wrote it to begin with, but if it goes away I won't be sad either. It doesn't affect the API, just the implementation, so it can be changed at any time.
> I've been trying to avoid having two separate definitions of the core metadata, but it may not be helped in the end. You have a `TypedDict` and a class for representing raw and valid data, respectively, and I was seeing if there was a way to avoid it. I do understand wanting to avoid an API where one could be confused about whether they are working with raw or validated data. Perhaps having two very distinct APIs is the only way to really get away with that.

I personally think that it's easier to understand and to implement to have the `Metadata` and `RawMetadata` split, as well as harder to accidentally use unvalidated data when you wanted validated data, but the core ideas could be represented by a single class. If you did that, you'd of course lose the separation of concerns between serialization and validation.
> It was honestly just the easiest to implement. Take it more as a demo than a plan.
👍
> Is your branch ready for review?

It's not ready to merge; it's missing some things:

- The proposed `Metadata.to_(raw|email|json)` functions, which should be tiny shims.
- The `Metadata.__init__()` constructor, which should be a trivial thing to write.

But it's ready for a preliminary review to see if that direction is a direction that we even want to go in, before myself or someone else puts in additional work dotting I's and crossing T's. I didn't want to invest additional time until we had some agreement that it was worthwhile to finish it up.
> I would argue that's a validation step, and so not even raw metadata should be making that call.

The way I think about serialization is that its job is to take some arbitrary bytes and turn them into the conceptually correct primitive type (this rules out parsing to `Version` or something).

So if we look at something like project URLs, that is conceptually a mapping of key to URL; however, because the RFC 822 format doesn't support mappings natively, we had to implement a secondary, field-specific serialization on top of RFC 822 that lets us represent a mapping.

So that's why that deserialization lives in the raw layer IMO. To support this, I'm not aware of any of the major tooling that has ever implemented project URLs as a list of specially formatted strings. IOW, human beings have always (or almost always) been working on the field as a mapping, and the tooling serialized it to a list of strings.

Keywords is similar, except it's been implemented as a free-form text field for long enough by major tools that I do think you could argue that it's not serialization when human beings were expected to enter that data already "serialized".
> The proposed `Metadata.to_(raw|email|json)` functions, which should be tiny shims.

I don't think that needs to hold anything up. We were prepared to release with just the object definition and no support code. Honestly, just being able to read `METADATA`/`PKG-INFO` files will be a massive win for (hopefully) pip and other installers. We can then expand into `METADATA` production for builders later.
> But it's ready for a preliminary review to see if that direction is a direction that we even want to go in, before myself or someone else puts in additional work dotting I's and crossing T's.

I can give it a review then!
> The way I think about serialization is that its job is to take some arbitrary bytes and turn them into the conceptually correct primitive type (this rules out parsing to `Version` or something).

I agree; I think the question is what the "correct primitive type" is. In the specific cases you're suggesting, at least, the format is extremely simple. But parsing of `Project-URL`, if you require a comma, can still fail. At least with `keywords` that parse can't fail, as it will just end up being poorly split. So in my head, `RawMetadata` was to be something that simply could not fail based on the format of the metadata data (the format of the actual container of that data could fail). And so for me, `Project-URL` was a potential failure point and thus a list of strings (albeit a small one). I assume that's why https://peps.python.org/pep-0566/#json-compatible-metadata only makes a special exemption for `keywords`, as the parsing of that data can't trigger an exception (and so I can get behind `keywords` being split). But I do think we need to come to an understanding on that, as this will be part of the documentation and perpetual design of `RawMetadata`.
Deserialization can always fail, right? Like if you emit JSON that looks like:

```json
{
    "foo": "bar",
}
```

That's going to fail because the syntax is wrong, and likewise parsing RFC 822 is fairly lenient, but even it can fail.

I don't see any reason why it's "OK" for `{"name": "bar",}` to fail, but not "OK" for `{"project_url": ["some url without a key"]}` to fail. And to be clear, "fail" doesn't mean raise an exception in `RawMetadata`; it means it doesn't get deserialized into `RawMetadata` and gets emitted into the `leftover` data structure.
The same thing happens if you send a float instead of a string for a version number for instance.
> I don't see any reason why it's "OK" for `{"name": "bar",}` to fail, but not "OK" for `{"project_url": ["some url without a key"]}` to fail

I view the JSON issue as being at a different layer than the data itself. Plus, I don't see parsing `Project-URL` for a label and URL as any different than trying to parse `Version`. With that view, my question becomes: why does `Project-URL` get special treatment to be eagerly parsed in `RawMetadata` but `Version` doesn't, when both have an expected structure that may or may not work?
I believe that the JSON format should follow the definition in PEP 566, which means that `keywords` is the only special case[^1], and `project_url` should be a list of strings.
> But it's ready for a preliminary review to see if that direction is a direction that we even want to go in

I'm happy to review it (can you remind me where to find it?)
@brettcannon your code at https://gist.github.com/brettcannon/731ddd584bad01a5ee678d332a932041 only seems to have a `from_pyproject` method for loading data from external sources at the moment, so I assume it's not ready for review (at least, in terms of the questions about parsing external data we're discussing here?)
[^1]: It's a single-use field in the metadata, and its single value in email format is a string, but in JSON format, and in the "raw" object, it's a list of strings.
> I view the JSON issue as being at a different layer than the data itself. Plus, I don't see parsing `Project-URL` for a label and URL as any different than trying to parse `Version`. With that view, my question becomes: why does `Project-URL` get special treatment to be eagerly parsed in `RawMetadata` but `Version` doesn't, when both have an expected structure that may or may not work?
A few reasons:

- `Version` is not a primitive type, and `RawMetadata` only emits primitive types, but a `dict[str, str]` is a primitive type.
- Nobody really works with `Project-URL` as a list of strings; conceptually it is a mapping of str to str, but RFC 822 doesn't support mappings, so an extra layer of serialization had to be added.
- Users provide project URLs as a table in `pyproject.toml`. The fact that `Project-URL` is a list of strings is an implementation detail of RFC 822. If we were defining a JSON format from scratch, the most logical serialization of it would be as a mapping, not as a list of strings.
- A `Project-URL` that can't be split is a serialization problem, whereas an invalid `Version` would be an error in the data that a user inputted.

> I believe that the JSON format should follow the definition in PEP 566, which means that `keywords` is the only special case[^1], and `project_url` should be a list of strings.
`RawMetadata` is not "the JSON format", it's the programmatic format for deserialized, but not validated, data.

One could imagine adding a YAML or a TOML form for serializing this data, and using a mapping to handle `project-url`.
> I'm happy to review it (can you remind me where to find it?)

> I assume it's not ready for review

Correct, my code was just a proposal.
> `Version` is not a primitive type, and `RawMetadata` only emits primitive types, but a `dict[str, str]` is a primitive type.
So is that what you would want the documentation to say as the guideline as to whether something gets any sort of parsing for `RawMetadata`? "Values are not validated, but when there is a simple, pragmatic representation of a value using Python's built-in types they will be used accordingly (e.g. a dict of strings)"? That way people won't ask for `Provides-Dist` to be a tuple of requirement and extra or something?

I think how to document the guideline we will follow is my sticky point in all of this. And to be clear, I'm after a guideline we can give users about how we will add typings going forward, and not a rule.
> I believe that the JSON format should follow the definition in PEP 566, which means that `keywords` is the only special case[^1], and `project_url` should be a list of strings.

> `RawMetadata` is not "the JSON format", it's the programmatic format for deserialized, but not validated, data.

I also don't know how widely the JSON format from PEP 566 is used, so I'm not sure we should feel beholden to it regardless, until we have explicit JSON serialization support.
> So is that what you would want the documentation to say as the guideline as to whether something gets any sort of parsing for `RawMetadata`? "Values are not validated, but when there is a simple, pragmatic representation of a value using Python's built-in types they will be used accordingly (e.g. a dict of strings)"? That way people won't ask for `Provides-Dist` to be a tuple of requirement and extra or something?
>
> I think how to document the guideline we will follow is my sticky point in all of this. And to be clear, I'm after a guideline we can give users about how we will add typings going forward, and not a rule.

Yea, I think something like that is the guideline I'd document.

Like, I'd rule out `Requires-Dist` because it's conceptually a list of PEP 508 requirement strings; that's the user interface that is provided. Users aren't providing tuples of extras and requirements, they're providing a string.
> I also don't know how widely the JSON format from PEP 566 is used, so I'm not sure we should feel beholden to it regardless, until we have explicit JSON serialization support.

Maybe we should simply drop the conversions to and from JSON. If we're happy that the PEP 566 JSON format isn't actually used "in the wild", then let's drop it. We should just have "from_email" and "to_email" methods[^1] that read/write the email format (which is standardised), and leave everything else for users/3rd parties to write.
[^1]: I'm not 100% comfortable with the `_email` suffix. It suggests you can pass an `email.Message` instance, and it exposes an implementation detail of the format. But `from_bytes_or_string` is just clumsy. Is there a good naming convention for methods that take (or produce) bytes or string data?
I'm fine dropping them; I included them both because the format was defined, and to validate the idea that the serialization would work for multiple formats.

I don't have a better name for those methods. I wouldn't want to use from/to bytes, because if we add JSON in the future that gets more confusing I think. Maybe `from_rfc822` and `to_rfc822` or something? I dunno.
The fact that the format is defined is what makes me hesitant. But I'd want to frame it as from and to a dict, and leave serialising the dict to the user. That would potentially be reusable for formats other than JSON, and there is some complexity in the dict <-> `RawMetadata` conversion, whereas serialising to/from JSON is covered by the json module (and 3rd party alternatives, if speed matters to you). Also, a dict format lets people use "unofficial" serialisations like YAML if they want.

The problem is, once we go from `dict` to `RawMetadata`, we have three in-memory formats (dict, raw metadata and parsed metadata), which is getting silly. But is it any more silly than people converting metadata to a dict by converting to in-memory JSON and deserialising the JSON? Which is what I considered doing for my own code...
Yeah, `_rfc822` is marginally better IMO, but only marginally. The problem is that we don't actually have a name for the format. "Metadata" is used as a general term, not specific to the file format. And really, "email" and "RFC822" are just inaccurate, as the format is actually a (mangled) subset of those formats. But I guess if method names are the worst problem we face, we've pretty much won 🙂
To be clear, my branch at least doesn't expose any dict <-> `RawMetadata` conversion. It currently supports:

- RFC 822 bytes <-> `RawMetadata`
- JSON bytes <-> `RawMetadata`
- `Metadata` <-> `RawMetadata`

It has some helper methods to let people skip the interim `RawMetadata` steps, and do:

- RFC 822 bytes <-> `Metadata`
- JSON bytes <-> `Metadata`

`RawMetadata` is "just" a `TypedDict` though, so it allows all of the same things you've mentioned for `dict`; it's just that we've got functions that handle the serialization/deserialization, because that's the only way to handle some logic (like the keywords special case).

If someone wants an unofficial serialization, they just write their own functions that add, say, YAML <-> `RawMetadata`, and then they can use that the same as they would anything else.
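As a sketch of what such a third-party function might look like (hypothetical code, not part of `packaging`; it assumes PyYAML is available, that `RawMetadata` is importable, and it only handles a couple of string fields for brevity):

```python
import yaml  # PyYAML, a third-party dependency

from packaging.metadata import RawMetadata


def parse_yaml(data: str) -> tuple[RawMetadata, dict]:
    """Deserialize YAML into the shared RawMetadata shape, mirroring parse_email/parse_json."""
    loaded = yaml.safe_load(data) or {}
    raw: RawMetadata = {}
    leftover: dict = {}
    for key, value in loaded.items():
        # Only accept fields we know about and whose type matches; everything
        # else goes into leftover rather than being silently dropped.
        if key in {"name", "version", "summary"} and isinstance(value, str):
            raw[key] = value  # type: ignore[literal-required]
        else:
            leftover[key] = value
    return raw, leftover
```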
> `RawMetadata` is "just" a `TypedDict` though, so it allows all of the same things you've mentioned for `dict`

Oh cool. I didn't know that's how `TypedDict` worked. Off to read the manuals! 🙂
`TypedDict` is "just" a dict: in mypy it has special behavior (known keys, with known value types), but at runtime it's just a dict.
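A tiny illustration of the runtime behaviour (note the key access rather than attribute access):

```python
from typing import TypedDict


class RawMetadata(TypedDict, total=False):
    name: str
    version: str


raw = RawMetadata(name="example", version="1.0")
print(type(raw))    # <class 'dict'>
print(raw["name"])  # key access; raw.name would raise AttributeError
```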
Ah, so it doesn't allow attribute-style access. OK, the definition syntax is misleading (to me) then, but fair enough.
Anyone know how to test for multi-part bodies? I copied this code from Donald's PR, but I can't figure out how to trigger the failure case:
Passing in `Message().attach(1234)` should trigger that.
> Passing in `Message().attach(1234)` should trigger that.

But is there any way to do it via some `METADATA` format? If there isn't, then maybe the code isn't necessary since the API only takes strings or bytes?
Uhh, I probably added that because something on PyPI made it happen when I was testing it against PyPI data... but I don't remember what.
At least, I don't think I would have gone out of my way to do that on my own.
Trying to serialize an object with a list payload, at least, doesn't work. It trips up in
https://github.com/pypa/packaging/pull/671 covers `RawMetadata` and `parse_email()`.
Raw metadata parsing just landed in `main`! Next is providing enriched/strict metadata.
🎉
> We can also ignore non-standard fields (e.g. `License-Files`).

If PEP 639 gets accepted, this becomes a standard field.
@nilskattenbeck-bosch correct, but that hasn't happened yet (I should know, I'm the PEP delegate 😁). Once the field exists we will bump the max version for metadata and then update the code accordingly.
May I also suggest adding a small information box to the documentation explaining why the function is called `parse_email`? By now I have read through the corresponding PEPs and specifications and understand that the metadata is serialized just like email headers and parsed using that module, but at first it felt really unnatural, as if I was using the wrong function and it would leak an abstraction detail. If other `parse_FORMAT` functions are introduced then this naming makes sense, but an explanation that this is the correct method and why it is named that way would be reassuring.
> May I also suggest adding a small information box to the documentation explaining why the function is called `parse_email`?

I would go even farther and say you can propose a PR to add such a note. 😁
With my work for reading metadata now complete, I'm personally considering my work on this issue done. Hopefully someone else will feel motivated to do the "writing" bit, but as someone who has only written tools to consume metadata I don't think I'm in a good position to drive the writing part of this.
> With my work for reading metadata now complete, I'm personally considering my work on this issue done.

Thank you very much @brettcannon for working on this.
> Hopefully someone else will feel motivated to do the "writing" bit, but as someone who has only written tools to consume metadata I don't think I'm in a good position to drive the writing part of this.

There was at least one previous PR (https://github.com/pypa/packaging/pull/498, probably more) that tried to address the "writing" capabilities for metadata; however, these were closed, probably because they didn't fit into the long-term vision that the maintainers would like for the API/implementation (which is a very good thing to have in a project).

Would it be possible (for the sake of anyone that intends to contribute the "writing" part) to have a clear guideline on how we should go about it? (This question is targeted at all `packaging` maintainers, not only Brett 😝.)

I am just concerned that throwing PRs at `packaging` without a clear design goal/acceptance criteria will just result in PRs getting closed and work hours being lost.
> Would it be possible (for the sake of anyone that intends to contribute the "writing" part) to have a clear guideline on how we should go about it?

I think the first question is: what are the requirements for the feature to write metadata? Do you need to minimize diffs for `METADATA` output that you originally read from? Or should the output be consistent across tools (i.e., field order is hard-coded)?

The other question is whether tools will build up the data in a `TypedDict` via `RawMetadata` and then use `Metadata` to do the writing, or do you make it so you build up the metadata in `Metadata` itself? And if you do the latter, do you do validation as you go, or as a last step before creating the bytes? I will admit I totally punted on this one based on how I coded `Metadata`, and it currently lends itself to building up via `RawMetadata`, as the descriptor I used is a non-data descriptor and thus the caching does not lend itself to attribute assignment.

There's also how much you want to worry about older metadata versions. Do you let the user specify the metadata version you're going to write out, and thus need to do fancier checks for what's allowed?

For me, I think consistency in output is more important than keeping pre-existing `METADATA` files to a smaller diff (since long-term that will happen naturally). As for the API, I think that's up to people like you, @abravalheri, who author build back-ends, since you will be the folks using any construction API. But what I will say is that if you build up a `RawMetadata` dict, then an API to generate the bytes is surprisingly straightforward to code up (define the descriptors in the order you want them written out, iterate over `Metadata` for all instances of `_Validator`, get the values, and then introspect on the results to know how to write out the format appropriately for strings vs lists vs dicts).
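For example, the field-discovery half of that could be as small as this (a hypothetical sketch; `_Validator` is an internal detail of the current implementation, not public API):

```python
from packaging.metadata import Metadata, _Validator


def field_order() -> list[str]:
    """List the core metadata attributes by finding the _Validator descriptors
    on Metadata, in the order they were defined."""
    return [name for name, attr in vars(Metadata).items() if isinstance(attr, _Validator)]
```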
I'd also mention that everything I've done with `METADATA` has primarily been consuming it, not emitting it, but my intuition is that it should have these properties:

- People generally build up a `Metadata` object, and that object maintains the requirement that you can only get or set valid metadata on it. This `Metadata` object has a `Metadata().to_raw()` function that returns a `RawMetadata`.
- The `RawMetadata` object is a dict at runtime or a `TypedDict` at type check time, and there is minimal validation here (basically just that primitive types match).
- A `write_email` function that takes a `RawMetadata` and returns the serialized bytes, probably also any leftover fields that it didn't know how to serialize (a rough sketch of its shape follows below).
- `Metadata().to_email()` functions that act as a wrapper around `Metadata().to_raw()` and `write_email`.

Oh ugh, emitting probably also has to answer the newlines-in-fields question, and possibly the other issues I raised with the spec earlier up the thread.
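A very rough sketch of what that `write_email` piece might look like (hypothetical and heavily simplified: only a few single-use string fields, no multi-use fields, and no Description/newline handling):

```python
def write_email(raw: dict) -> tuple[bytes, dict]:
    """Serialize a RawMetadata-shaped dict to METADATA-style bytes, returning
    anything it didn't know how to serialize as leftover."""
    known = {
        "metadata_version": "Metadata-Version",
        "name": "Name",
        "version": "Version",
        "summary": "Summary",
    }
    lines: list[str] = []
    leftover: dict = {}
    for key, value in raw.items():
        if key in known and isinstance(value, str):
            lines.append(f"{known[key]}: {value}")
        else:
            leftover[key] = value
    return ("\n".join(lines) + "\n").encode("utf-8"), leftover
```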
Just a note: I have a PR up now to Warehouse (https://github.com/pypi/warehouse/pull/14718) that switches our metadata validation on upload from a custom validation routine implemented using wtforms to `packaging.metadata`.

The split between `Metadata` and `RawMetadata` was super useful: since we (currently) have the metadata handed to Warehouse as multipart form data on the request, I was able to make a custom shim to generate a `RawMetadata` from that form data, and then just pass that into `Metadata.from_raw()`.

I'm still giving it a thorough set of manual testing, but so far it looks like, other than a few bugs/issues that fell out (https://github.com/pypa/packaging/issues/733, https://github.com/pypa/packaging/issues/735), integrating it was relatively painless. Of course the real test will be when it goes live and that wide array of nonsense that people upload starts flowing through it.

We're not yet using the actual parsing of `METADATA` files (though that is planned), so this is strictly just the validation aspect of the `Metadata` class that we're currently using.
Do you think it might make sense to keep both the old and the new code paths running for a bit on Warehouse's end, with the results of the old code path being returned?

i.e.

```python
def upload(...):
    try:
        new_result = new_upload(...)
    except Exception:
        new_result = ...
    old_result = old_upload_handler(...)
    if new_result != old_result:
        log.somehow(old_result, new_result)
    return old_result
```