pypa / packaging

Core utilities for Python packages
https://packaging.pypa.io/

Plans for `packaging.metadata` #570

Open brettcannon opened 2 years ago

brettcannon commented 2 years ago

To me, the whole idea of metadata.json seems... worse than what we have now?

That's fine, but that's way out in the future in terms of a potential discussion. As of right now we are just trying to figure out how to get what's in METADATA into an object so it can be worked with. The idea of transitioning to metadata.json is totally a moonshot dream of mine and not something being seriously discussed ATM.

brettcannon commented 1 year ago

Over in #569, discussions with @dstufft about packaging.metadata and trying to support both the raw and validated data use cases came up while discussing the current Metadata class in main. I thought about it a bit and I think I have a design that would work for everyone, make me happy with how to maintain this long-term, and be flexible enough to not break people if we add stricter value checks for a core metadata field later on. An untested version with code to (partially) process pyproject.toml can be found at:

https://gist.github.com/brettcannon/731ddd584bad01a5ee678d332a932041

Essentially I made Metadata have raw_ attributes and canonical_ attributes. Every core metadata field has a corresponding raw_ attribute that stores raw/unprocessed data (e.g. raw_name). Some fields have a canonical_ attribute which is a property that lazily does the appropriate processing of the data (e.g. canonical_name). This should cover everyone's use cases:

  1. Flexible/forgiving consumption of e.g. METADATA by having everything go into the raw_ attributes.
  2. Accessing canonical_ attributes as needed means you only risk failure for data you explicitly care about being valid.
  3. Writing out data using all available canonical_ attributes means strict production of metadata.

Another perk of the separate raw_/canonical_ attributes is if we change a field's requirements later on it won't break anyone who wasn't expecting it as they will not have used the canonical_ attribute to begin with. This has already come up thanks to PEP 685 and how we tweaked Provides-Extra; pre-existing code would have been using raw_provides_extras but those ready to adopt the new requirements could start using canonical_provides_extras when they are ready to.
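
For illustration, a minimal sketch of that split for just the name field (attribute names beyond raw_name/canonical_name are hypothetical; the gist has the fuller version):

from packaging.utils import canonicalize_name

class Metadata:
    def __init__(self, *, raw_name: str) -> None:
        # raw_ attributes hold the field exactly as read from METADATA.
        self.raw_name = raw_name

    @property
    def canonical_name(self) -> str:
        # canonical_ attributes lazily normalize/validate on access.
        return canonicalize_name(self.raw_name)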

brettcannon commented 1 year ago

One other take on my comment above is to have a RawMetadata that just structures all the strings in a nice dataclass-like object and has methods for consuming core metadata. Then we can have a Metadata class whose constructor is:

class Metadata:
    def __init__(self, raw_metadata: RawMetadata, /) -> None:
        self.raw = raw_metadata

And then all production methods go on Metadata. We would lose the ability to assign to the raw data object underneath and have it clear out the canonical data (see all of the raw_ properties in my gist), but maybe that's a good thing since you shouldn't be manipulating raw data when there's a normalized version to validate with instead?

What I'm trying to do is serve the use case of someone building metadata from scratch and wanting to use the normalized objects as appropriate. I'm also assuming that constructing e.g. Version is expensive enough to want to cache appropriately, but also to support changing the raw version string as well. But maybe the assumption should be that you will only set the version once, either raw or by constructing a Version, and so worrying about synchronizing between the two forms beyond the initial assignment is unnecessary? Maybe it's better to assume you will gather all of the metadata upfront and then construct the metadata objects just once? Then we just have e.g. Metadata.version() to read from raw.version, and the method caches the resulting Version object?
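
A rough sketch of that last idea (assuming RawMetadata exposes the raw version string as a version attribute):

from typing import Optional

from packaging.version import Version

class Metadata:
    def __init__(self, raw_metadata: "RawMetadata", /) -> None:
        self.raw = raw_metadata
        self._version: Optional[Version] = None  # cache for the parsed version

    def version(self) -> Version:
        # Parse the raw version string once and reuse the cached Version after that.
        if self._version is None:
            self._version = Version(self.raw.version)
        return self._version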

dstufft commented 1 year ago

That’s basically what my branch does, except:

  1. The default constructor takes individual metadata values.
  2. There are extra constructors for creating from raw metadata or from JSON / RFC 822.

You can delay validation in Metadata until access. I think that ends up working so that people don’t have to touch raw metadata unless they need to, but it’s still there under the covers if needed.

brettcannon commented 1 year ago

The default constructor takes individual metadata values.

I thought about that for specific fields that have normalized value counterparts, but I was too lazy to think it through. 😁

There are extra constructors for creating from raw metadata or from JSON / RFC 822.

I didn't want to bother specifying all potential constructors; once again, lazy. 😁

brettcannon commented 1 year ago

If accepting normalized values post-creation isn't important, then I think I figured out a simple caching mechanism for using functions to access normalized values (e.g. a version(raw: RawMetadata, /) -> Version function).

https://gist.github.com/brettcannon/67dc464e1c838ca9dc5aa368168dae90

It's rather simple, keeps access lazy, and doesn't force validation unless one requests it. But as I said, it doesn't facilitate taking e.g. a Version instance and then adding it to a RawMetadata instance.

dstufft commented 1 year ago

That leaks memory doesn't it? The cache = {} never gets cleared and lives forever per function.

In my branch I had Metadata and RawMetadata, the idea being that you generally didn't want to interact with or think about RawMetadata, but if you needed access to the "raw" values, it was there. It cached the values, but it did so by caching them on the Metadata instance, so once that fell out of scope, so did the cache for that.

I think it's important that the primary API defaults to validating, even for things we don't yet know will need to be validated in the future.

That means that something like the raw_* and canonical_* split would need to have a canonical_* for every single metadata field, even if the canonical form is just a basic string for right now. Otherwise adding new validations requires code changes across the entire ecosystem, rather than just happening by default.

I also don't like that the raw_ and canonical_ split mix the raw and validated data in the same object, it feels messy but that's a personal thing.

This idea also means that treating RawMetadata as the primary API, with a series of functions you use to access the metadata in validated form, is more likely to see people inadvertently not validate metadata, because it's easier to just not call those functions than it is to call them.

When I was designing the API in my branch, I had a few design principles I had in mind:

  1. The primary API, and all of the defaults for it, should lead people to work entirely in fully validated metadata.
  2. The APIs should make it obvious whether you're working with validated or possibly invalid metadata.
  3. It should not be possible to emit invalid metadata without using a lower level API.
  4. Strictness of the above exists in a gradient, so ideally we allow varying shades of strictness.
  5. Whenever invalid metadata exists, it should never get silently thrown away and ideally generates an error if possible.

This led me to a design that looks like this:

For handling Raw Metadata (intended to have minimal validation, basically just serialization/deserialization):


from typing import Any, Dict, Tuple, TypedDict, Union


class RawMetadata(TypedDict, total=False):
    name: str
    version: str
    # etc

def parse_email(data: Union[bytes, str]) -> Tuple[RawMetadata, Dict[Any, Any]]:
    ...

def emit_email(raw: RawMetadata) -> bytes:
    ...

def parse_json(data: Union[bytes, str]) -> Tuple[RawMetadata, Dict[Any, Any]]:
    ...

def emit_json(raw: RawMetadata) -> bytes:
    ...

The intention here is that most people will not directly interact with this class or these functions, but they exist entirely to handle serialization and deserialization to a common "raw" data type. You can get it to emit invalid data by passing invalid data via RawMetadata, and it will happily read invalid data (though it does type check it).

The extra Dict[Any, Any] is used for "leftover" data, unknown fields, fields that didn't type check, etc.

Thus at this level you can see if we were able to fully serialize a given metadata file by doing something like:


raw, leftover = parse_email(data)
if leftover:
    raise ValueError("could not parse all of the data")

This leftover bit is in support of (5): most parsers for metadata just silently throw away extraneous data, including duplicate keys, which is just asking for confused-deputy-style attacks to happen.

On top of this then, is layered the primary API that most people are expected to work with, which looks like:


class Metadata:

    name: str
    version: str
    # etc

    def __init__(self, *, name: Optional[str] = None, version: Optional[str] = None, ...) -> None:
        ...

    @classmethod
    def from_raw(cls, raw: RawMetadata, *, validate: bool = True) -> Metadata:
        ...

    @classmethod
    def from_email(cls, data: bytes | str, *, validate: bool = True) -> Metadata:
        raw, unparsed = parse_email(data)

        # Regardless of the validate attribute, we don't let unparsed data
        # pass silently, if someone wants to drop unparsed data on the floor
        # they can call parse_email themselves and pass it into from_raw
        if unparsed:
            raise ValueError(
                f"Could not parse, extra keys: {', '.join(unparsed.keys())}"
            )

        return cls.from_raw(raw, validate=validate)

    def emit_email(self) -> bytes:
        ...

    @classmethod
    def from_json(cls, data: bytes | str, *, validate: bool = True) -> Metadata:
        ...

    def emit_json(self) -> bytes:
        ...

Now, the above is the API, not exactly how it's implemented, but I want to talk about the API here rather than the underlying implementation.

One of the core ideas behind the Metadata class is that it never emits invalid metadata, and you can never access an invalid metadata field through it. In fact, by default it ensures the entire metadata structure is valid up front, including validation that needs to span multiple fields.

It does however allow you to opt into a somewhat less strict validation where items get lazily validated upon access rather than up front, meaning that an invalid version doesn't prevent you from accessing an otherwise valid name.

So breaking down the API here:

The Metadata.__init__ exists for people to programmatically construct a Metadata instance without reading from an existing metadata file. It accepts all of the possible metadata fields as keyword arguments, and for any of them provided, it validates and sets the provided value. It does not deal with RawMetadata, because RawMetadata is a low level API that doesn't deal with validation, and if you're using Metadata you want validation.

The Metadata.from_raw exists as the main holder of logic for taking a RawMetadata and turning it into a Metadata, while each of Metadata.from_email and Metadata.from_json just acts as a wrapper around it that abstracts away the need for the end user to think about RawMetadata.

I think this separation is important here. One of the ideas behind this design is that by having RawMetadata act as an interim layer, we can implement/maintain serialization in isolation from validation and the general programmatic APIs. It means that future formats (YAML, SQLite, whatever) can be written without making changes to Metadata, and they don't even have to live inside of packaging itself, so it makes proposing new formats somewhat easier.

The small helper wrappers around Metadata.from_raw also make it easier for people to do the right thing with regards to the leftover data. Since the underlying parse_email and parse_json functions don't error out on leftover data, a properly strict metadata parser needs to check for it, and that's something end users can easily forget to do. The wrappers make it so that the easy path is also the strictest path.
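
To make that concrete, under the API sketched above the strict and lenient paths would look roughly like this:

# data is the raw METADATA bytes (or str) you want to read.

# Strict path: leftover data and invalid fields both raise.
meta = Metadata.from_email(data)

# Lenient path: you explicitly deal with leftover data yourself and
# defer per-field validation until access.
raw, leftover = parse_email(data)
if leftover:
    print(f"ignoring unparseable fields: {sorted(leftover)}")
meta = Metadata.from_raw(raw, validate=False)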

The biggest set of hidden functionality in the above API is accessing the metadata itself; from the outside the API looks something like this:


class Metadata:

    name: str
    version: str
    # etc

But I don't make those simple attributes for a few reasons:

  1. We want to allow people to assign to them BUT we want to hold the invariant that data assigned to a Metadata instance is always valid, so we need to validate it before we assign it.
  2. We want to allow lazy validation of attributes (though as mentioned above, not by default) which means that when we're lazily validating, we need a "hook" to validate before returning the data.

In Python that's pretty obviously something using the descriptor protocol, typically a property. You could implement the above like:


class Metadata:

    _raw: RawMetadata

    @property
    def name(self) -> str:
        _validate_name(self._raw["name"])  # Raises if invalid
        return self._raw["name"]

    @name.setter
    def name(self, value: str):
        _validate_name(value)  # Raises if invalid
        self._raw["name"] = value

    @name.deleter
    def name(self):
        del self._raw["name"]

    # etc

You could then implement the non lazy validation by just iterating over the attributes of Metadata and accessing each one of them in the from_raw method.
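
A hypothetical sketch of that eager pass (the field list and the validation itself are placeholders):

_FIELD_NAMES = ["name"]  # in practice: one entry per core metadata field

class Metadata:
    _raw: dict

    @property
    def name(self) -> str:
        value = self._raw["name"]
        if not value:  # stand-in for real validation
            raise ValueError("invalid name")
        return value

    @classmethod
    def from_raw(cls, raw: dict, *, validate: bool = True) -> "Metadata":
        self = cls()
        self._raw = raw
        if validate:
            # Non-lazy validation: touch every property up front; any invalid
            # field raises immediately instead of on first access.
            for field in _FIELD_NAMES:
                getattr(self, field)
        return self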

However, I thought that was super verbose, so I wrote an internal helper called lazy_validation which abstracts away the above so that it looks like:


class _ValidatedMetadata(TypedDict, total=False):
    metadata_version: str
    name: str

    # etc

class Metadata:
    _raw: RawMetadata
    _validated: _ValidatedMetadata

    name = lazy_validator(
        as_str,  # Ensures that the value is in fact a str
        validators=[
            Required(),  # Errors if the value isn't provided
            RegexValidator("(?i)^([A-Z0-9]|[A-Z0-9][A-Z0-9._-]*[A-Z0-9])$"),  # Errors if the value doesn't match the provided regex
        ],
    )

That way the Metadata class doesn't have to get bogged down in the details of managing the cache; that's all abstracted out.
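
A stripped-down guess at the shape of that descriptor (not the actual code in the branch):

class lazy_validator:
    def __init__(self, converter, *, validators=()):
        self.converter = converter
        self.validators = validators

    def __set_name__(self, owner, name):
        self.name = name

    def __get__(self, instance, owner=None):
        if instance is None:
            return self
        # Serve the cached, already-validated value if we have one.
        if self.name in instance._validated:
            return instance._validated[self.name]
        # Otherwise convert the raw value, run the validators, and cache it.
        value = self.converter(instance._raw.get(self.name))
        for validator in self.validators:
            validator(value)  # raises if invalid
        instance._validated[self.name] = value
        return value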

So to summarize, the design in my branch has a number of, in my opinion, really strong design points:

  1. When interacting with Metadata it's not possible to emit invalid metadata, nor to access invalid metadata, including as the meaning of "valid" changes in the future.
  2. The easiest "main" API, along with the defaults, provides the safest, strictest interpretation of metadata, requiring people to opt in to varying levels of strictness.
  3. End users have lower level details abstracted away from them, unless they need the power afforded to them by that lower level abstraction.
  4. Serialization and Validation are kept apart, enabling the use case of people who want the parsing to be as lenient as possible (besides just pulling out json.loads or email.parser themselves) while also making it easier because validation and serialization logic cannot get intertwined.

dstufft commented 1 year ago

I guess my biggest question here, is there something in the above API or implementation that you are uncomfortable with or that you feel could be done better?

The alternatives proposed thus far feel like they compromise on the design goals I had when I originally wrote that PoC without getting something in return besides being different, but I may be missing a trade off that you're trying to make here!

Also as an aside, we talked about it before, but I do not think we should try cramming pyproject.toml into this; while some of the metadata in pyproject.toml will eventually end up in the core metadata, pyproject.toml is not a core metadata file nor could it be. See the previous discussion.

pfmoore commented 1 year ago

I like the idea of explicitly returning "leftover" data, as that ensures that consumers make an explicit decision on what to do (which might of course be to simply ignore it).

My use case involves extracting metadata in bulk from PyPI projects for loading into a database for queries. For that, I have to be able to handle raw values - key examples of the sort of query I'm interested in are "how many projects have invalid versions?", or "what do non-UTF8 descriptions that exist in the wild look like?" The "leftover" data would allow me to query for "what projects include invalid/nonstandard metadata fields?"

I like Brett's approach of only validating "on demand" - it's important to me to be able to do things like analyze (valid) versions without caring if the dependency data is valid, for example.

I'm perfectly fine with an API that ensures that user-constructed metadata objects only contain valid values, but I'm concerned that in the process of enforcing that, we might lose the ability to read and manipulate potentially-invalid "raw" data. I don't have any real-world examples here, largely because this is the sort of thing I'd do as part of ad-hoc analysis, but I'm thinking of cases where I might read 1,000,000 raw metadata values (email format read from a database, for example), set the description value to None because I'm not interested in it and it is potentially large, and then write the resulting values out as JSON to a file for later processing. That requires the ability to round-trip invalid dependency data while modifying the in-memory Metadata objects.

Overall, I like the direction the discussion is going in, and I agree that the most important use cases should be:

  1. Reading (mostly-) valid data from serialised forms, and processing valid values, with errors for invalid raw values.
  2. Building new metadata objects in memory ensuring that everything is valid, and serialising those.

But I think we should be mindful of other use cases - packaging is intended to be the canonical library for implementing packaging fundamentals, and IMO that means being usable in all situations when people are working with packaging data.

I also agree that reading pyproject.toml and converting it to metadata is not the job of this module. I wrote a pyproject.toml reader for pkg_metadata, and it really didn't fit well with the rest of the API. A library to read and validate PEP 621 metadata from pyproject.toml is a perfectly reasonable idea, but it feels better to me as a separate module (which could be part of packaging, or could be a standalone project).

dstufft commented 1 year ago

FWIW, you get as much access to invalid data as you want:

The only case that really isn't handled afaict is:

The thing here is that in my branch, validating on demand is opt-in; by default, when you're using the Metadata class, you're going to get fully valid data. We could switch that default around, but validating everything upfront felt like a safer default to me.

[^1]: This includes things like Keywords deserializing into a list of strings, or project urls into a mapping. If those fail the data would be in leftover instead.

brettcannon commented 1 year ago

That leaks memory doesn't it? The cache = {} never gets cleared and lives forever per function.

Depends on your definition of "leak" I guess. 😉

You could then implement the non lazy validation by just iterating over the attributes of Metadata and accessing each one of them in the from_raw method.

However, I thought that was super verbose

True, but it's a one-time cost since we are not adding new core metadata fields often.

However, I thought that was super verbose, so I wrote an internal helper called lazy_validation which abstracts away the above

I don't want to re-implement pydantic, so I would want to keep the custom code for creating a custom descriptor under control as I can see us getting carried away.

I guess my biggest question here, is there something in the above API or implementation that you are uncomfortable with or that you feel could be done better?

I've been trying to avoid having two separate definitions of the core metadata, but it may not be helped in the end. You have a TypedDict and a class for representing raw and valid data, respectively, and I was seeing if there was a way to avoid it. I do understand wanting to avoid an API where one could be confused with whether they are working with raw or validated data. Perhaps having two very distinct APIs is the only way to really get away with that.

we talked about it before, but I do not think we should try cramming pyproject.toml into this, while some of the metadata in pyproject.toml will eventually end up in the core metdata, pyproject.toml is not a core metadata file nor could it be.

It was honestly just the easiest to implement. Take it more as a demo than a plan.

The thing here is that in my branch

Is your branch ready for review?

  1. This includes things like Keywords deserializing into a list of strings, or project urls into a mapping. If those fail the data would be in leftover instead

I would argue that's a validation step, and so not even raw metadata should be making that call.

dstufft commented 1 year ago

I don't want to re-implement pydantic

I don't have a super strong opinion on whether the lazy_validation helper is "too much" or whether we'd be better off manually implementing that per property. I obviously think that the trade off is worth it, because I wrote it to begin with, but if it goes away I won't be sad either. It doesn't affect the API, just the implementation so it can be changed at anytime.

I've been trying to avoid having two separate definitions of the core metadata, but it may not be helped in the end. You have a TypedDict and a class for representing raw and valid data, respectively, and I was seeing if there was a way to avoid it. I do understand wanting to avoid an API where one could be confused with whether they are working with raw or validated data. Perhaps having two very distinct APIs is the only way to really get away with that.

I personally think that it's easier to understand and to implement to have the Metadata and RawMetadata split as well as harder to accidentally use unvalidated data when you wanted validated, but the core ideas could be represented by a single class. If you did that, you'd of course lose the separation of concerns between serialization and validation.

It was honestly just the easiest to implement. Take it more as a demo than a plan.

👍

Is your branch ready for review?

It's not ready to merge; it's missing some things:

But it's ready for a preliminary review to see if that direction is a direction that we even want to go in, before myself or someone else puts in additional work dotting I's and crossing T's. I didn't want to invest additional time until we had some agreement that it was worthwhile to finish it up.

I would argue that's a validation step, and so not even raw metadata should be making that call.

The way I think about serialization is that its job is to take some arbitrary bytes and turn them into the conceptually correct primitive type (this rules out parsing to Version or something).

So if we look at something like project urls, that is conceptually a mapping of key to URL, however because the RFC 822 format doesn't support mappings natively, we had to implement a secondary, field-specific serialization on top of RFC 822 that lets us represent a mapping.

So that's why that deserialization lives in the raw layer IMO, and to support this, I'm not aware of any of the major tooling that has ever implemented project URLs as a list of specially formatted strings. IOW, human beings have always (or almost always) been working on the field as a mapping, and the tooling serialized it to a list of strings.

Keywords is similar, except it's been implemented as a free form text field for long enough by major tools that I do think you could argue that it's not serialization when human beings were expected to enter that data already "serialized".
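
For concreteness, the field-specific deserialization I'm describing amounts to roughly this (hypothetical helpers; the real code would route malformed entries into the leftover data rather than guessing):

def _parse_project_urls(values):
    # Each Project-URL value is serialized as "Label, https://example.com/...".
    urls = {}
    for value in values:
        label, _, url = value.partition(",")
        urls[label.strip()] = url.strip()
    return urls

def _parse_keywords(value):
    # Keywords is a single comma-separated string in the email format.
    return value.split(",")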

brettcannon commented 1 year ago

  • The proposed Metadata.to_(raw|email|json) functions, which should be tiny shims.

I don't think that needs to hold anything up. We were prepared to release with just the object definition and no support code. Honestly, just being able to read METADATA/PKG-INFO files will be a massive win for (hopefully) pip and other installers. We can then expand into METADATA production for builders later.

But it's ready for a preliminary review to see if that direction is a direction that we even want to go in, before myself or someone else puts in additional work dotting I's and crossing T's.

I can give it a review then!

The way I think about serialization is that its job is to take some arbitrary bytes and turn them into the conceptually correct primitive type (this rules out parsing to Version or something).

I agree; I think the question is what the "correct primitive type" is. In the specific cases you're suggesting, the format is at least extremely simple. But parsing of Project-URL, if you require a comma, can still fail. With keywords the parse can't fail; it will just end up being poorly split. So in my head, RawMetadata was to be something that simply could not fail based on the format of the metadata values (the format of the actual container of that data could still fail). And so for me, Project-URL was a potential failure point and thus a list of strings (albeit a small one). I assume that's why https://peps.python.org/pep-0566/#json-compatible-metadata only makes a special exemption for keywords, as parsing that data can't trigger an exception (and so I can get behind keywords being split). But I do think we need to come to an understanding on that, as this will be part of the documentation and the perpetual design of RawMetadata.

dstufft commented 1 year ago

Deserialization can always fail right? Like if you emit a JSON that looks like:

{
    "foo": "bar",
}

That's going to fail because the syntax is wrong, and likewise parsing RFC822 is fairly lenient, but even it can fail.

I don't see any reason why it's "OK" for {"name": "bar",} to fail, but not "OK" for {"project_url": ["some url without a key"]} to fail, and to be clear, "fail" doesn't mean raise an exception in RawMetadata, it means the value doesn't get deserialized into RawMetadata and instead gets emitted into the leftover data structure.

The same thing happens if you send a float instead of a string for a version number for instance.

brettcannon commented 1 year ago

I don't see any reason why it's "OK" for {"name": "bar",} to fail, but not "OK" for {"project_url": ["some url without a key"]} to fail

I view the JSON issue at a different layer than the data itself. Plus, I don't see parsing Project-URL for a label and URL as any different than trying to parse Version. With that view, my question becomes why does Project-URL get special treatment to be eagerly parsed in RawMetadata but Version doesn't when both have an expected structure that may or may not work?

pfmoore commented 1 year ago

I believe that the JSON format should follow the definition in PEP 566, which means that keywords is the only special case[^1], and project_url should be a list of strings.

But it's ready for a preliminary review to see if that direction is a direction that we even want to go in

I'm happy to review it (can you remind me where to find it?)

@brettcannon your code at https://gist.github.com/brettcannon/731ddd584bad01a5ee678d332a932041 only seems to have a from_pyproject method for loading data from external sources, at the moment. So I assume it's not ready for review (at least, in terms of the questions about parsing external data we're discussing here?)

[^1]: It's a single-use field in the metadata, and its single value in email format is a string, but in JSON format, and in the "raw" object, it's a list of strings.

dstufft commented 1 year ago

I view the JSON issue at a different layer than the data itself. Plus, I don't see parsing Project-URL for a label and URL as any different than trying to parse Version. With that view, my question becomes why does Project-URL get special treatment to be eagerly parsed in RawMetadata but Version doesn't when both have an expected structure that may or may not work?

A few reasons

  1. Version is not a primitive type, and RawMetadata only emits primitive types, but a dict[str, str] is a primitive type.
  2. Nobody actually thinks of or implements Project-URL as a list of strings, conceptually it is a mapping of str to str, but RFC822 doesn't support mappings, so an extra layer of serialization had to be added.
    • Every tool that I'm aware of handles project urls as a mapping, from setuptools to flit to pyproject.toml. The fact that Project-URL is a list of strings is an implementation detail of RFC822. If we were defining a JSON format from scratch, the most logical serialization of it would be as a mapping, not as a list of strings.
  3. Sort of an extension of (2), but the primary input format for project urls is a mapping, while the primary input format for version is a string. Thus generally an error in turning project-url into a mapping would be an error in how a project serialized data, whereas an error in turning version into a Version would be an error in the data that a user inputted.

dstufft commented 1 year ago

I believe that the JSON format should follow the definition in PEP 566, which means that keywords is the only special case, and project_url should be a list of strings.

RawMetadata is not "the JSON format", it's the programmatic format for deserialized, but not validated, data.

One could imagine adding a YAML or a TOML form for serializing this data, and using a mapping to handle project-url.

I'm happy to review it (can you remind me where to find it?)

https://github.com/pypa/packaging/pull/574

brettcannon commented 1 year ago

I assume it's not ready for review

Correct, my code was just a proposal.

  1. Version is not a primitive type, and RawMetadata only emits primitive types, but a dict[str, str] is a primitive type.

So is that what you would want the documentation to say as the guideline as to whether something gets any sort of parsing for RawMetadata? "Values are not validated, but when there is a simple, pragmatic representation of a value using Python's built-in types they will be used accordingly (e.g. a dict of strings)"? That way people won't ask for Provides-Dist to be a tuple of requirement and extra or something?

I think how to document the guideline we will follow is my sticky point in all of this. And to be clear, I'm after a guideline we can give users about how we will add typings going forward, and not a rule.

I believe that the JSON format should follow the definition in PEP 566, which means that keywords is the only special case, and project_url should be a list of strings.

RawMetadata is not "the JSON format", it's the programmatic format for deserialized, but not validated, data.

I also don't know how widely the JSON format from PEP 566 is used, so I'm not sure we should feel beholden to it regardless until we have explicit JSON serialization support.

dstufft commented 1 year ago

So is that what you would want the documentation to say as the guideline as to whether something gets any sort of parsing for RawMetadata? "Values are not validated, but when there is a simple, pragmatic representation of a value using Python's built-in types they will be used accordingly (e.g. a dict of strings)"? That way people won't ask for Provides-Dist to be a tuple of requirement and extra or something?

I think how to document the guideline we will follow is my sticky point in all of this. And to be clear, I'm after a guideline we can give users about how we will add typings going forward, and not a rule.

Yea I think something like that is the guideline I'd document.

Like I'd rule out Requires-Dist because it's conceptually a list of PEP 508 requirement strings; that's the user interface that is provided. Users aren't providing tuples of extras and requirements, they're providing a string.
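
Under that guideline the split would look roughly like this (illustrative only):

from packaging.requirements import Requirement

# In RawMetadata the values stay exactly as written in Requires-Dist.
raw_requires_dist = ["requests>=2.0", "rich; extra == 'cli'"]

# A validated layer parses them into Requirement objects on access.
requirements = [Requirement(value) for value in raw_requires_dist]
print(requirements[0].name, requirements[0].specifier)  # requests >=2.0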

pfmoore commented 1 year ago

I also don't know how widely the JSON format from PEP 566 is used, so I'm not sure we should feel beholden to it regardless until we have explicit JSON serialization support.

Maybe we should simply drop the conversions to and from JSON. If we're happy that the PEP 566 JSON format isn't actually used "in the wild", then let's drop it. We should just have "from_email" and "to_email" methods[^1] that read/write the email format (which is standardised), and leave everything else for users/3rd parties to write.

[^1]: I'm not 100% comfortable with the _email suffix. It suggests you can pass an email.Message instance, and it exposes an implementation detail of the format. But from_bytes_or_string is just clumsy. Is there a good naming convention for methods that take (or produce) bytes or string data?

dstufft commented 1 year ago

I'm fine dropping them, I included them both because the format was defined, and to validate the idea that the serialization would work for multiple formats.

I don't have a better name for those methods. I wouldn't want to use from/to bytes, because if we add json in the future that gets more confusing I think. Maybe from_rfc822 and to_rfc822 or something? I dunno.

pfmoore commented 1 year ago

The fact that the format is defined is what makes me hesitant. But I'd want to frame it as from and to a dict, and leave serialising the dict to the user. That would potentially be reusable for formats other than JSON, and there is some complexity in the dict <-> RawMetadata conversion, whereas serialising to/from JSON is covered by the json module (and 3rd party alternatives, if speed matters to you). Also, a dict format lets people use "unofficial" serialisations like YAML if they want.

The problem is, once we go from dict to RawMetadata, we have three in-memory formats (dict, raw metadata and parsed metadata), which is getting silly. But is it any more silly than people converting metadata to a dict by converting to in-memory JSON and deserialising the JSON? Which is what I considered doing for my own code...

Yeah _rfc822 is marginally better IMO, but only marginally. The problem is that we don't actually have a name for the format. "Metadata" is used as a general term, not specific to the file format. And really, "email" and "RFC822" are just inaccurate, as the format is actually a (mangled) subset of those formats. But I guess if method names are the worst problem we face, we've pretty much won 🙂

dstufft commented 1 year ago

To be clear, my branch at least doesn't expose any dict <-> RawMetadata conversion; it currently supports:

rfc 822 bytes <-> RawMetadata
json bytes <-> RawMetadata

It has some helper methods to let people skip the interim RawMetadata steps, and do:

rfc 822 bytes <-> Metadata
json bytes <-> Metadata

RawMetadata is "just" a TypedDict though, so it allows all of the same thing you've mentioned for dict, it's just we've got functions that handled the serialize / deserialization, because it's the only way to handle some logic (like the keywords special case).

If someone wants an unofficial serialization, they just write their own functions that add, say, yaml <-> RawMetadata, and then they can use that the same as they would anything else.
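
For example, a third-party YAML layer could be as small as this (assuming PyYAML; nothing in packaging provides or plans this):

import yaml  # third-party: PyYAML

_KNOWN_KEYS = {"name", "version", "keywords"}  # placeholder subset of RawMetadata keys

def parse_yaml(data):
    loaded = yaml.safe_load(data)
    raw = {key: value for key, value in loaded.items() if key in _KNOWN_KEYS}
    leftover = {key: value for key, value in loaded.items() if key not in _KNOWN_KEYS}
    return raw, leftover

def emit_yaml(raw):
    return yaml.safe_dump(dict(raw)).encode("utf-8")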

pfmoore commented 1 year ago

RawMetadata is "just" a TypedDict though, so it allows all of the same thing you've mentioned for dict

Oh cool. I didn't know that's how TypedDict worked. Off to read the manuals! 🙂

dstufft commented 1 year ago

TypedDict is "just" a dict, but in mypy it has special behavior (known keys, with known value types), but at runtime it's just a dict.

pfmoore commented 1 year ago

Ah, so it doesn't allow attribute-style access. OK, the definition syntax is misleading (to me) then, but fair enough.

brettcannon commented 1 year ago

Anyone know how to test for multi-part bodies? I copied this code from Donald's PR, but I can't figure out how to trigger the failure case:

https://github.com/brettcannon/packaging/blob/bd86a215cef8cd9f94a9692385ea3973e877cd85/src/packaging/metadata.py#L149-L151

pradyunsg commented 1 year ago

Passing in Message().attach(1234) should trigger that.

brettcannon commented 1 year ago

Passing in Message().attach(1234) should trigger that.

But is there any way to do it via some METADATA format? If there isn't, then maybe the code isn't necessary since the API only takes strings or bytes?

dstufft commented 1 year ago

Uhh, I probably added that because something on PyPI made it happen when I was testing it against PyPI data... but I don't remember what.

dstufft commented 1 year ago

At least, I don't think I would have gone out of my way to do that on my own.

FFY00 commented 1 year ago

Trying to serialize an object with a list payload, at least, doesn't work. It trips up in

https://github.com/python/cpython/blob/3ef9f6b508a8524f385cdc9fdd4b4afca0eac59b/Lib/email/generator.py#L238-L239

brettcannon commented 1 year ago

https://github.com/pypa/packaging/pull/671 covers RawMetadata and parse_email().

brettcannon commented 1 year ago

Raw metadata parsing just landed in main! Next is providing enriched/strict metadata.

dstufft commented 1 year ago

🎉

nilskattenbeck-bosch commented 1 year ago

We can also ignore non-standard fields (e.g. License-Files).

If PEP 639 gets accepted, this becomes a standard field.

brettcannon commented 1 year ago

@nilskattenbeck-bosch correct, but that hasn't happened yet (I should know, I'm the PEP delegate 😁). Once the field exists we will bump the max version for metadata and then update the code accordingly.

nilskattenbeck-bosch commented 1 year ago

May I also suggest adding a small information box to the documentation as to why the function is called parse_email. By now I have read through the corresponding PEPs and specifications and understand that the metadata is serialized just like email headers and parsed using that module, but at first it felt really unnatural, as if I were using the wrong function and it were leaking an abstraction detail. If other parse_FORMAT functions are introduced then this naming makes sense, but a note confirming that this is the correct function, and why it is named that way, would be reassuring.

brettcannon commented 1 year ago

May I also suggest adding a small information box to the documentation as to why the function is called parse_email.

I would go even farther and say you can propose a PR to add such a note. 😁

brettcannon commented 1 year ago

With my work for reading metadata now complete, I'm personally considering my work on this issue done. Hopefully someone else will feel motivated to do the "writing" bit, but as someone who has only written tools to consume metadata I don't think I'm in a good position to drive the writing part of this.

abravalheri commented 1 year ago

With my work for reading metadata now complete, I'm personally considering my work on this issue done.

Thank you very much @brettcannon for working on this.

Hopefully someone else will feel motivated to do the "writing" bit, but as someone who has only written tools to consume metadata I don't think I'm in a good position to drive the writing part of this.

There was at least one previous PR (https://github.com/pypa/packaging/pull/498, probably more) that tried to address the "writing" capabilities for metadata; however, these were closed, probably because they didn't fit the long term vision that the maintainers have for the API/implementation (which is a very good thing to have in a project).

Would it be possible (for the sake of anyone that intends to contribute with the "writing" part) to have a clear guideline on how we should go about it? (This question is targeted at all packaging maintainers, not only Brett 😝).

I am just concerned that throwing PRs at packaging without a clear design goal/acceptance criteria will just result in PRs getting closed and work hours being lost.

brettcannon commented 1 year ago

Would it be possible (for the sake of anyone that intends to contribute with the "writing" part) to have a clear guideline on how we should go about it?

I think the first question is what are the requirements for the feature to write metadata? Do you need to minimize diffs for METADATA output that you originally read from? Or should the output be consistent across tools (i.e., field order is hard-coded)?

The other question is whether tools will build up the data in a TypedDict via RawMetadata and then use Metadata to do the writing, or whether you make it so you build up the metadata in Metadata itself? And if you do the latter, do you do validation as you go, or as a last step before creating the bytes? I will admit I totally punted on this one based on how I coded Metadata: it currently lends itself to building up via RawMetadata, as the descriptor I used is a non-data descriptor and thus the caching does not lend itself to attribute assignment.

There's also the question of how much you want to worry about older metadata versions. Do you let the user specify the metadata version you're going to write out, and thus need to do fancier checks for what's allowed?

For me, I think consistency in output is more important than keeping pre-existing METADATA files from having a smaller diff (since long-term that will happen naturally). As for the API, I think that's up to people like you, @abravalheri , who author build back-ends since you will be the folks using any construction API. But what I will say is if you build up a RawMetadata dict then an API to generate the bytes is surprisingly straightforward to code up (define the descriptors in the order you want them written out, iterate over Metadata for all instances of _Validator, get the values, and then introspect on the results to know how to write out the format appropriately for strings vs lists vs dicts).
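
As a rough illustration of how little code that writing step needs once the data is in a RawMetadata-shaped dict (the field list, ordering, and metadata version below are placeholders, and real output would have to special-case Description and Keywords):

def emit_email(raw):
    # Placeholder mapping of RawMetadata keys to core metadata field names,
    # in the order they should be written out.
    fields = [
        ("name", "Name"),
        ("version", "Version"),
        ("summary", "Summary"),
        ("requires_dist", "Requires-Dist"),
        ("project_urls", "Project-URL"),
    ]
    lines = ["Metadata-Version: 2.1"]
    for key, header in fields:
        if key not in raw:
            continue
        value = raw[key]
        if isinstance(value, dict):    # project_urls: back to "label, url" strings
            lines.extend(f"{header}: {label}, {url}" for label, url in value.items())
        elif isinstance(value, list):  # multi-use fields become repeated headers
            lines.extend(f"{header}: {item}" for item in value)
        else:
            lines.append(f"{header}: {value}")
    return ("\n".join(lines) + "\n").encode("utf-8")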

dstufft commented 12 months ago

I'd also mention that everything I've done with METADATA has primarily been consuming them not emitting them, but my intuition is that it should have these properties:

dstufft commented 12 months ago

Oh ugh, emitting probably also has to answer the newlines in fields question, and possibly the other issues I raised with the spec earlier up thread.

dstufft commented 11 months ago

Just a note, I have a PR up now to Warehouse (https://github.com/pypi/warehouse/pull/14718) that switches our metadata validation on upload from a custom validation routine implemented using wtforms to packaging.metadata.

The split between Metadata and RawMetadata was super useful, since we (currently) have the metadata handed to Warehouse as a multipart form data on the request, I was able to make a custom shim to generate a RawMetadata from that form data, and then just pass that into the Metadata.from_raw().
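
Roughly, the shim is shaped like this (form field names are made up; the real code is in the Warehouse PR linked above):

from packaging.metadata import Metadata, RawMetadata

def metadata_from_form(form) -> Metadata:
    # Map the upload form fields into a RawMetadata dict, then let
    # Metadata.from_raw() do all of the validation in one place.
    raw: RawMetadata = {
        "metadata_version": form["metadata_version"],
        "name": form["name"],
        "version": form["version"],
    }
    return Metadata.from_raw(raw)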

I'm still giving it a thorough set of manual testing, but so far it looks like, other than a few bugs/issues that fell out (https://github.com/pypa/packaging/issues/733, https://github.com/pypa/packaging/issues/735), integrating it was relatively painless. Of course the real test will be when it goes live and the wide array of nonsense that people upload starts flowing through it.

We're not using the actual parsing of METADATA files yet (though that is planned), so this is strictly just the validation aspect of the Metadata class that we're currently using.

pradyunsg commented 11 months ago

Do you think it might make sense to keep both the old code and the new code paths running for a bit on Warehouse's end, with the results of the old code path being returned?

i.e.

def upload(...):
    try:
        new_result = new_upload(...)
    except Exception:
        new_result = ...
    old_result = old_upload_handler(...)
    if new_result != old_result:
        log.somehow(old_result, new_result)
    return old_result