Plans for `packaging.metadata`

pypa / packaging

Core utilities for Python packages

https://packaging.pypa.io/

Other

611 stars 244 forks source link

Plans for `packaging.metadata` #570

Open brettcannon opened 2 years ago

brettcannon commented 2 years ago

[x] Parse "raw" metadata (https://github.com/pypa/packaging/pull/671)
[x] Parse "strict" metadata
[ ] Emit "strict" metadata

brettcannon commented 2 years ago

@abravalheri @pfmoore when you say you want more lenient handling, what does that mean to you? I think we are going to want to use exception groups so that all issues can be gathered and reported to the user instead of just the first error (@pradyunsg should we take a dependency on exceptiongroup or have our own ExceptionGroup shim class?). That leads to a possible solution where there's an option of returning the exception group so the caller can choose how to handle the problems (with all the relevant data saved to the exceptions). What we could then do is something like:

@overload
def from_email_headers(..., raise_exc=True) -> Self: ...

@overload
def from_email_headers(..., raise_exc=False) -> (Self, ExceptionGroup): ...

In the end the only code necessary to accommodate this is some branching at the end of the function:

if raise_exc:
    if exception_group:
        raise exception_group
    else:
        return metadata_
else:
    return metadata_, exception_group

I contemplated only having the lenient return type, but I realized that's not a great API if you forget to raise that exception group if that's what you truly want compared to getting a TypeError that there are not enough values to unpack for metadata_, exc_group = Metadata.from_email_headers(..., raise_exc=True).

As for what exceptions are returned within the exception group, I was thinking of all of them being InvalidMetadata which stores the core metadata name, value, and the reason why it failed (either a string or setting __cause__ as appropriate, e.g. InvalidSpecifier).

BTW I am side-stepping if there would be such a thing as an unrecoverable error no matter what, like leaving out the metadata version, project name, etc.

pfmoore commented 2 years ago

It means that I want to be able to parse metadata with (for example) legacy versions and not have an InvalidVersion error raised, but instead get the value as a string. Returning exceptions rather than raising them is of no value for my use cases.

For me, there are two parts to parsing metadata (from email, for example):

Parse the email format into name-value pairs for the various metadata fields. At this stage, the "value" is a simple string value (or list of such values).
Validate the values which have a defined format, and convert them into structured types as appropriate.

I consider "lenient" parsing to be just doing step 1, so that all possible values can be processed, regardless of validity. Strict parsing involves doing step 2 as well, and rejecting invalid values.

I tend to think that "lenient" parsing is important, on the basis of the "be lenient in what you consume and strict in what you produce" principle. When reading a file, you want to do whatever you can to get from the raw data to at least a name-value structure, so you can do more detailed processing on a simple but structured representation. Maybe you'll want to reject invalid values, maybe you'll want to spot them and report them, maybe you'll want to analyze patterns in what sort of deviations from valid values are present. The point is, being able to get the raw data into an internal form is key here. Before you write data, though, you want it to be in a valid form (usually - there are use cases for simply reading and re-writing data and allowing invalid values to pass through unmodified, but they are specialised). So you store the data in a validating internal data structure, that catches any dodgy values as early as possible, and then write it out knowing you're only writing valid data to the external form.

Being able to read data from an external form, and then convert it to the validating internal structure (with errors being flagged at conversion time) may well be the more common pattern when working with a single metadata file. But when working with metadata files in bulk, the pattern is (IMO) much more likely to be "read in bulk without validation, then validate in bulk".

My pkg_metadata project is focused on only doing step 1, with the dictionary format being the "unvalidated key-value" data structure. I would still happily contribute it to packaging, with step 2 then being about converting the dictionary format to the Metadata class. And probably some convenience functions to do steps 1+2 together, for people who don't need the separation. But if you don't agree with my split into steps 1 and 2, I can keep pkg_metadata as a separate project for people who need that extra flexibility (I'll probably add something to do step 2, converting to the packaging Metadata class once it has stabilised, in that case).

abravalheri commented 2 years ago

@abravalheri @pfmoore when you say you want more lenient handling, what does that mean to you?

I agree with Paul's definition of not raising an error. It would make sense to parse whichever fields are possible on a best effort basis and populate the parsed information into the structured object. A tuple with an exception group also make sense, so a build-backend for example could use it to produce its own warnings.

In my mind, some examples of things that are not correct but can be parsed on "a best effort basis" are:

"long text" fields (e.g. Description) when the indentation does not strictly follow the " " * 7 + "|" rule (e.g. when the indentation is " " * 8).
"outdated" forms of a field's key or value, when the metadata version is inconsistent, e.g.:
- ~PEP 314 => PEP 345 renaming, such as: Requires => Requires-dist~
- Requires-Dist using a parenthesised version without operator (e.g. dep (1.2))
Multi-line Summary (we can simply join multiple lines when parsing the value).

We can also ignore non-standard fields (e.g. License-Files).

(I tried to capture some of this "permissiveness" on the implementation of https://github.com/pypa/packaging/pull/498. Probably not the best code in the universe, but please feel free to steal if you find something useful).

merwok commented 2 years ago

(Just to avoid confusion: Requires from PEP 314 is not just an old name for Requires-Dist, because it expected an importable name for value, not a project name. So these are really two separate fields, one obsolete (can be discarded) and one current.)

pfmoore commented 2 years ago

To the list of "things that are not correct but may not be a problem", can I add

Invalid values (e.g., incorrectly formatted specifiers) in any field, when you're not interested in that value. Very few applications in my experience actually care about the actual value of (for example) the "Author" and "Author-Email" metadata read from an existing distribution, but wanting to copy that data unaltered (or even just to display it unaltered!) is a fairly reasonable thing to hope for.

brettcannon commented 2 years ago

Returning exceptions rather than raising them is of no value for my use cases.

Fair enough, but the exceptions would still cause you to have access to the invalid data via their raw strings. So I think this is coming more down to you want strings all the time while I want to provide structured data where possible. I acknowledge both approaches have legitimate uses, and so I will stop bugging you about this. 🙂

A tuple with an exception group also make sense, so a build-backend for example could use it to produce its own warnings.

👍

We can also ignore non-standard fields (e.g. License-Files).

I honestly wasn't planning to strictly check non-standard field names anyway. The biggest chance of failure here is something packaging itself can't parse (e.g. the dep (1.2) example). But hopefully returning the incorrect strings for introspection for tools decide what extra effort they want to put in is flexible enough.

pfmoore commented 2 years ago

Fair enough, but the exceptions would still cause you to have access to the invalid data via their raw strings.

Maybe. But my core use case looks something like this:

structured_metadata = {filename: msg_to_metadata(get_metadata_bytes(filename) for filename in list_of_filenames}

With a string-only ("phase 1 only") API, that "just works". Validation can happen in a later stage, when processing a particular file. How would you do that with an API that only does the full conversion to a validated structure, returning exceptions when needed, in such a way that I can later write the data out unmodified?

The sort of use case I have in mind is processing a bunch of wheels off PyPI (using HTTP range requests to only pull the METADATA file out of the wheel and not download the whole lot, running parallel downloads for speed), processing the metadata to report "interesting statistics", and then saving the structured data to a database (where it'll all be serialised back to strings) for further future analysis. That structured_metadata dictionary is the in-memory data structure I'd use for that processing and then to serialise from.

Anyway, like you say, both approaches have legitimate uses, so we can agree to differ. I'm mostly just a little sad because you will have to internally read the metadata into strings before converting to types like Version and Requirement, and yet you prefer not to expose that intermediate state meaning that people wanting string-only values have to reimplement that code. But I have that reimplemetation in pkg_metadata, so there's no more extra work now anyway.

pradyunsg commented 2 years ago

We don't need to have a single object represent both "raw" and "validated + parsed" metadata, FWIW.

Provide a "low level" parse function, that just does the email.FeedParser stuff, from the raw underlying string/bytes and provides a value (either the Message as is or a different class, StringValuedMetadata or something, that's aware of keys and does a lookup on attribute access mirroring the Metadata class's attributes).
Metadata will perform the validation during initialisation. It would also grow a classmethod to allow initialising through that in addition to being able to be initialised from the raw string/bytes as well. We'll keep the exception groups as well.

This would cover both use cases, at the cost of an additional layer of indirection which is necessary to deal with the two sets of user needs here.

That's a design if we want to solve both use cases in a single module. I think it's worthwhile, but I won't make that call unilaterally of course. :)

should we take a dependency on exceptiongroup

Meh... It depends on flit_scm for its own build, which in turn uses setuptools_scm and flit_core. I'm ambivalent on this because it's gonna mean that we have redistributors feel more pain and, possibly, enough of it that they come complaining about it. :)

pfmoore commented 2 years ago

This would cover both use cases, at the cost of an additional layer of indirection which is necessary to deal with the two sets of user needs here.

That was my suggestion, with the "json-compatible dictionary" format specified in the metadata PEP as the "raw" form.

brettcannon commented 2 years ago

that just does the email.FeedParser stuff

So just somewhat of a one-off for email headers, or would we eventually want it for pyproject.toml as well? I guess the code can guide us as to whether it's easier to normalize to JSON for everything and have Metadata take that in to do the conversion to the appropriately types and do validation.

either the Message as is or a different class, StringValuedMetadata or something

I would assume we would have a TypedDict defined for the JSON dict format.

When we were originally discussing dict vs. data class, I was thinking "one or the other," not "both". That does simplify the story and structure of the overall approach by normalizing to JSON as the low-level input for Metadata, but at the cost of more code to manage the same data, just in different formats. But since we don't exactly change the core metadata spec that often the duplication shouldn't be burdensome.

This would cover both use cases, at the cost of an additional layer of indirection which is necessary to deal with the two sets of user needs here.

That was my suggestion, with the "json-compatible dictionary" format specified in the metadata PEP as the "raw" form.

Does that work for you as well, @abravalheri ? I would assume we would ditch the overrides on Metadata.from_email_headers() in preference to this two-step process of getting from METADATA to Metadata.

@pfmoore might actually get what he wants. 😉

abravalheri commented 2 years ago

Does that work for you as well, @abravalheri ? I would assume we would ditch the overrides on Metadata.from_email_headers() in preference to this two-step process of getting from METADATA to Metadata.

It is hard to give a informed opinion before working with the API, so please take the following with a grain of salt:

I don't mind having to do more steps to use the API, but I would say that for the use case I have in mind[^1], from_XXX(..., raise_exc=False) doing "best effort" parsing and returning a tuple with a (potentially incomplete) structured metadata object is what I am really looking for.

I don't think a build backend fits in the category of consumers that want to copy the data unaltered (or even just to display it unaltered). We can consider that a backend want to parse the PKG-INFO into an structured object, use this object for the build and then use the API to create a valid METADATA file...

So if the only way of not running into errors is to use the a raw TypedDict, the backend would have to re-implement the logic that converts the TypedDict into an structured metadata object...

A minor comment about a previous topic of discussion:

The biggest chance of failure here is something packaging itself can't parse (e.g. the dep (1.2) example). But hopefully returning the incorrect strings for introspection for tools decide what extra effort they want to put in is flexible enough.

If there are limitations on what packaging can parse, that is fine. An backend using this API can decide to ignore all the fields that are not essential for a --no-binary installation (e.g. author, summary) and try to do something about the fields that are essential (e.g. requires-dist) based on the raw value in the ExceptionGroup. The important part is that a (potentially incomplete) structured data object is returned, so the backend does not have to re-implement the entire conversion logic.

By the way, if packaging cannot parse Requires-dist: dep (1.2), this also means that this API will not be able to parse Metadata-Version <= 1.2, right?

[^1]: a build backend parsing a PKG-INFO file from an old sdist to produce a wheel to be used by the installer. I am assuming this is necessary to support pip install --no-binary or when no wheel is available in the package index for an specific operating system/processor architecture.

brettcannon commented 2 years ago

I don't mind having to do more steps to use the API, but I would say that for the use case I have in mind1, from_XXX(..., raise_exc=False) doing "best effort" parsing and returning a tuple with a (potentially incomplete) structured metadata object is what I am really looking for.

OK, so you would appreciate the exception group with all the failed metadata still being returned.

So given email headers as the input, we have:

Get me the data out of email headers, but don't process anything (Paul)
Do your best to process the data into some structure, but give back the left-overs (Anderson)
Be strict about the data (... no one?)

Maybe I have been thinking about this the wrong way by trying to generalize everything such that every metadata format should be treated the same instead of just treating every format as unique (and to hell with the complexity of it all)? Thinking about the formats and reading/writing them, you have:

pyproject.toml (strict read, no write)
PKG-INFO (loose read, strict write)
METADATA (loose read, strict write)
JSON/metadata.json (??? read, strict write)
Direct object creation (strict read in the sense of typing will make sure you don't mess it up)

So maybe from_email_headers() should only return (Self, dict[str, str]) with as much processed data as possible in Self and a dictionary with everything that wasn't used/usable (with the keys all normalized to lower-case)? I don't know what we would want to do with data-related errors like something in Dynamic that actually shouldn't be there in the PKG-INFO case (let it through, return it in the dict?). But since the tools reading PKG-INFO/METADATA seem to want leniency, it might be best to just embrace it and let tools decide if they care whether the left-overs dict wasn't empty?

By the way, if packaging cannot parse Requires-dist: dep (1.2), this also means that this API will not be able to parse Metadata-Version <= 1.2, right?

It can parse dep (>=1.2) just fine; the lack of version specification is the issue.

>>> from packaging import requirements
>>> requirements.Requirement("dep (>=1.2)")
<Requirement('dep>=1.2')>
>>> requirements.Requirement("dep (1.2)")
Traceback (most recent call last):
  File "/tmp/temp-venv/lib/python3.9/site-packages/packaging/requirements.py", line 102, in __init__
    req = REQUIREMENT.parseString(requirement_string)
  File "/tmp/temp-venv/lib/python3.9/site-packages/pyparsing/core.py", line 1141, in parse_string
    raise exc.with_traceback(None)
pyparsing.exceptions.ParseException: Expected string_end, found '('  (at char 4), (line:1, col:5)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/tmp/temp-venv/lib/python3.9/site-packages/packaging/requirements.py", line 104, in __init__
    raise InvalidRequirement(
packaging.requirements.InvalidRequirement: Parse error at "'(1.2)'": Expected string_end
>>> requirements.Requirement("dep (==1.2)")
<Requirement('dep==1.2')>

dstufft commented 2 years ago

PyPI would want strict validation FWIW

The way I'd implement this is that you have a variety of functions that read a file and turn it into a "standard" shaped dictionary with primitive keys / values. No validation would be done at this point so it's completely YOLO, the only different between this and a hypothetical json.loads() is that any differences in serialization are papered over so that all of the different metadata methods return a "standard" dictionary. Call these the parse_FORMAT function, so like parse_json, parse_email, parse_toml, etc.

So, they'd have a function signature that looked something like:


from typing import TypedDict, Optional

class ParsedMetadata(TypedDict):
    name: Optional[str]
    # Everything is optional, everything is primitive types.

def parse_FORMAT(content: str) -> ParsedMetadata:
    pass

There are a few subtle things here that I would suggest:

If a field is missing, or otherwise unable to be read from the raw format (see next point) they are just omitted from the returned dictionary rather than generating some kind of error.
Part of papering over different serializations will be parsing types, for instance if Project-URL is chosen to be a dictionary, then pase_email will need to convert ["Key, URL"] into {"Key": "URL"} [^1].
Extra fields that we don't know about are omitted from the returned value [^2].

Then a classmethod can be added to the existing Metadata class, like:

class Metadata:
    # ...

    @classmethod
    def from_parsed(cls, parsed: ParsedMetadata) -> Metadata:
        pass

This method consumes the parsed value, validates all of the metadata, then returns the validated Metadata object.

I wouldn't actually have the raises keyword, I would just have this method always raise, but attach the incomplete result to the raised exception. That satisfies the same use cases but avoids a boolean parameter and the need to deal with overloading return types.

I think this has a lot of nice benefits:

People who want something that will do as little parsing as possible, but still provide some semblance of a unified type, if you don't care about a unified type, then you're best just reading the files directly [^3].
We cleanly separate concerns from parsing the file contents and massaging them into a unified shape from validating the primitive types and turning them into more complex, specific objects.
- This has the added bonus that new formats can be implemented/tested outside of packaging or even custom formats.
We handle all 3 "levels" of validation that people have expressed an interest in.

For someone that wants the strict validation, they would ultimately call this like:


meta = Metadata.from_parsed(parse_email(content))

We could offer some shortcut methods, that do:


class Metadata:

    @classmethod
    def from_email(cls, content: str) -> Metadata:
        return cls.from_parsed(parse_email(content))

Which then gives people the option for the best of both worlds.

[^1]: This means that technically some malformed metadata fields will be unable to be read using these methods since they'll fail to parse and be omitted. We should strive to make our parsing as lenient as we reasonably can (for instance, the Project-URL example, make sure we work in cases where the URL or key is missing, the comma is missing, etc to produce something). If these lenient methods aren't lenient enough, then you'll have to implement it yourself (but the layering here means if your implementation emits a ParsedMetadata it can be used in the other methods). [^2]: We could allow this; my biggest concern would be that we can't enforce any kind of typing in it which may be a footgun for consumers of this API. Maybe an option would be a _extra key that is just dict[str, Any] or have the parse_FORMAT return two values, ParsedMetadata and the left over items from the dict. [^3]: This is because there is no such thing without massaging the data to have a return type that returns the same data for every serialization format. Formats like email are going to treat everything as strings, or list of strings, but a format like json or toml will support other primitive types, so they can't return str unless we cast them to string, which would be silly.

pfmoore commented 2 years ago

Thanks @dstufft - this is exactly the model I've been arguing for all along (and I agree, conversion to metadata should always raise - which I think is what @brettcannon might mean when he says "strict", and I think he believes I object to...)

I've been thinking some more about this, though, and I think that @brettcannon has a point when he says maybe the logic should depend on the input. From my point of view:

The pyproject.toml format should always be validated in practice. I still think the "parse to unvalidated" -> "classmethod to construct with validation" separation is logically cleaner, but I don't see a use case for parsing pyproject.toml without validating, so in practice I don't see anyone using just one without the other. The pyproject format is modern, and anyone entering data in that form will be expecting a "modern" experience, including validation to the latests standards.
The metadata.json format is a slightly odd one, because it's already designed for machine processing. In pkg_metadata I don't even need a parser for it, because my "unvalidated" format is the json-compatible dictionary. But if we go with @dstufft's idea of an unvalidated ParsedMetadata class, we'd need to convert. But either way, the only processing that adds value (IMO) is validating - the raw form loaded with json.load is sufficient for unvalidated processing. Which I think in @brettcannon's terms means that I support validating the JSON format...
The killer is the email format. This is sufficiently tricky to process correctly that having a library that does it is immensely valuable. But it's been used for so long that there's a lot of data out there that is invalid[^1] by today's standards, and being able to read that is still important for backward compatibility if nothing else. So I think that separating raw parsing and validation is important for this case.

There's also another assumption that I've been making, which I think should be made explicit. The Metadata class itself should be validated. At the moment, it isn't - it has type hints, but there's no runtime validation on field content. And I don't think we should assume that everyone uses a type checker, or that it's OK to omit validation because fields have type annotations.

So, for example, I believe the following code:

m = Metadata("foo", Version("1.0"))
m.requires_python = 3.7

should raise an exception. I don't think there should be any choice here, the attribute should be validated on assignment. The same goes for m = Metadata("foo", "1.0", requires_python = 3.7) or m = Metadata("foo", "1.0", requires_python = "invalid_specifier).

This is important because there will be tools that want to accept direct user input for metadata fields, not from a file format, but from things like command line options (build_wheel --name foo --version 1.0 --requires_python 3.7). Those tools should not be able to construct incorrect Metadata objects by forgetting to validate their inputs (and as I said, we should not assume that all callers will run a type checker).

I'd also point out (although it's somewhat off topic here) that I believe that m = Metadata("foo", "1.0") should work. That is, I think that we should accept string values and convert them to the appropriate classes, using the constructor for that class. It's not just the convenience of omitting the constructor, it's the fact that you don't even need to import the class yourself. While I'd expect careful applications to do the conversion themselves (to control error handling), for adhoc scripts and use in the REPL, passing strings is just way more convenient[^2].

[^1]: One aspect of "unvalidated" here is that not all legacy email format input is encoded in UTF-8. So a parser should take care to accept arbitrary non-UTF-8 content in fields. This is impossible in full generality, but it is possible to support round-tripping, either by falling back to latin-1 (which can accept any byte value) or using the surrogateescape handler.

[^2]: This is somewhere I think type hints are lacking - what I want is to say "can be assigned anything that can be converted to a Version, but I'll do the conversion so the attribute itself is always a Version". Rust has the into and from traits to express this (if I understand them correctly) but Python can't do this easily.

dstufft commented 2 years ago

I think that even for cases like the hypothetical parse_json function, there's still value in having that function over just json.loads, even if ultimately, it's a thin wrapper over json.loads. The biggest win I think we get from that is normalization into the ParsedMetadata dictionary.

The benefit of that is most obvious for the email parser, because the email format isn't sufficiently expressive enough to have all the primitive types we might use natively, so we need to do some extra parsing. However, the same could be true for a "good" serialization format like a hypothetical JSON.

Let's say for instance that ParsedMetadata decided that keywords should be set[str], JSON doesn't have a way to represent a set, so the parse_json function would translate the list[str] to set[str] (I'm assuming we'd make the sensible choice to represent a set as a list in json).

Another maybe less contrived example is that it's possible that we (purposely, or accidentally) end up with different key names in different serialization formats.

The underlying thing is that if we try to say that we don't need parse_json because json.loads() gets us something close to ParsedMetadata, we're basically saying that we expect every serialization format we might ever support to happen to emit things in exactly the same way, which I don't think is good assumption to make.

I also think that while obviously the biggest problem with "bad" metadata is going to be the email based fields because of our history here, it seems silly to assume that tools aren't going to have bugs that generate bad metadata that 10 years from now we'll end up wanting a lenient parser for, for all the same reasons we have today. Particularly since the cost of doing so seems to be pretty minimal? And since we'll want it for email metadata anyways, if we don't implement it that way, we're forced to be inconsistent in how we implement different serialization types.

I'm personally kind of sketch on the idea of using these APIs for pyproject.toml. My concern is that while some of the data there is the same, I think that conceptually pyproject.toml is a different (but related) thing all together, which I think is further supported by the fact that they're two different specifications on packaging.python.org and those specifications have different rules for validation [^1].

If we are going to combine pyproject.toml and METADATA into a singular class, then I don't think that class can actually validate the metadata, because it needs to know what rules it's operating under in order to do that. Which I would probably expose two additional functions:


def validate_as_pyproject(meta: Metadata) -> bool:
    ...

def validate_as_core_metadata(meta: Metadata) -> bool:
   ...

Which implements the specific logic for the two different rules.... but really I think that pyproject.toml and the various METADATA flavors should just be sperate types.

An interesting thing about this all, is that we can add a method to Metadata that turns a Metadata instance into a ParsedMetadata instance, then have functions to emit that back into the underling serialization format. Doing that it may make more sense to rename ParsedMetadata to something like RawMetadata.

Putting this all together, my suggestion would be:


from typing import TypedDict

class RawMetadata:
    ...  # Typed so that everything is optional, and only using primitive types.

class Metadata:
    ...  # This is typed so that required things are required, optional things are optional.

    @classmethod
    def from_raw(cls, raw: RawMetadata) -> Metadata:
        ...

    @classmethod
    def from_email(cls, data: bytes) -> Metadata:
        raw, _leftover = parse_email(data)
        return cls.from_raw(raw)

    def as_raw(self) -> RawMetadata:
        ...

    def as_email(self) -> bytes:
        return emit_email(

def parse_email(data: bytes) -> (RawMetdata, dict[str, Any]):
    ...

def emit_email(raw: RawMetadata) -> bytes:
    ...

This gives us tons of flexibility, sensible layers to our implementation, and still easy to use functions for people to use when they don't care about those layers.

[^1]: For instance, pyproject.toml allows version to be specified as a dynamic value, but nowhere else is that actually allowed, and for practical reasons, if you've got a set of bytes that are a pyproject.toml, you don't have a good way to even know if it contains this metadata without first parsing it, since the [project] table is optional. So, my assertion here is that pyproject.toml is a different thing from METADATA and friends, one is the input, and one is the output, but they should re-use a lot of the same primitives (Version, etc).

brettcannon commented 2 years ago

PyPI would want strict validation FWIW

It's worth a lot! 🙂

The way I'd implement this is that you have a variety of functions that read a file and turn it into a "standard" shaped dictionary with primitive keys / values.

That's what I was thinking for a time yesterday, and that's what I'm considering in the email header case. I don't want to make it universal, though, as other formats like pyproject.toml already have some things structured appropriately and having to redo it seems like a waste (e.g. author details).

If a field is missing, or otherwise unable to be read from the raw format (see next point) they are just omitted from the returned dictionary rather than generating some kind of error.

Yep, as that's how you get permissible/loose management of the data.

Part of papering over different serializations will be parsing types, for instance if Project-URL is chosen to be a dictionary, then pase_email will need to convert ["Key, URL"] into {"Key": "URL"}

That does go against what's in PEP 566.

What I'm more concerned about is any of the structured data in pyproject.toml which doesn't map out well to what METADATA would produce without parsing it. Is that clean information to be tossed out, converted to a string to be parsed again? Supported in multiple ways in the JSON dict and you isinstance() to know how it's structured?

Extra fields that we don't know about are omitted from the returned value 2. We could allow this; my biggest concern would be that we can't enforce any kind of typing

Those keys would actually not even be typed as they wouldn't be represented by the TypedDict. As such, they can go in and you would have to actively work around the typing to get at them. Otherwise, as you said, they can get shoved somewhere that has very loose typing of str | list[str].

I don't think we should assume that everyone uses a type checker, or that it's OK to omit validation because fields have type annotations.

There would very likely be a validation step on output.

An interesting thing about this all, is that we can add a method to Metadata that turns a Metadata instance into a ParsedMetadata instance, then have functions to emit that back into the underling serialization format.

What I have in my head is email headers get converted to JSON, and then we work with JSON for Metadata. For pyproject.toml, it stays a separate concept for Metadata.

email headers -> JSON -> Metadata
JSON -> Metadata
pyproject.toml -> Metadata

Essentially we use JSON as the lossy input format and work with pyproject.toml directly as the strict input format.

dstufft commented 2 years ago

That's what I was thinking for a time yesterday, and that's what I'm considering in the email header case. I don't want to make it universal, though, as other formats like pyproject.toml already have some things structured appropriately and having to redo it seems like a waste (e.g. author details).

I'm confused what you would be redoing if it's already structured appropriately?

If the result of json.loads() is already in the same format as RawMetadata, then your parse_json function basically just looks like this:

def parse_json(data: bytes) -> (RawMetadata, dict[str, Any]):
    raw: RawMetadata = {}
    leftover: dict[str, Any] = {}

    for key, value in json.loads(data).items():
        if key in KNOWN_FIELDS:
            raw[key] = value
        else:
            leftover[key] = value

    return raw, leftover

You're not redoing anything there, you're just spliting the keys up into raw or leftover, and if it's decided to just leave the unknown keys inside the RawMetadata instead of splitting them out, then your parse function is as simple as:

def parse_json(data: bytes) -> RawMetadata:
    return json.loads(data)

Is it kind of silly to have this extra function call? Yea slightly, but it provides better typing but more importantly it keeps things consistent. There are no special cases here. 10 years from now, if we all decide that JSON is lame and we want to use FSON, but FSON doesn't have quite the same capabilities as JSON, we can easily just drop in a new parse function, parse_fson, and the API still feels natural. We don't have to have this weird "convert everything to some python representation of deserialized json of what we thought would be easy to convert 20 year old data serialized as email headers into" lurking around forever.

More importantly, I think there's a fundamental assumption here that the input to json.loads() is "correct" because it's from a file called metadata.json (or that pyproject.toml is correct because it's called pyproject.toml), and in my example the parse_json functions would still be doing some very basic validation.

For example, this is probably more realistically what that function would look like:


def parse_json(data: bytes) -> (RawMetadata, dict[Any, Any]):
    raw: RawMetadata = {}
    leftover: dict[Any, Any] = {}

    parsed = json.loads(data)
    if not isinstance(parsed, dict):
        raise ValueError("data is not a json mapping")

    for key, value in parsed.items():
        if key in KNOWN_FIELDS and isinstance(value, KNOWN_FIELDS[key]):
            raw[key] = value
        else:
            leftover[key] = value

    return raw, leftover

That function will:

Give good error messages in the presence of things that aren't actually mappings being passed into it.
Will actually make sure that the RawMetadata returns is typed correctly, so that mypy will work better to prevent typing mismatches.
Will put any leftover data (including fields that weren't typed correctly) into a separate data structure that is also typed correctly (even though it's with Any, because we don't know anything else about it).

That does go against what's in PEP 566.

Well part of the idea here is that RawMetadata doesn't have to match what the on-disk format is. We're not writing something that you pass directly to json.loads(), it's our own format, and we can do what makes the most sense programmatically here.

RawMetadata isn't specified in a PEP after all, it's just an intermediate format that allows us to cleanly separate the concerns of getting data that is serialized in multiple possible formats (including future formats), from validation and constituting them as "full" objects.

What I'm more concerned about is any of the structured data in pyproject.toml which doesn't map out well to what METADATA would produce without parsing it. Is that clean information to be tossed out, converted to a string to be parsed again? Supported in multiple ways in the JSON dict and you isinstance() to know how it's structured?

As I mentioned above, I don't think pyproject.toml should be a concern here, because it's a different thing than METADATA and metadata.json, it has different validation rules and is just fundamentally an input that is passed into a build tool, that ultimately outputs a core metadata, and as part of that it's made up of pieces that are equivalent to core metadata, but it itself is a distinct thing.

That being said, pyproject.toml is a useful example for why I think "just use the JSON" is the wrong path to go here. It uses different data structures, different names, etc, from any of the other pieces of the core metadata spec, and as such you can't pass the result of toml.loads() and json.loads(), even if they are both 100% correct, into the same function without doing some massaging.

For example metadata.json has project_url: list[str], and depending on how you reference it, pyproject.toml either has a dict named project, with a key named urls: dict[str, str] or a key named urls: dict[str, str]. Either way, it's going to require massaging that data.

Now if you end up agreeing that pyproject.toml is a wholly different thing, that's great. But I still think this "parse into a RawMetadata" is something we need for every format, because there's nothing special about pyproject.toml in the above example (the reasons it should be a separate thing is higher level). It's just a serialization format that happened to map the underlying data in a slightly different way than the other serialization formats did, and unless we commit to the idea that every serialization format has to look exactly like metadata.json does today when naively deserialized [^1], then every other serialization format is going to require massaging.

Which raises the questions:

Do we expose the intermediate format to end users so they can implement lenient parsing [^2]?
Why tie our intermediate format to metadata.json, just to special case JSON, and not even JSON, but specifically a "what's the minimal way we can cram email headers into JSON" format" when we can do better for basically no cost?
Why assume that the only bad data that we'll want lenient parsing for analyzation or to work in more scenarios we'll ever encounter is in the METADATA format? People are equally as good at producing bad JSON or bad TOML or bad any serialization format we pick.

[^1]: Which I think would be a shame, when defining how to serialize our metadata into a new serialization format, we should strive to produce an output that feels like it belongs in that serialization format. I think it's about as lame to define a JSON standard that feels like it's just email headers, but with different syntax, as it is to write Python, but like you're a Java programmer. [^2]: For example, the pyproject.toml example that massaging still exists, it's just buried in the pyproject.toml -> Metadata function.

brettcannon commented 2 years ago

I'm confused what you would be redoing if it's already structured appropriately?

Let's take author details as an example. In METADATA, it can come in via Author or Author-Email. Both are strings. Otherwise it's a bit messy how they will be structured. They can also only occur once.

Now consider authors from pyproject.toml. The name and email are already structurally separated. There can multiple entries.

How do you reconcile those two differences of the data's structure in your RawMetadata format that you would want to use as an intermediary format? Both can be put into RawMetadata without any potential failure in processing, but they are coming in from different structures where how they come into Metadata could be useful. If you lean into the greatest common denominator (GCD) approach, then I guess you could reformat authors into a format that makes sense for Author-Email, but that's obviously unfortunate if we have to go that way as Metadata structures author_emails into two-item tuples.

Now if you end up agreeing that pyproject.toml is a wholly different thing, that's great. But I still think this "parse into a RawMetadata" is something we need for every format, because there's nothing special about pyproject.toml in the above example (the reasons it should be a separate thing is higher level).

I think this is where I'm having a hard time understanding the balance of the complexity of what I would assume would need to be the greatest common denominator or have every which way to specify data from every format and just be a monster of an object (in terms of structure).

Why assume that the only bad data that we'll want lenient parsing for analyzation or to work in more scenarios we'll ever encounter is in the METADATA format? People are equally as good at producing bad JSON or bad TOML or bad any serialization format we pick.

It's not the only bad one, just the worst one. 😉

I think the thing we can all agree on is people have various uses for getting the data out of bytes and into a format that can be programmatically accessed without any data loss somehow.

What Donald and Paul seem to be advocating for is an intermediate format that seems to be some dict or data class that holds the data in the most flexible way possible to separate out the reading part from the parsing/processing part. I seem to be advocating for skipping the intermediate stage and going straight to Metadata with some way to let the caller know what data wasn't ultimately used. That seem like a fair assessment?

dstufft commented 2 years ago

I think the thing we can all agree on is people have various uses for getting the data out of bytes and into a format that can be programmatically accessed without any data loss somehow.

I think I would reword this to with minimal data loss. I think ultimately the scale of nonsense garbage that can get shoved in these files is sufficiently unbounded that there's no way to write a general-purpose parser that has no data loss.

Let's take author details as an example. In METADATA, it can come in via Author or Author-Email. Both are strings. Otherwise it's a bit messy how they will be structured. They can also only occur once.

Now consider authors from pyproject.toml. The name and email are already structurally separated. There can multiple entries.

Ultimately you just pick something and run with it, I think?

The authors/maintainers field here is interesting because while pyproject.toml is an easier to understand data format, it's actually less powerful than what the core metadata allows.. which is fine for a restricted input format, but that means that pyproject.toml can't represent otherwise valid metadata [^1].

If I were modeling that, I would represent it like this (completely untested) [^2]:

import tomllib

from email.parser import BytesParser
from email.headerregistry import AddressHeader
from typing import TypedDict, Optional, Union

class RawPerson:
    name: Optional[str]
    email: Optional[str]

# Core metadata spec supports the RFC From header, which has support
# for named groups of email addresses, it's "great".
#
# Maybe it would make more sense to ditch ``RawGroup``, and just add
# an optional "group" field on RawPerson that's just a string of the group
# name, if there is one?
class RawGroup:
    group: Optional[str]
    persons: list[RawPerson]

class RawMetadata:
    authors: Optional[list[Union[RawPerson, RawGroup]]]

def parse_email(data: bytes) -> tuple[RawMetadata, dict[Any, Any]]:
    raw: RawMetadata = {}
    leftover: dict[Any, Any] = {}

    parsed = BytesParser().parsebytes(data)

    for key, value in parsed.items():
        if key == "Author":
            # This sucks, but the core metadata spec doesn't really give us any
            # format that this value is in, so even though the example gives an
            # email, we can't really do anything about it and just have to treat
            # it as a string.
            raw.setdefault("authors", []).append(RawPerson({"name": value}))
        elif key = "Author-Email":
            # This API is gross, maybe there's a better one somewhere in the stdlib?
            address = {}
            try:
                AddressHeader.parse(value, address)
            except Exception:  # No idea what errors it can return, seems like a bunch of random ones
                leftover[key] = value
            else:
                authors = raw.setdefault("authors", [])
                for group in address.get("groups", []):
                    if group.display_name is None:
                        for addr in group.addresses:
                            p: RawPerson = {}
                            if addr.display_name:
                                p["name"] = addr.display_name
                            if addr.addr_spec:
                                p["email"] = addr.addr_spec
                            authors.append(p)
                    else:
                        rg = RawGroup({"group": group.display_name, "persons": [})
                        for addr in group.addresses:
                            p: RawPerson = {}
                            if addr.display_name:
                                p["name"] = addr.display_name
                            if addr.addr_spec:
                                p["email"] = addr.addr_spec
                            rg.persons.append(p)
                        authors.append(rg)

    return raw, leftover

# If we must...
def parse_pyproject(data: bytes) -> tuple[RawMetadata, dict[Any, Any]]:
    raw: RawMetadata = {}
    leftover: dict[Any, Any] = {}

    parsed = tomllib.loads(data)

    # Going to arbitrarily decide that the rest of the toml file doesn't
    # matter, just the projects key.
    if "projects" not in parsed:
        raise ValueError("No metadata")

    for key, value in parsed.items():
        if key == "authors":
            # An interesting question is whether a single entry in a list like this should fail
            # parsing the entire list, or should we be greedy and parse as much as possible?
            # 
            # I'm going to fail parsing the entire key, because if their is bad data, we don't know
            # what else is broken.
            authors = []
            invalid = False
            for entry in value:
                if isinstance(entry, dict) and ("name" in entry or "email" in entry):
                    p = RawPerson()
                    if "name" in entry:
                        p["name"] = entry["name"]
                    if "email" in entry:
                        p["email"] = entry["email"]
                    authors.append(p)
                else:
                    invalid = True
                    break

            if invalid:
                leftover["authors"] = value
            else:
                raw["authors"]

    return raw, leftover

[^1]: I swear I'm not trying to sound like a broken record here, but I really do think that we're going to save ourselves a lot of pain if we don't try to cram pypyroject.toml into the same box as the core metadata. It was intended for a different purpose than the core metadata, and it has different rules and it can't even fully represent everything that core metadata can (which is fine! It's designed as a more restrictive input format, not as a replacement for the core metadata). [^2]: BTW, writing this in a GitHub comment instead of a real editor was a mistake. It's easy to forget how much a real editor does for you.

dstufft commented 2 years ago

I'm going to sketch up a PR here, and see how things turn out!

dstufft commented 2 years ago

Something that I noticed while implementing this:

PEP 566 says this about converting metadata into json:

It may be necessary to store metadata in a data structure which does not allow for multiple repeated keys, such as JSON.

The canonical method to transform metadata fields into such a data structure is as follows:

The original key-value format should be read with email.parser.HeaderParser;

All transformed keys should be reduced to lower case. Hyphens should be replaced with underscores, but otherwise should retain all other characters;

The transformed value for any field marked with “(Multiple-use”) should be a single list containing all the original values for the given key;

The Keywords field should be converted to a list by splitting the original value on whitespace characters;

The message body, if present, should be set to the value of the description key.

The result should be stored as a string-keyed dictionary.

However, the packaging specifications say this:

Although PEP 566 defined a way to transform metadata into a JSON-compatible dictionary, this is not yet used as a standard interchange format. The need for tools to work with years worth of existing packages makes it difficult to shift to a new format.

but more importantly, it says this about the keywords field:

A list of additional keywords, separated by commas, to be used to assist searching for the distribution in a larger catalog.

Note: The specification previously showed keywords separated by spaces, but distutils and setuptools implemented it with commas. These tools have been very widely used for many years, so it was easier to update the specification to match the de facto standard.

From what I can see, the specifications haven't explicitly altered how metadata.json is intended to be produced, so in theory the algorithm in PEP 566 is still canonical, but given keywords have been changed to be comma delimited, we should probably just ignore that part of the PEP and do what the specifications say to do for keywords.

dstufft commented 2 years ago

Next fun fact!

You can't actually use BytesParser, BytesFeedParser, or message_from_bytes to parse METADATA files, because METADATA files are explicitly utf8, and all of those classes are implemented such that they actually just wrap their string based counter parts with a hard coded .decode("ascii", "surrogateescape") attached to the incoming bytes.

You can of course just decode the bytes yourself as utf8 and pass them into Parser etc, but then you lose leniency to be able to attempt to parse something out of those files.

abravalheri commented 2 years ago

How about message_from_bytes(content, EmailMessage, EmailPolice(max_line_length=math.inf, utf8=True))?

merwok commented 2 years ago

(s/Police/Policy/)

dstufft commented 2 years ago

Nope.

The message_from_bytes function is a thin wrapper around ByteParser.parsebytes.

The ByteParser class is entirely implemented as a wrapper around FeedParser, with hard coded calls to decoding the bytes with ascii + surrogateescape.

dstufft commented 2 years ago

Of course, if you don't care about being lenient, you can just do message_from_string(binary_content.decode("utf8"), policy=email.policy.compat32).

That will correctly parse the METADATA file as UTF8, but one of the things Paul mentioned wanting was the ability to not fail if the METADATA file is mojibaked, but the one header you need is in a valid encoding which you can't do if you're decoding the entire content at once.

pfmoore commented 2 years ago

pkg_metadata includes the dance you need to do to read bytes with arbitrarily encoded text, if you’re interested.

dstufft commented 2 years ago

Unless I missed it (which I'm very tired at the moment, so it's possible!) it just calls message_from_bytes, which does the wrong thing even on valid data?

pfmoore commented 2 years ago

Sorry I’m on mobile so I’ll have to check the details later, but I think you want the sanitise_header function which tries utf8 then latin1 on unknown encoded data.

dstufft commented 2 years ago

No worries, I'll have to look at it tomorrow anyways, my brain is dead for the night and can't stand staring at the email module anymore tonight :)

https://github.com/pypa/packaging/pull/574 here's what I have so far as a PoC, not sure if it actually works yet, haven't even ran it, that's for tomorrow :P

dstufft commented 2 years ago

Oh I see, the email module returns a Header object instead of a str if it detects a surrogate escape. That feels like a terrible API and I don't see it actually documented anywhere, but it feels less fiddly than what I was doing.

dstufft commented 2 years ago

So that works for header, but valid Description data isn't being handled properly FWIW:

>>> pkg_metadata.bytes_to_json("Name: ☃\n\n☃\n".encode("utf8"))
{'name': '☃', 'description': '���\n'}

It looks like if you do msg.get_payload(decode=True), you'll get the raw bytes though, then you can do whatever you want.

pradyunsg commented 2 years ago

FWIW: https://github.com/pypa/installer/blob/6c3118d04e9a279f8f5b972ba797387451c7a6b4/src/installer/utils.py#L83

>>> parsed = installer.utils.parse_metadata_file("Name: ☃\n\n☃\n")
>>> list(parsed.items())
[('Name', '☃')]
>>> parsed.get_payload()
'☃\n'

Edit: NVM me -- we're thinking about the case where the encoding of the file is wonky and we're pulling in bytes.

dstufft commented 2 years ago

Yes, that only works because you've already decoded the bytes as utf8, so the email module doesn't do any munging.

One of the goals of lenient parsing here is that one stray byte doesn't cause the entire METADATA file to be completely unparseable, as it would if we just decoded the entire bytes up front.

brettcannon commented 2 years ago

Ultimately you just pick something and run with it, I think?

Yeah, I think it's the YOLO structure that my brain that is tripping it up.

I swear I'm not trying to sound like a broken record here, but I really do think that we're going to save ourselves a lot of pain if we don't try to cram pypyroject.toml into the same box as the core metadata.

So are you saying not even attempt to convert pyproject.toml to Metadata? And if so, are you then suggesting a separate chunk of code to get it to PKG-INFO / METADATA since that's where the data is ultimately ending up? Or are you saying "don't try to go straight to core metadata, instead go to the intermediary that I'm suggesting"?

2. BTW, writing this in a GitHub comment instead of a real editor was a mistake. It's easy to forget how much a real editor does for you.

https://vscode.dev / https://github.dev to the rescue. 😉

dstufft commented 2 years ago

So are you saying not even attempt to convert pyproject.toml to Metadata?

Yes.

There is no way to generically convert an arbitrary pyproject.toml to a valid set of core metadata, because the build tool, whatever it is, necessarily has to get involved.

For instance, let's take a look at the version field. In the core metadata spec, Version is a mandatory field and it MUST NOT be dynamically defined. However, in pyproject.toml the version field is perfectly capable of being defined as dynamic, and the intention is that when the build tool builds the project, it will resolve the version to a static value, and include it in the core metadata as a static value.

So given a pyproject.toml like:

[project]
name = "foo"
dynamic = ["version"]

Could not possibly be generically turned into a METADATA file without invoking the build tool.

In the same vein, but even deeper, the pyproject.toml spec allows omitting the [project] table completely, in which case all fields are dynamic.

Another mismatch is that the pyproject.toml spec has several fields that are paths to files relative to the pyproject.toml location, but which the build tool is intended to read those files, then emit the contents of those files into the core metadata, not the file paths themselves. Examples of this are license.file and the readme key.

The pyproject.toml also contains things which aren't part of the core metadata at all (though you could argue they should be), but the various entrypoint keys all end up in a currently non-standard file.

PEP 639 additional mismatches between pyproject.toml and the core metadata. It added the license-files key, but that has to be transformed by the build backend into a number of License-File keys + adding additional files into the resulting .dist-info directory.

Now that being said, one of the benefits of pyproject.toml is that, in certain circumstances, you can parse out bits and pieces of the data that will eventually become the core metadata, such that if you only care about a few fields, and you're able to cope with the fact that you can't always do you, you can get some idea of what the core metadata might look like. However, you can't get a valid core metadata from an arbitrary pyproject.toml (nor should we even attempt to make that a goal).

So in my head, the core metadata and the pyproject.toml metadata (what I'm going to start calling the "build metadata" from now on in this comment) are two distinct types, and it just so happens that they share some common attributes, and the build metadata, combined with a possibly arbitrary set of steps via the build backend, is used in ultimately producing the core metadata.

I can imagine there being some sort of build metadata class that would be a useful introduction to packaging, and maybe some utility functions to aid in validating it, and pulling what information is able to be pulled out in a generic fashion.

However, I think if we try to cram them into the same class, we're quickly going to end up with one class that does a bad job representing both the core metadata AND the build metadata, instead of two classes that do good jobs of representing their respective data.

dstufft commented 2 years ago

I will say though, that I think that a build tool is likely always going to be working directly with the pyproject.toml file in a way that I can't imagine our utility functions being a particularly huge win for them. I also think that the use case of trying to pull "what you can" out of the pyproject.toml is something that the exact definitions of how much you care about being accurate will have huge implications, and it will be hard to write generic code that doesn't boil down to being a tiny wrapper over tomllib.loads()... so I'm not actually sure there's a ton of value in packaging trying to mess with pyproject.toml much at all, except maybe by providing an extensible validator? But even then we could probably just write that as a jsonschema.

dstufft commented 2 years ago

More nonsense[^2] (we maybe want to update the spec or write a PEP or something I don't know)!

RFC 822 has no real mechanism for header values to contain \n, so all of our metadata except description [^1] cannot cleanly contain new lines.

RFC 822 does have a mechanism that sort of allows new lines in a header value, which is called "folding". The idea is that if a header value is particularly long, you could take a message like:

MyHeader: This is a really long line, well not really but we will pretend.
AnotherHeader: A short one!

The Body of my message.

and flow MyHeader over multiple lines like so:

MyHeader: This is a really long line, well
    not really but we will pretend.
AnotherHeader: A short one!

The Body of my message.

Functionally the way this works is that when parsing an RFC 822 message, header lines that start with a space are thought to be continuations of the last line, so header names can never start with a space, and when emitting a header, you can continue the header onto the next line by introducing a line break somewhere that you have a space and continuing the next line with a leading space.

Obviously, those two headers, before and after folding are not equal, the first one has a value of "This is a really long line, well not really but we will pretend." but the second one has a value of "This is a really long line, well\n not really but we will pretend.".

That wouldn't be a particularly great feature if whatever value was being stored there was not capable of coping with extraneous new lines and extra white space, so RFC 822 has a mechanism for "unfolding" the header value to get the original value back, which is basically to run re.sub(r"\n\s+", " ", folded) to collapse a new line and multiple spaces back into a single space, which allows round tripping these folded values back into the original value.

Now with that background, we have the core metadata spec which generally does not make any claims about whether a new line is a valid value in any of these fields or not, but in practice we've nothing has ever prevented them. Distutils has a rfc822_escape function, which essentially turns "\n" into "\n" + 8 " ", but that's relying on the above "folding" feature which isn't meant for escaping.

That means that a value like "Line#1\nLine#2\nLine#3", after the "escaping" would turn out to be "Line#1\n Line#2\n Line#3", which if you attempted to use the standard RFC822 unfolding, would turn that into "Line#1 Line#2 Line#3", which would be wrong. To correctly unescape this you'd to replace the fold characters with new lines instead of a single space.

Unfortunately, RFC 822 allows any value to be folded, whether they have new lines or not, so we have no way to tell if a value got "folded" or "escaped" to know how to translate it back.

To make matters worse, from what I can tell there is basically no uniformity in how tools are emitting these values.

Some tools are taking a string and just emitting it regardless of what value is in it, which means that they'd produce a malformed METADATA that can't be correctly parsed if any field has a stray \n in it.

Other tools are escaping some of the fields, but not others.

This affects the Description spec too, since it's technically allowed to be emitted as a header instead of as the message body (there is no new line problem with the message body), and even worse, the PEPs have been inconsistent in how they handle this escaping.

PEP 241, 314 (just uses new line escapes):

Description: This module collects votes from beagles
             in order to determine their electoral wishes.
             Do NOT try to use this module with basset hounds;
             it makes them grumpy.

PEP 345, packaging.python.org (uses a pipe character[^3])

Description: This project provides powerful math functions
        |For example, you can use `sum()` to sum numbers:
        |
        |Example::
        |
        |    >>> sum(1, 2)
        |    3
        |

This bar escaping has it's own problems, in that if anything as unfolded the message before you've gotten ahold of it yourself, you're incapable of undoing the escape. That isn't a problem for Python itself, but if anyone else attempts to read these files it may be?

That's kind of a moot point though, because as far as I can tell, nobody has ever implemented the bar escaping thing, and everybody is just either emitting actually broken METADATA files[^4] or files using the distutils style RFC822 "escaping", and since the email parser doesn't implement the unfold, they're just getting different data out of the METADATA than was put into it, which is, IMO, subtlety broken METADATA files [^5].

Ultimately, I think the best way to handle this is to amend the spec to disallow new lines in any field but Description, and requiring the Description field to be emitted as the message body, and to recommend that parsers that are parsing old data should attempt to work with the folded values and implement the textwrap.dedent() unescaping.

[^1]: Since we can't have nice things, this is only kind of true. When the description is included as the message body in the METADATA file it supports new lines fine. When we try to use it as header field (which is nominally supported) then we run into the new lines problem. [^2]: This issue gets to document all the terrible things I'm finding while implementing this PoC I guess. [^3]: This solves another problem in that RFC 822 unfolding doesn't require each folded line to the same indentation level, so if someone unfolded the naive "rfc822_escape" approach, all leading whitespaces would be removed from the source material, removing any interior indentation that may exist. [^4]: A blank line is the mechanism RFC 822 uses to end parsing headers and move onto the body, so if you emit say a License value with This is My License\n\n and the thing that's writing the METADATA hasn't escaped the value, then any remaining headers will get prepended as raw text onto the body. [^5]: Then you have tools that do actually attempt to implement the "unescaping" by using textwrap.dedent(), but that requires using it on lines[1:] and not the first line, because the first line doesn't get escaped. Of course, that unescaping only works if you're able to get the raw value without something unfolding it for you.

brettcannon commented 2 years ago

So are you saying not even attempt to convert pyproject.toml to Metadata?

Yes.

There is no way to generically convert an arbitrary pyproject.toml to a valid set of core metadata, because the build tool, whatever it is, necessarily has to get involved.

OK, I'm convinced. That means having a header -> JSON step makes much more sense to me now.

I will say though, that I think that a build tool is likely always going to be working directly with the pyproject.toml file in a way that I can't imagine our utility functions being a particularly huge win for them.

Hopefully the structure of pyproject.toml is good enough to work with directly. 😅

we maybe want to update the spec or write a PEP or something I don't know

If things aren't clear on packaging.python.org then yes.

I think the best way to handle this is to amend the spec to disallow new lines in any field but Description

That will probably affect License the most (although I do agree with your assessment).

requiring the Description field to be emitted as the message body

👍

recommend that parsers that are parsing old data should attempt to work with the folded values and implement the textwrap.dedent() unescaping.

as far as I can tell, nobody has ever implemented the bar escaping thing

Who should we check with to see how they are doing Description today? Is this when we ask @pfmoore to use his magical metadata stash to see if there's use of |?

I think we are simply stuck with old descriptions being messed up for raw text output and lean into the fact that Description is almost always rendered as reST or Markdown and thus newlines vs spaces isn't as critical as long as two newlines in a row are kept somehow.

dstufft commented 2 years ago

For what it's worth, one of the projects doing the textwrap.dedent() thing is the wheel library, so I suspect it's pretty safe?

We could also just not do any unwrapping, when parsing it'll just get emitted as the "folded" (so new lines with indents) value, and we can leave it to the caller to figure out what the heck to do with it. That seems pretty reasonable to me too. The spec should maybe just note the problem with historical data, and say that for new data, this should DRT.

brettcannon commented 2 years ago

We could also just not do any unwrapping, when parsing it'll just get emitted as the "folded" (so new lines with indents) value, and we can leave it to the caller to figure out what the heck to do with it.

I like that idea. Whether it's going to be worth the effort will depend on the user of the data, so let's not waste the CPU cycles if there's a decent chance that part of the metadata could get ignored in the end by the user of the data.

pfmoore commented 2 years ago

Hmm, there's a lot to read here, and I haven't had the time to digest everything yet (and @dstufft is making discoveries faster than I can keep up with 🙂)

Some high level points for now:

The metadata spec is very vague. It only really defines how to read metadata, and that is by reference to the email module in the stdlib.
The only reference to writing metadata is under the Description field, which covers the "vertical bar" escaping format. I believe this is correctly handled by the email module on reading, but I'm not sure (see below).
Nobody, to my knowledge, uses that mechanism anyway. I've no idea when it was added to the spec, but it seems to have been completely ignored by tools. Tools appear to follow a mechanism from distutils, which is to add 8 spaces after any newline. This preserves newlines, but messes up indentation unrecoverably.
Regardless of what we decide, any practical parser will have to read what's currently written, as best it can. Ignoring all existing metadata is IMO a non-starter.
If we want to rewrite the spec to be precise, I'd strongly suggest abandoning the email format and going for something more robust, like JSON (there was an attempt at moving to metadata.json at one point, but it never really got anywhere). There's no need for the file to be human editable, or even particularly human readable. But that would me a major change for the ecosystem.

We could also just not do any unwrapping, when parsing it'll just get emitted as the "folded" (so new lines with indents) value, and we can leave it to the caller to figure out what the heck to do with it.

For parsing existing data, this seems reasonable. For writing, we need to decide how we should output data containing newlines. This is awkward, as the email module itself is apparently broken in this regard:

>>> mm = email.message_from_string(m.as_string())
>>> mm["foo"]
'Hello\n   world'
>>> m = email.message.Message()
>>> m["Foo"] = "Hello\nworld"
>>> m.as_string()
'Foo: Hello\nworld\n\n'
>>> mm = email.message_from_string(m.as_string())
>>> mm["foo"]
'Hello'

I can think of 3 options:

Perpetuate the "add 8 spaces after newlines" approach, on the basis that it's not our job to invent new behaviour here, we should be implementing the existing (de facto) standard. This is what I chose for pkg_metadata, on a "least bad" basis.
Use the "vertical bar" approach. This is technically more spec-compliant (although only for Description) but it isn't parsed correctly by the email module, which means that it will probably be parsed incorrectly by all existing tools (and also means that the spec contradicts itself, for what that's worth 🙁)
Something else. But anything I can think of is going to be a spec change that affects interoperability in a non-trivial way, and we have to be prepared to properly propose and agree such a change. Again, I don't think it's our job to invent new behaviour here, so if we want to go that route, we'd need to put packaging.metadata on hold while someone goes through the process of changing the standard...

None of these options are particularly good, IMO.

To demonstrate the parsing problem mentioned in (2):

>>> email.message_from_string("Foo: Hello\n       |world")["Foo"]
'Hello\n       |world'

Sure, consumers could probably recover the original value from that, but ugh.

dstufft commented 2 years ago

My example would be to just require (or we could recommend I guess) that tools do not emit values with new lines at all, except for description, where we require (or recommend) emitting as body.

dstufft commented 2 years ago

Tools like pip etc would obviously have to put up with the "bad" style forever, but if the spec mandates emitting new metadata like that, then we can require it in Warehouse too, which stops the bleeding at least.

pfmoore commented 2 years ago

That might work, yes. We might find people have been providing values for "Summary" that aren't a single line, but that's OK. The "License" field explicitly gives a multi-line example, but if the PEP for license values gets sorted, that will hopefully fix that one. "Author" and "Maintainer" have multi-line examples, but they seem to just be folded versions of what's logically just a long line, so fixing the examples should be enough.

I'd be OK with changing the spec like that. It would need to be agreed on the Packaging discourse, and if anyone objected to treating it as a clarification it would need to go through the PEP process, but if we can agree on this, it would hopefully introduce some sanity.

But unless we want to tackle multi-line "License" fields, we couldn't make this change until the license values PEP is done...

pfmoore commented 2 years ago

Tools like pip etc would obviously have to put up with the "bad" style forever

Which (just to come back on-topic) means IMO that packaging probably needs to support reading the "bad" style, too.

dstufft commented 2 years ago

Yea I agree, at least at the "lenient" layer, if not the whole way through.

brettcannon commented 2 years ago

5. If we want to rewrite the spec to be precise, I'd strongly suggest abandoning the email format and going for something more robust, like JSON (there was an attempt at moving to metadata.json at one point, but it never really got anywhere). There's no need for the file to be human editable, or even particularly human readable. But that would me a major change for the ecosystem.

One of the key motivators for me in doing this work is to make generating the metadata.json file so easy that we can start encouraging build tools to include it alongside METADATA. Then we can eventually require metadata.json next to METADATA. And then some time way off in the future, maybe we can stop requiring METADATA.

njsmith commented 2 years ago

I'm coming in late and only skimmed this, but a few thoughts I wanted to put in:

To me, the whole idea of metadata.json seems... worse than what we have now? Yes, the weird rfc822-but-not-quite format is an insult to all right-thinking people, but we're stuck supporting it; adding another format just means now we have two problems.

And really, the current format is... fine? Like it's gross, but it's easy enough to make it do all the things we need it to do. Worst case if we need more structured fields, we can define new fields whose values are JSON objects. (That's what I'm going to do in the pybi spec next time I get around to updating it.)

Formalizing the current format does seem useful though. Sort of the HTML5 philosophy: you can't go back and fix the past, but at least you can formalize it and make it workable. In case it helps, I do have an formal, executable PEG grammar for METADATA files, though I haven't run it against all of PyPI or anything to check how robust it is: https://github.com/njsmith/posy/blob/main/src/vocab/rfc822ish.rs

pradyunsg commented 2 years ago

FWIW, inspect.cleandoc does exactly what we want, with the first line not being indented but the rest are.