python / cpython

The Python programming language
https://www.python.org
Other
63.45k stars 30.38k forks source link

Header access methods in EmailMessage return strings, not header objects #101090

Open bkline opened 1 year ago

bkline commented 1 year ago

Summary

The behavior of the EmailMessage class does not match what the documentation says it will do when code using the class tries to fetch a header parameter value.

The class has a get_param() method, but that's defined in a base class, and the documentation says that the base class (Message) should only be used when maintaining legacy code. Instead, we're told:

Note that "existing parameter values of headers may be accessed through the params attribute of the header value (for example, msg['Content-Type'].params['charset']).

The problem is that msg['Content-Type'] returns a string, not a header object, which Python correctly complains "has no attribute 'params'."

It's not obvious (to me, anyway) why set_param() (also defined in the same base class) is OK to use, but not get_param() (which, by the way, actually works, in contrast with the approach given in the documentation). The docstring for get_param() (which I used to think of as the "canonical documentation" until one of the core devs gently set me straight) gives no indication that using get_param() is frowned upon, but the really canonical documentation says

This is a legacy method. On the EmailMessage class its functionality is replaced by the params property of the individual header objects returned by the header access methods.

I suppose this ticket could go either way: a bug in the code (which should be returning objects instead of strings for the header access methods) or a bug in the documentation. I went with the first option, since if the documentation is wrong and the header access methods really are supposed to return strings, then something will still need to be done with the code to expose the needed functionality. There's a comment on _get_params_preserve() (which get_params() and get_param() both use to do their thing) asking if it should be part of the public interface, so that would be one solution (though I'd still want to know why get_param() shouldn't be used).

Environment

Context

I ended up in this module when I noticed that the cgi module has been deprecated.

Deprecated since version 3.11, will be removed in version 3.13: The cgi module is deprecated (see PEP 594 for details and alternatives).

The FieldStorage class can typically be replaced with urllib.parse.parse_qsl() for GET and HEAD requests, and the email.message module or multipart for POST and PUT. Most utility functions have replacements.

That's about it for guidance on how to make the transition. I feel a little bit like as if I'm being thrown back to my experiences as a young dad trying to assemble the toys on Christmas Eve with instructions written in a foreign language. 😅

vasile-gh commented 4 months ago

When creating the message object make sure it is of the type EmailMessage, not Message. You do that by explicitly using the policy named default at message creation time.

After that you will have the right result type for the header access methods.

steve-mavens commented 4 months ago

I think the correct code for replacing cgi.parse_header is like the following:

>>> message = EmailMessage()
>>> message['Content-Type'] = 'application/json; charset=utf8'  # or whatever
# The first part of the tuple returned by parse_header is the value, so in this case the content-type
>>> message.get_content_type()
'application/json'
# The second part the tuple is the dictionary of params
>>> message['Content-Type'].params
mappingproxy({'charset': 'utf8'})
>>> message['Content-Type'].params.get('charset', 'ascii')
'utf8'

This works for me in Python 3.12 (and specifically, message['Content-Type'] is not a string, it's a email.headerregistry._ContentTypeHeader).

For those who were using cgi.parse_header to parse something other than content-type, then of course you can't use get_content_type() to get the value. Each of the parameterized MIME headers has its own subclass of email.headerregistry.ParameterizedMIMEHeader with a different name for the attribute which contains the actual value excluding parameters. So, I don't think the documentation really considers the general case of "I was using cgi.parse_header and now it's deprecated". Each MIME header is its own special case, so perhaps the authors deemed that there isn't much you can do with a "generic MIME header", and therefore no use case for providing a code snippet demonstrating a straight replacement of cgi.parse_header.

steve-mavens commented 4 months ago

"I feel a little bit like as if I'm being thrown back to my experiences as a young dad trying to assemble the toys" - indeed, the migration path seems to be: "Read the entire documentation for email.message and email.headerregistry. The answer for your specific use case is in there somewhere" ;-)

bkline commented 4 months ago

It doesn't help that the documentation for the cgi module says

The cgi module is deprecated (see PEP 594 for details and alternatives).

and that PEP tells us to use email.message.Message to replace cgi.parse_header and cgi.parse_multipart.