http-rs / http-types

Common types for HTTP operations
https://docs.rs/http-types
Apache License 2.0

UTF-8 mime-type constants don't work well with browsers' `Accept` header #371

Open lo48576 opened 3 years ago

lo48576 commented 3 years ago

Some of the predefined text MIME-type constants (such as `mime::HTML` and `mime::XML`) carry a `;charset=utf-8` parameter, but browsers' default `Accept` headers do not. (See "List of default Accept values - HTTP | MDN" for browsers' defaults.)

When a server (written in Rust) uses `mime::{HTML, XML, ...}` to represent its available content types, `content::Accept::negotiate()` does not work well with a browser's default `Accept` header, and it fails with a "No suitable Content-Type found" error.

```rust
use http_types::content::{Accept, MediaTypeProposal};
use http_types::{mime, Mime, Response};

let mime_html = "text/html".parse::<Mime>()?;
let mime_xhtml = "application/xhtml+xml".parse::<Mime>()?;

// A simplified version of a browser's default `Accept` value.
let mut browser_accept = Accept::new();
browser_accept.push(MediaTypeProposal::new(mime_html, None)?);
browser_accept.push(MediaTypeProposal::new(mime_xhtml, None)?);

// This is the server's default set of available content types.
let acceptable = &[mime::HTML];

let res = Response::new(200);
let content_type = browser_accept.negotiate(acceptable);

// I expected this to succeed, but it fails!
assert!(
    content_type.is_ok(),
    "server is expected to return HTML content"
);
```

So, the questions are:

1. Should a difference in parameters' presence cause negotiation to fail?
2. Are the UTF-8 constants intended to be used by servers?

Fishrock123 commented 3 years ago

@yoshuawuyts

yoshuawuyts commented 3 years ago

Hey, thanks for filing this. That's a really interesting question! I think your breakdown of the two questions is exactly right, and we should seek to answer those.

Should a difference in parameters' presence cause negotiation to fail?

RFC 7231, §3.1.1 has the following to say:

The presence or absence of a parameter might be significant to the processing of a media-type, depending on its definition within the media type registry.

That means that whether negotiation fails is up to the specific parameter. If we look in the registry for our HTML type, we can find the following:

The charset parameter may be provided to definitively specify the document's character encoding, overriding any character encoding declarations in the document. The parameter's value must be one of the labels of the character encoding used to serialize the file.

Similar text is provided for css, javascript, and xml. The way I'm interpreting this is: if an encoding is provided, it should be respected. This means we can infer the following rules:

|                                          | `Accept: text/html` | `Accept: text/html; charset=utf-8` |
|------------------------------------------|---------------------|------------------------------------|
| `Content-Type: text/html`                | ✅                  | ❌                                 |
| `Content-Type: text/html; charset=utf-8` | ✅                  | ✅                                 |

The only case that fails is if a client demands a specific encoding, but we cannot guarantee we'll use that charset. [1] The way I'm thinking about this is in terms of "specificity". The client makes demands, and if we can provide more specific values than what the client demands that is okay.
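This "specificity" rule can be sketched as a standalone predicate in plain Rust (no http-types dependency; `parse_media_type` and `is_acceptable` are hypothetical names for illustration, not library API):

```rust
use std::collections::HashMap;

// Parse "text/html; charset=utf-8" into (essence, parameters).
fn parse_media_type(s: &str) -> (String, HashMap<String, String>) {
    let mut parts = s.split(';');
    let essence = parts.next().unwrap_or("").trim().to_ascii_lowercase();
    let params = parts
        .filter_map(|p| {
            let (k, v) = p.split_once('=')?;
            Some((k.trim().to_ascii_lowercase(), v.trim().to_ascii_lowercase()))
        })
        .collect();
    (essence, params)
}

// Specificity rule: the server's offer is acceptable when its essence matches
// and it satisfies every parameter the client demanded. Extra parameters on
// the offer (e.g. an explicit charset) are fine.
fn is_acceptable(accept: &str, offer: &str) -> bool {
    let (a_essence, a_params) = parse_media_type(accept);
    let (o_essence, o_params) = parse_media_type(offer);
    a_essence == o_essence
        && a_params.iter().all(|(k, v)| o_params.get(k) == Some(v))
}

fn main() {
    // Client sends bare `text/html`; server offers UTF-8-tagged HTML: ok.
    assert!(is_acceptable("text/html", "text/html; charset=utf-8"));
    // Client demands UTF-8; server cannot guarantee it: not ok.
    assert!(!is_acceptable("text/html; charset=utf-8", "text/html"));
    println!("ok");
}
```

Note the check is deliberately asymmetric: demands flow from the client's proposal to the server's offer, never the other way around.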

Are utf-8 constants intended to be used by servers?

They definitely are intended to. But whether we're doing a good job at that is a different question. Given the rules we've found in the section above, I think the example in https://github.com/http-rs/http-types/issues/371#issue-941323576 should be made to work.

This means we don't need to make any changes to our types, but instead to the way we perform the comparison. I don't know if the right approach would be to somehow override methods, special case this information in Accept::negotiate, or perhaps something else. Folks are welcome to propose and implement solutions for this [2]!


[1]: Please verify that this is correct. I'm just now reading up on this, and I'm sharing sources so others can make sure what I'm saying is right (:

[2]: Some negative design space here: I don't think we should override the PartialEq/Eq trait impls for the sake of content negotiation. text/html and text/html; charset=utf-8 are different media types, and equality checks should continue to reflect that. We should find a different way of implementing this.
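One such "different way" would be a dedicated compatibility method alongside strict equality, so Eq keeps its current meaning. A sketch (the `MediaType` type and `satisfies` method are hypothetical, not http-types' Mime):

```rust
use std::collections::BTreeMap;

// Strict equality stays derived: two media types are equal only if their
// essences and full parameter sets match.
#[derive(Debug, PartialEq, Eq)]
struct MediaType {
    essence: String,
    params: BTreeMap<String, String>,
}

impl MediaType {
    fn parse(s: &str) -> Self {
        let mut it = s.split(';');
        let essence = it.next().unwrap_or("").trim().to_ascii_lowercase();
        let params = it
            .filter_map(|p| {
                let (k, v) = p.split_once('=')?;
                Some((k.trim().to_ascii_lowercase(), v.trim().to_ascii_lowercase()))
            })
            .collect();
        Self { essence, params }
    }

    // Looser check used only during negotiation: `self` (the offer) satisfies
    // `accept` when essences match and every demanded parameter is honored.
    fn satisfies(&self, accept: &MediaType) -> bool {
        self.essence == accept.essence
            && accept.params.iter().all(|(k, v)| self.params.get(k) == Some(v))
    }
}

fn main() {
    let plain = MediaType::parse("text/html");
    let utf8 = MediaType::parse("text/html; charset=utf-8");
    // Strict equality still distinguishes the two...
    assert_ne!(plain, utf8);
    // ...while negotiation treats the more specific offer as acceptable.
    assert!(utf8.satisfies(&plain));
    assert!(!plain.satisfies(&utf8));
}
```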