google / guava

Google core libraries for Java
Apache License 2.0
50.04k stars 10.86k forks source link

`MediaType.parse` doesn't like Cyrillic #6339

Closed boris-petrov closed 1 year ago

boris-petrov commented 1 year ago

Guava 31.1-jre, JDK 19.

MediaType.parse("application/pdf; name*0=\"кирилица\"")

Leads to an exception:

java.lang.IllegalArgumentException: Could not parse 'application/pdf; name*0="кирилица"'
    at com.google.common.net.MediaType.parse(MediaType.java:1078)
    at ...
Caused by: java.lang.IllegalStateException
    at com.google.common.base.Preconditions.checkState(Preconditions.java:486)
    at com.google.common.net.MediaType$Tokenizer.consumeToken(MediaType.java:1101)
    at com.google.common.net.MediaType.parse(MediaType.java:1066)
    ... 2 more

I believe this is a valid mime-type declaration? I saw it in the wild coming from Java Mail.

nick-someone commented 1 year ago

From my reading of RFC 2045, which governs the shape of the parameters and values, the allowed tokens are US-ASCII characters: https://www.rfc-editor.org/rfc/rfc2045#section-5.1

token := 1*<any (US-ASCII) CHAR except SPACE, CTLs, or tspecials>

On Fri, Feb 24, 2023 at 1:34 AM Boris Petrov @.***> wrote:

Guava 31.1-jre, JDK 19.

MediaType.parse("application/pdf; name*0=\"кирилица\"")

Leads to an exception:

java.lang.IllegalArgumentException: Could not parse 'application/pdf; name*0="кирилица"'

at com.google.common.net.MediaType.parse(MediaType.java:1078)

at ...

Caused by: java.lang.IllegalStateException

at com.google.common.base.Preconditions.checkState(Preconditions.java:486)

at com.google.common.net.MediaType$Tokenizer.consumeToken(MediaType.java:1101)

at com.google.common.net.MediaType.parse(MediaType.java:1066)

... 2 more

I believe this is a valid mime-type declaration? I saw it in the wild coming from Java Mail.

— Reply to this email directly, view it on GitHub https://github.com/google/guava/issues/6339, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAEKCYLODWKPRVMYRD3356LWZBP7TANCNFSM6AAAAAAVGSHGNA . You are receiving this because you are subscribed to this thread.Message ID: @.***>

netdpb commented 1 year ago

But the quoted-string content doesn't appear to be defined.

cpovirk commented 1 year ago

I suspect that the definition is pulled from https://www.rfc-editor.org/rfc/rfc822, which looks like it's restricted to ASCII.

netdpb commented 1 year ago

I was editing my comment to say that. And yes, section 3.3 says its contents are CHARs, which are restricted to ASCII.

netdpb commented 1 year ago

Now perhaps MediaType should be more lenient. What's the intention there?

cpovirk commented 1 year ago

I don't remember anything specifically, but I'd guess that our general preference is to be strict if it's not causing problems and then to reevaluate when we hear about those problems.

I pulled probably a dozen MediaType IllegalArgumentException bugs up from a quick internal bug search, and none appeared to be about non-ASCII characters, only about other malformed inputs. Now, there were other such bugs that I didn't look at, but I didn't see immediate signs that this is common.

It's especially scary to let unusual characters into headers, since different tools may interpret them differently. I'd probably err on the side of leaving things as they are unless we find more evidence of problems, at which point we might try letting in specific defined ranges of additional characters.

netdpb commented 1 year ago

I'm inclined to agree with you. Closing this as WAI, but if you have more data to suggest that this occurs frequently in the wild, @boris-petrov, please reopen.