Allow decoding of JSON messages in non-standard character sets

codehead commented 10 months ago

Summary

This PR modifies Mojo::Message to allow JSON decoding when a charset is specified. Currently all messages are decoded as UTF-8, so JSON messages with other encodings may fail silently.

Motivation

Mojo::UserAgent always tries to decode JSON messages as UTF-8, ignoring the message character set. When a message body is not valid UTF-8, Mojo::JSON decoding fails silently and Mojo::Message::json() returns undef. In character sets that overlap with UTF-8 over the ASCII range, such as ISO-8859-1, decoding fails only when non-ASCII characters such as accented letters are present, so JSON messages might seem empty at random. This matters when interfacing with legacy systems that expose JSON messages in charsets other than UTF-8.
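The failure mode described above can be reproduced without a network round trip. This is a minimal sketch, not code from the PR: the payload and the Content-Type value are made up for illustration, and it assumes a Mojolicious version where ->json returns undef when decoding fails.

```perl
use Mojo::Base -strict;
use Mojo::Message::Response;
use Mojo::Util qw(encode);

# Hypothetical legacy payload: JSON text encoded as ISO-8859-1 rather than UTF-8
my $res = Mojo::Message::Response->new;
$res->headers->content_type('application/json; charset=ISO-8859-1');
$res->body(encode 'ISO-8859-1', qq({"name":"Jos\x{e9}"}));

# ->text honours the charset from Content-Type and yields the JSON text as characters
say length $res->text;

# ->json treats the body as UTF-8; the lone 0xE9 byte is invalid there,
# so decoding is expected to fail and undef comes back without any error
say defined $res->json ? 'decoded' : 'undef';
```

If the name contained only ASCII characters, the same body would be valid UTF-8 and ->json would succeed, which is why the problem appears to strike at random.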

References

No public issues or PRs for Mojolicious as far as I'm aware. Other Perl frameworks have allowed non-UTF-8 charsets for JSON messages from very early on, e.g. Catalyst::View::JSON. Some Java frameworks defaulted to ISO-8859-1 encoding as recently as 2020: Content Type being append with charset=ISO-8859-1 #1428. Even though RFC 4627 tried to standardize UTF-8 encoding for JSON messages as far back as 2006 (at around the same time some frameworks implemented charset specification to allow non-ASCII characters in JSON), RFC 8259's language leaves enough slack for systems that are part of a closed ecosystem.

jberger commented 10 months ago

I'm out of town and reading on my phone, so I'm not going to do a full review, but I think I agree with @kraih that there are things in there that don't pass the smell test.

Grinnz commented 10 months ago

Use of ->json is optional. In my opinion if you have a legacy system with nonstandard JSON encoding, you should just use from_json(decode($charset, $msg->body))
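A minimal sketch of that two-call approach, for readers who want to see it spelled out. The helper name json_with_charset, the UTF-8 default, and the sample payload are assumptions made for illustration; none of them are part of Mojolicious.

```perl
use Mojo::Base -strict;
use Mojo::JSON qw(from_json);
use Mojo::Message::Response;
use Mojo::Util qw(decode encode);

# Hypothetical helper (not part of Mojolicious): decode the body with the
# declared charset, defaulting to UTF-8, then parse the characters as JSON
sub json_with_charset {
  my $msg     = shift;
  my $charset = $msg->content->charset // 'UTF-8';
  my $text    = decode $charset, $msg->body;
  die "Body is not valid $charset\n" unless defined $text;
  return from_json $text;    # dies on malformed JSON instead of returning undef
}

# Hypothetical legacy response carrying ISO-8859-1 encoded JSON
my $res = Mojo::Message::Response->new;
$res->headers->content_type('application/json; charset=ISO-8859-1');
$res->body(encode 'ISO-8859-1', qq({"name":"Jos\x{e9}"}));

my $data = json_with_charset($res);
say length $data->{name};
```

Because from_json raises an exception rather than returning undef, a body that cannot be decoded or parsed fails loudly at the call site.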

codehead commented 10 months ago

> Use of ->json is optional. In my opinion if you have a legacy system with nonstandard JSON encoding, you should just use from_json(decode($charset, $msg->body))

Thank you @Grinnz. With all due respect, use of ->text is also optional, and it does take into account the charset specified in the message and falls back to a reasonable value should decoding fail. And that syntactic sugar is very, very welcome. I hope the same can be achieved with ->json in the name of cross-framework, cross-language, cross-charset interoperability.

The thing is, you don't know you are dealing with this issue (I still refuse to call it a bug) until it bites you. On a random message. That might be irreproducible. From a system you don't control. IMHO it would be more useful to have json() die loudly upon trying to decode a non-UTF-8 message than to have it fail silently.

Grinnz commented 10 months ago

> > Use of ->json is optional. In my opinion if you have a legacy system with nonstandard JSON encoding, you should just use from_json(decode($charset, $msg->body))

> Thank you @Grinnz. With all due respect, use of ->text is also optional, and it does take into account the charset specified in the message and falls back to a reasonable value should decoding fail. And that syntactic sugar is very, very welcome. I hope the same can be achieved with ->json in the name of cross-framework, cross-language, cross-charset interoperability.

I don't see the comparison. The only purpose of ->text is to decode from the specified charset. The only purpose of ->json is to decode from spec-compliant JSON.

> The thing is, you don't know you are dealing with this issue (I still refuse to call it a bug) until it bites you. On a random message. That might be irreproducible. From a system you don't control. IMHO it would be more useful to have json() die loudly upon trying to decode a non-UTF-8 message than to have it fail silently.

There are plenty of issues one may run into in real-world use cases; the framework has to balance anticipating these with making the common and compliant case simple and efficient. In this case, calling two functions seems like a simple option compared to complicating and slowing down the ->json method for all users in the common case where you receive spec-compliant JSON.

jberger commented 10 months ago

> > > Use of ->json is optional. In my opinion if you have a legacy system with nonstandard JSON encoding, you should just use from_json(decode($charset, $msg->body))

> > Thank you @Grinnz. With all due respect, use of ->text is also optional, and it does take into account the charset specified in the message and falls back to a reasonable value should decoding fail. And that syntactic sugar is very, very welcome. I hope the same can be achieved with ->json in the name of cross-framework, cross-language, cross-charset interoperability.

> I don't see the comparison. The only purpose of ->text is to decode from the specified charset. The only purpose of ->json is to decode from spec-compliant JSON.

> > The thing is, you don't know you are dealing with this issue (I still refuse to call it a bug) until it bites you. On a random message. That might be irreproducible. From a system you don't control. IMHO it would be more useful to have json() die loudly upon trying to decode a non-UTF-8 message than to have it fail silently.

> There are plenty of issues one may run into in real-world use cases; the framework has to balance anticipating these with making the common and compliant case simple and efficient. In this case, calling two functions seems like a simple option compared to complicating and slowing down the ->json method for all users in the common case where you receive spec-compliant JSON.

I concur with @Grinnz: there is a difference in that "text" is specified to be in a known (configured) charset, but JSON is by definition UTF-8 encoded. As such, bytes that are JSON-like but encoded in some other charset are technically not JSON. I know that's not a very satisfactory answer, and it sounds dismissive, but given that difference it seems reasonable to make the spec-compliant mechanism as simple and fast as possible while still making your use case possible, if not quite as convenient. It seems that is already the case.

kraih commented 10 months ago

I'm afraid it looks like this PR will not pass the vote.

mojolicious / mojo