json responses with 'application/json' and no charset are parsed in ISO-8859-1 charset

maxcom commented 7 years ago

JSON in HTTP are always encoded in UTF-8. Responses are parsed correctly when server writes content type header like application/json; charset=utf-8.

However, many servers (like Play framework itself) uses application/json without charset. In Play 2.6, that responses are parsed in ISO-8859-1 charset.

I think that problem is in JsonBodyReadables implementation. Json.parse(response.body) should be replaced with Json.parse(response.bodyAsBytes.utf8string). Charset for response.body is always ISO-8859-1 in async http client when no charset info is provided in content type header.

wsargent commented 7 years ago

Good catch.

gmethvin commented 7 years ago

Technically, JSON is not always UTF-8, but it is always Unicode, and application/json does not define a charset parameter. I would use Json.parse(response.bodyAsBytes.toArray) directly, since it will detect the encoding.

jsw commented 7 years ago

I've run into the same issue, where response.body does not correctly handle an emoji character. I'm not using play json to map the response body (manual jackson with scala module). Is the appropriate approach to use response.bodyAsBytes.toArray to get the raw bytes, which I can pass to Jackson, and use new String(response.bodyAsBytes.toArray) to get something for a log statement? I'm guessing that response.body is causing play to parse to JSON and then output to String, which is causing the encoding issue.

Malax commented 7 years ago

Is there an ETA on a release that includes this fix?

wsargent commented 7 years ago

@Malax within the next couple of weeks -- however, do note that due to the way that the standalone client works, it's fairly easy to create your own custom body readables and writables as a workaround.

Malax commented 7 years ago

@wsargent Thanks for the reply! 👍 If it's in the next couple of weeks, we will just delay the migration to 2.6 until then.

jsw commented 7 years ago

@wsargent Can you comment on my workaround above for accessing the raw bytes when I don't want to use the play JSON parser?

wsargent commented 7 years ago

@jsw You are broadly correct -- the best option is to use Json.parse(response.bodyAsBytes.toArray) which does UTF-8/UTF-16 encoding detection correctly.

avdv commented 7 years ago

You are broadly correct -- the best option is to use Json.parse(response.bodyAsBytes.toArray) which does UTF-8/UTF-16 encoding detection correctly.

Would it be feasible to avoid the Array[Byte] -> ByteString -> Array[Byte] copying by using the "underlying" response?!

jsw commented 7 years ago

Unless I'm doing things wrong, I think the issue still exists in play-ws 1.0.1.

import org.scalatestplus.play.PlaySpec
import play.api.mvc.Action
import play.api.mvc.Results.Ok
import play.api.test.Helpers._
import play.api.test.WsTestClient
import play.core.server.Server
import scala.concurrent.ExecutionContext.Implicits.global

class JsonEncodingTest extends PlaySpec {
  "ws" should {
    "handle utf8" in {
      Server.withRouter() {
        case _  => Action { Ok("""{"foo": "👍"}""").as(JSON) }
      } { implicit port =>
        WsTestClient.withClient { client =>
          println(await(client.url("/").get().map(_.body.toString))) // unexpected output
          println(await(client.url("/").get().map(r => new String(r.bodyAsBytes.toArray)))) // expected output
        }
      }
    }
  }
}

Output

{"foo": "ð"}
{"foo": "👍"}

wsargent commented 7 years ago

You should be calling

client.url("/").get().map(r => Json.parse(r.bodyAsBytes.toArray))

wsargent commented 7 years ago

Would it be feasible to avoid the Array[Byte] -> ByteString -> Array[Byte] copying by using the "underlying" response?!

In theory yes, in practice array copying usually doesn't yield much in the way of performance improvements because the JVM is pretty effective at array copying:

https://psy-lob-saw.blogspot.com/2015/04/on-arraysfill-intrinsics-superword-and.html

If you're interested, you can poke at the JVM overhead with JMH:

https://psy-lob-saw.blogspot.com/p/jmh-related-posts.html

There's a sbt plugin at https://github.com/ktoso/sbt-jmh that will help.

jsw commented 7 years ago

@wsargent My use cases are:

obtain a String or Array[Byte] to pass to Jackson mapper (I'm not using play-json)
obtain a String suitable for logging the response body

Given that, I'm not sure why I would want to do an extra parse via Json.parse and then convert that back to a String. Can you elaborate?

Also, can you explain why r.body.toString produces different output than new String(r.bodyAsBytes.toArray)?

avdv commented 7 years ago

@wsargent Thanks for that valuable information. :+1:

Given that, I'm not sure why I would want to do an extra parse via Json.parse and then convert that back to a String. Can you elaborate?

Json.parse detects the encoding of the data and decodes them appropriately.

Also, can you explain why r.body.toString produces different output than new String(r.bodyAsBytes.toArray)?

r.body decodes the data using the iso-8859-1 encoding which is the default behaviour of the AsyncHttpClient, AFAIK.

new String(...) uses the default platform encoding to decode the data, which is probably UTF-8 on Linux, CP1250 on Windows, et cetera.

gmethvin commented 7 years ago

The remaining problem being discussed here is basically that the body: String method is not smart enough to understand the rules of every content type, and I'm not sure if it should be. There are a few options here:

Restore the old behavior in which we read the charset parameter (regardless of whether the media type defines one) and default to UTF-8 for non text content types and ISO-8859-1 for text. That would cover a lot of cases, but it would still be broken for JSON sent as UTF-16, and virtually any other content type that allows charsets to be declared in another way (e.g. HTML can have charsets embedded in meta tags).
Instead of having a default implicit BodyReadable[String], we can have multiple String BodyReadables for specific cases, including one for parsing a JSON body to a string. That could use the logic here for charset detection: https://github.com/playframework/play-ws/pull/151/files/9155ef716b3c1d5699783e2aa2edf1c8b8260970#diff-0b357986f490a94794eef6c192426fd9R29

tartakynov commented 6 years ago

The issue still exists in play-ws 1.1.1. Another possible solution would be to change the body type to ByteString :speak_no_evil:

fommil commented 6 years ago

somewhat related... how can we make the server include the charset instead of trying to be too smart for its own good?

richdougherty commented 6 years ago

See possible fix on the Play server side: https://github.com/playframework/playframework/issues/8239

playframework / play-ws

json responses with 'application/json' and no charset are parsed in ISO-8859-1 charset #150