playframework / play-ws

Standalone Play WS, an async HTTP client with fluent API
https://www.playframework.com/documentation/latest/JavaWS
Apache License 2.0
221 stars 87 forks source link

json responses with 'application/json' and no charset are parsed in ISO-8859-1 charset #150

Open maxcom opened 7 years ago

maxcom commented 7 years ago

JSON in HTTP are always encoded in UTF-8. Responses are parsed correctly when server writes content type header like application/json; charset=utf-8.

However, many servers (like Play framework itself) uses application/json without charset. In Play 2.6, that responses are parsed in ISO-8859-1 charset.

I think that problem is in JsonBodyReadables implementation. Json.parse(response.body) should be replaced with Json.parse(response.bodyAsBytes.utf8string). Charset for response.body is always ISO-8859-1 in async http client when no charset info is provided in content type header.

wsargent commented 7 years ago

Good catch.

gmethvin commented 7 years ago

Technically, JSON is not always UTF-8, but it is always Unicode, and application/json does not define a charset parameter. I would use Json.parse(response.bodyAsBytes.toArray) directly, since it will detect the encoding.

jsw commented 7 years ago

I've run into the same issue, where response.body does not correctly handle an emoji character. I'm not using play json to map the response body (manual jackson with scala module). Is the appropriate approach to use response.bodyAsBytes.toArray to get the raw bytes, which I can pass to Jackson, and use new String(response.bodyAsBytes.toArray) to get something for a log statement? I'm guessing that response.body is causing play to parse to JSON and then output to String, which is causing the encoding issue.

Malax commented 7 years ago

Is there an ETA on a release that includes this fix?

wsargent commented 7 years ago

@Malax within the next couple of weeks -- however, do note that due to the way that the standalone client works, it's fairly easy to create your own custom body readables and writables as a workaround.

Malax commented 7 years ago

@wsargent Thanks for the reply! 👍 If it's in the next couple of weeks, we will just delay the migration to 2.6 until then.

jsw commented 7 years ago

@wsargent Can you comment on my workaround above for accessing the raw bytes when I don't want to use the play JSON parser?

wsargent commented 7 years ago

@jsw You are broadly correct -- the best option is to use Json.parse(response.bodyAsBytes.toArray) which does UTF-8/UTF-16 encoding detection correctly.

avdv commented 7 years ago

You are broadly correct -- the best option is to use Json.parse(response.bodyAsBytes.toArray) which does UTF-8/UTF-16 encoding detection correctly.

Would it be feasible to avoid the Array[Byte] -> ByteString -> Array[Byte] copying by using the "underlying" response?!

jsw commented 7 years ago

Unless I'm doing things wrong, I think the issue still exists in play-ws 1.0.1.

import org.scalatestplus.play.PlaySpec
import play.api.mvc.Action
import play.api.mvc.Results.Ok
import play.api.test.Helpers._
import play.api.test.WsTestClient
import play.core.server.Server
import scala.concurrent.ExecutionContext.Implicits.global

class JsonEncodingTest extends PlaySpec {
  "ws" should {
    "handle utf8" in {
      Server.withRouter() {
        case _  => Action { Ok("""{"foo": "👍"}""").as(JSON) }
      } { implicit port =>
        WsTestClient.withClient { client =>
          println(await(client.url("/").get().map(_.body.toString))) // unexpected output
          println(await(client.url("/").get().map(r => new String(r.bodyAsBytes.toArray)))) // expected output
        }
      }
    }
  }
}

Output

{"foo": "👍"}
{"foo": "👍"}
wsargent commented 7 years ago

You should be calling

client.url("/").get().map(r => Json.parse(r.bodyAsBytes.toArray))
wsargent commented 7 years ago

Would it be feasible to avoid the Array[Byte] -> ByteString -> Array[Byte] copying by using the "underlying" response?!

In theory yes, in practice array copying usually doesn't yield much in the way of performance improvements because the JVM is pretty effective at array copying:

https://psy-lob-saw.blogspot.com/2015/04/on-arraysfill-intrinsics-superword-and.html

If you're interested, you can poke at the JVM overhead with JMH:

https://psy-lob-saw.blogspot.com/p/jmh-related-posts.html

There's a sbt plugin at https://github.com/ktoso/sbt-jmh that will help.

jsw commented 7 years ago

@wsargent My use cases are:

Given that, I'm not sure why I would want to do an extra parse via Json.parse and then convert that back to a String. Can you elaborate?

Also, can you explain why r.body.toString produces different output than new String(r.bodyAsBytes.toArray)?

avdv commented 7 years ago

@wsargent Thanks for that valuable information. :+1:

Given that, I'm not sure why I would want to do an extra parse via Json.parse and then convert that back to a String. Can you elaborate?

Json.parse detects the encoding of the data and decodes them appropriately.

Also, can you explain why r.body.toString produces different output than new String(r.bodyAsBytes.toArray)?

r.body decodes the data using the iso-8859-1 encoding which is the default behaviour of the AsyncHttpClient, AFAIK.

new String(...) uses the default platform encoding to decode the data, which is probably UTF-8 on Linux, CP1250 on Windows, et cetera.

gmethvin commented 7 years ago

The remaining problem being discussed here is basically that the body: String method is not smart enough to understand the rules of every content type, and I'm not sure if it should be. There are a few options here:

tartakynov commented 6 years ago

The issue still exists in play-ws 1.1.1. Another possible solution would be to change the body type to ByteString :speak_no_evil:

fommil commented 6 years ago

somewhat related... how can we make the server include the charset instead of trying to be too smart for its own good?

richdougherty commented 6 years ago

See possible fix on the Play server side: https://github.com/playframework/playframework/issues/8239