content-encoding aware string parsing & encoding

http-rs / http-types

Common types for HTTP operations

https://docs.rs/http-types

Apache License 2.0

200 stars 83 forks source link

content-encoding aware string parsing & encoding #34

Open yoshuawuyts opened 4 years ago

yoshuawuyts commented 4 years ago

Similar to https://github.com/http-rs/surf/issues/101 and https://github.com/http-rs/surf/pull/108 we should make string parsing and encoding aware of the Content-Type header.

cc/ @goto-bus-stop; pinging you here since you contributed the patch to Surf; we're planning to move Tide and Surf both over to http-types, so figured it'd make sense to include the work you've done here.

rylev commented 4 years ago

Talked with @yoshuawuyts, and we believe the point of integration is overwriting to the read_to_string method for AsynRead on Request and Response. Instead of returning an error if the underlying bytes are not utf-8, we'll make a best attempt to convert to utf-8 based on the content-type header

goto-bus-stop commented 4 years ago

I think this should be a property of the Body now right? like res.into_body().read_to_string(&mut s)

i'm tinkering with making a encoding_rs_io but for AsyncRead, so you can do res.into_body().into_utf8_reader() to stream utf-8 transcoded bytes, or into_utf8_reader().read_to_string() to collect it all into a big string.

on the http-types end, the Body needs to be made aware of the actual mime type, since right now it appears to always be octet-stream?

yoshuawuyts commented 4 years ago

@goto-bus-stop What I'm wondering about is: are there any cases where we wouldn't want to decode based on the encoding type?

I suspect the answer here might be "probably not", which is: maybe we should handle this out of the box? So the proposed API would change:

Previous proposal

let req = Request::new(Url::new("https://example.com"));
let reader = req.into_utf8_reader();
let mut s = String::new();
reader.read_to_string(&mut s).await?;

Current proposal

let req = Request::new(Url::new("https://example.com"));
let mut s = String::new();
req.read_to_string(&mut s).await?;

The biggest question remaining then seems: do we need an escape hatch? I'm not sure tbh. Perhaps there could be a flag on the Body that disables the decoder?

let req = Request::new(Url::new("https://example.com"));

let mut body = req.into_body();
body.decoder(false);

let mut s = String::new();
body.read_to_string(&mut s).await?;

yoshuawuyts commented 4 years ago

on the http-types end, the Body needs to be made aware of the actual mime type, since right now it appears to always be octet-stream?

When encoding a stream we indeed don't assume anything, and set octet-stream. But for strings we correctly set text/plain; utf-8, and JSON similarly application/json; utf-8.

When decoding a stream we don't do anything with text encodings. Most of my previous comment was about decoding only. Is there anything about encoding you think we should be doing?

goto-bus-stop commented 4 years ago

for encoding we probably don't need non-utf8 support (certainly not initially). it would be kinda strange in 2020 to write a server in a cutting-edge language and framework if you need to service clients that don't support utf8 (do they exist?).

i was primarily wondering about decoding as well, since the Body does not know what its encoding is when it receives a response (assuming http-types use on the client-side). I think the Response type could apply the mime type to the Body from inside the set_header function but i'm not totally sure if that's correct.

I don't think we need a custom escape hatch since you can already get the raw bytes. If you really want to use utf8 for some reason even if the server explicitly advertises a different encoding, you can do String::from_utf8(body.read_to_end()), i guess.

yoshuawuyts commented 4 years ago

I think the Response type could apply the mime type to the Body from inside the set_header function but i'm not totally sure if that's correct.

I think we could indeed use set_header for this. There's a slight nuance because Request is shared between both client and server, and we only want to use encodings on the client. But that's perhaps a matter of exposing a flag we can enable/disable on one side, but not the other.