sergot / http-useragent

Web user agent class for Perl 6.
MIT License
36 stars 39 forks source link

Decode header as ISO-8859-1 #237

Closed jjatria closed 3 years ago

jjatria commented 3 years ago

Before this change, requests to servers that send UTF-8 data (or any non-ASCII data) as a header value (such as the x-duck that http://icanhazip.com sets to 🦆) would be unparseable, and die with

Will not decode invalid ASCII (code point (-16) < 0 found)

error.

Although servers shouldn't be sending ducks in HTTP/1.1 headers, clients also shouldn't die when trying to parse those responses.

On this, RFC 7230 § 3.2.4 states:

Historically, HTTP has allowed field content with text in the ISO-8859-1 charset [ISO-8859-1], supporting other charsets only through use of [RFC2047] encoding. In practice, most HTTP header field values use only a subset of the US-ASCII charset [USASCII]. Newly defined header fields SHOULD limit their field values to US-ASCII octets. A recipient SHOULD treat other octets in field content (obs-text) as opaque data.

It is not clear what the spec suggests clients should do regarding this 'opaque data', but the solution in this patch means that responses can be parsed, and users that do want to recover whatever opaque data was sent in a header can do so by re-encoding the header value and decoding in whatever way is appropriate.

The change in this case is a little broader than ideal, since it interprets the entire header chunk (including the status line and all headers) as ISO-8859-1. Mitigations for this could be put in place if needed, but some more involved refactoring would likely be required.

For what it's worth, this solution will make HTTP::UserAgent produce similar results to Cro::HTTP::Client, and HTTP::Tiny. Curl-based Raku libraries will, on the other hand, decode the duck as UTF-8 which, although funnier, is arguably incorrect.

Fixes #236

ugexe commented 3 years ago

fwiw Perl libraries seem to match the curl behavior:

$ perl -MHTTP::Tiny -MData::Dumper -E 'say Dumper(HTTP::Tiny->new->get("http://icanhazip.com"))'

$VAR1 = {
          'protocol' => 'HTTP/1.1',
          'headers' => {
                         'x-rtfm' => 'Learn about this site at http://bit.ly/icanhazip-faq and do not abuse the service.',
                         'content-length' => '13',
                         'date' => 'Mon, 02 Nov 2020 21:37:47 GMT',
                         'x-duck' => '🦆',
                         'connection' => 'close',
                         'server' => 'nginx',
                         'access-control-allow-origin' => '*',
                         'content-type' => 'text/plain; charset=UTF-8',
                         'access-control-allow-methods' => 'GET',
                         'x-donation' => 'This site is expensive to run. You can donate BTC to 3LSp89k9qnMJBpV7AUNF3M2Eo1vatpkYpm',
                         'x-node' => 'icanhazip-iad-1'
                       },
          'success' => 1,
          'url' => 'http://icanhazip.com',
          'status' => '200',
          'content' => '204.60.6.160
',
          'reason' => 'OK'
        };
$ perl -MLWP::UserAgent -MData::Dumper -E 'say Dumper(LWP::UserAgent->new->get("http://icanhazip.com"))'

$VAR1 = bless( {
                 '_request' => bless( {
                                        '_headers' => bless( {
                                                               'user-agent' => 'libwww-perl/6.05'
                                                             }, 'HTTP::Headers' ),
                                        '_method' => 'GET',
                                        '_content' => '',
                                        '_uri' => bless( do{\(my $o = 'http://icanhazip.com')}, 'URI::http' ),
                                        '_uri_canonical' => bless( do{\(my $o = 'http://icanhazip.com/')}, 'URI::http' )
                                      }, 'HTTP::Request' ),
                 '_msg' => 'OK',
                 '_rc' => '200',
                 '_headers' => bless( {
                                        'server' => 'nginx',
                                        'content-type' => 'text/plain; charset=UTF-8',
                                        'date' => 'Mon, 02 Nov 2020 21:37:01 GMT',
                                        'access-control-allow-methods' => 'GET',
                                        'client-response-num' => 1,
                                        'client-date' => 'Mon, 02 Nov 2020 21:37:01 GMT',
                                        'x-rtfm' => 'Learn about this site at http://bit.ly/icanhazip-faq and do not abuse the service.',
                                        'access-control-allow-origin' => '*',
                                        'connection' => 'close',
                                        '::std_case' => {
                                                          'x-node' => 'X-Node',
                                                          'client-peer' => 'Client-Peer',
                                                          'x-duck' => 'X-Duck',
                                                          'x-donation' => 'X-Donation',
                                                          'access-control-allow-methods' => 'Access-Control-Allow-Methods',
                                                          'client-date' => 'Client-Date',
                                                          'client-response-num' => 'Client-Response-Num',
                                                          'x-rtfm' => 'X-Rtfm',
                                                          'access-control-allow-origin' => 'Access-Control-Allow-Origin'
                                                        },
                                        'content-length' => '13',
                                        'x-node' => 'icanhazip-iad-1',
                                        'client-peer' => '136.144.56.255:80',
                                        'x-donation' => 'This site is expensive to run. You can donate BTC to 3LSp89k9qnMJBpV7AUNF3M2Eo1vatpkYpm',
                                        'x-duck' => '🦆'
                                      }, 'HTTP::Headers' ),
                 '_protocol' => 'HTTP/1.1',
                 '_content' => '204.60.6.160
'
               }, 'HTTP::Response' );

I also tried parsing x-duck: 🦆 with https://github.com/ugexe/Perl6-Grammar--HTTP but it caused it to not parse any headers at all. Then I tried Net::HTTP and got the same results as the other Raku libraries:

$ raku -MNet::HTTP::GET -e 'say Net::HTTP::GET("http://icanhazip.com").header<X-Duck>.raku'

$["ð\x[9F]¦\x[86]"]

Which makes me wonder if all these Raku libraries are doing the best thing. Regardless, both of these behaviors are better than the existing behavior this replaces.

jjatria commented 3 years ago

Which makes me wonder if all these Raku libraries are doing the best thing.

Aside from decoding as ASCII or ISO-8859-1, I don't think parsing header values as UTF-8 is justified, at least based on the spec. What would you say would be "the best thing", @ugexe?