psf / requests

A simple, yet elegant, HTTP library.
https://requests.readthedocs.io/en/latest/
Apache License 2.0

Remove ISO-8859-1 charset fallback #2086

Open Lukasa opened 10 years ago

Lukasa commented 10 years ago

For a long time we've had a fallback value in response.encoding of ISO-8859-1, because RFC 2616 told us to. RFC 2616 is now obsolete, replaced by RFCs 7230, 7231, 7232, 7233, 7234, and 7235. The authoritative RFC on this issue is RFC 7231, which has this to say:

The default charset of ISO-8859-1 for text media types has been removed; the default is now whatever the media type definition says.

The media type definitions for text/* are most recently affected by RFC 6657, which has this to say:

In accordance with option (a) above, registrations for "text/*" media types that can transport charset information inside the corresponding payloads (such as "text/html" and "text/xml") SHOULD NOT specify the use of a "charset" parameter, nor any default value, in order to avoid conflicting interpretations should the "charset" parameter value and the value specified in the payload disagree.

I checked the registration for text/html here. Unsurprisingly, it provides no default values. It does allow a charset parameter which overrides anything in the content itself.

I propose the following changes:

  1. Remove the ISO-8859-1 fallback, as it's no longer valid (being only enforced by RFC 2616). We should definitely do this.
  2. Consider writing a module that has the appropriate fallback encodings for other text/* content and use them where appropriate. This isn't vital, just is a "might be nice".
  3. Begin checking HTML content for meta tags again, in order to appropriately fall back. This is controversial, and we'll want @kennethreitz to consider it carefully.
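
For reference, a quick sketch of what the fallback in question does today, using the existing helper in requests.utils (the headers here are made up):

from requests.structures import CaseInsensitiveDict
from requests.utils import get_encoding_from_headers

# current behaviour: a text/* type with no charset parameter gets the RFC 2616 fallback
print(get_encoding_from_headers(CaseInsensitiveDict({"Content-Type": "text/html"})))                 # 'ISO-8859-1'
print(get_encoding_from_headers(CaseInsensitiveDict({"Content-Type": "text/html; charset=utf-8"})))  # 'utf-8'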

sigmavirus24 commented 10 years ago

Remove the ISO-8859-1 fallback, as it's no longer valid (being only enforced by RFC 2616). We should definitely do this.

I agree that we should remove the fallback. I do wonder how we should handle Response#text in the event that the server does not specify a charset (in any way, including the meta tags of the body). Should we conditionally disable Response#text, either through an exception or something else? Not doing so will rely more heavily on chardet, which I have decreasing confidence in given the number of new encodings it does not detect.
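
For context, a rough way to see what chardet would pick for a given response (example.com is just a placeholder URL):

import requests

r = requests.get("https://example.com/")   # placeholder URL
print(r.encoding)            # from the headers, or today's ISO-8859-1 fallback
print(r.apparent_encoding)   # chardet's guess, which .text would lean on if the fallback went away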

Consider writing a module that has the appropriate fallback encodings for other text/* content and use them where appropriate. This isn't vital, just is a "might be nice".

Given that this is not guaranteed to be included in requests, I'm fine with adding it to the toolbelt. That said, I'm also okay with making this a separate package so users can just use that without having to install the rest of the toolbelt. That, however, is a separate discussion.

Begin checking HTML content for meta tags again, in order to appropriately fall back. This is controversial, and we'll want @kennethreitz to consider it carefully.

We still have a method to do this in utils, right? I don't like the idea in the slightest, but it won't cost extra effort. That said, we have to make sure any charset provided in the media type takes precedence.

sigmavirus24 commented 10 years ago

Upon reading more into RFC 7231, specifically Section 3.1.1.5, I think the third option should ideally be opt-in, not opt-out. My specific reasoning for this is:

Clients that do so [examine a payload's content] risk drawing incorrect conclusions, which might expose additional security risks (e.g., "privilege escalation").

Taken from the same section I linked above.

Lukasa commented 10 years ago

Agreed from a correctness perspective, but I wonder if @kennethreitz is going to want it from a usability perspective.

sigmavirus24 commented 10 years ago

I wonder how easy it would be to prop up a simple app to demonstrate the security risks involved to give a concrete example why not to do it.

untitaker commented 10 years ago

If only 1) is going to be implemented, I guess r.encoding will be None and requests will use chardet?

Lukasa commented 10 years ago

Correct. =)

sigmavirus24 commented 10 years ago

That's how it works now, so I don't think we'd change that.

gsnedders commented 10 years ago

I wonder how easy it would be to prop up a simple app to demonstrate the security risks involved to give a concrete example why not to do it.

Look up all the UTF-7 XSS attacks. (None work in any current browser, as everyone simply dropped UTF-7 sniffing — and most UTF-7 support entirely — to avoid making sites thus vulnerable.)

In a very real sense, using chardet is worse than option three above: it will draw different conclusions than any implementation following a specification that defines how to sniff the content (and both HTML and XML provide such a specification). The only safe thing to do, if you don't know how to determine the charset, is to not try. You can probably support the vast majority of users by implementing the (standardised) HTML and XML character encoding detection algorithms.

I checked the registration for text/html here. Unsurprisingly, it provides no default values. It does allow a charset parameter which overrides anything in the content itself.

Hmm, the registration (which is included inline in the HTML spec) contradicts the HTML spec itself — per the HTML spec, UTF-8/UTF-16 BOMs are given precedence over the MIME type. I've filed bug 26100 for that.

kennethreitz commented 10 years ago

Hmmm....

gsnedders commented 9 years ago

http://html5.org/tools/web-apps-tracker?from=8723&to=8724 fixes the IANA registration in the HTML spec to match the body of it. It now reads:

The charset parameter may be provided to specify the document's character encoding, overriding any character encoding declarations in the document other than a Byte Order Mark (BOM). The parameter's value must be one of the labels of the character encoding used to serialise the file. [ENCODING]

erowan commented 9 years ago

Hello,

I just got hit by this reading XML files that were encoded as UTF-8. On OS X the content type was being returned as 'application/xml', but on Linux it was set to 'text/xml', so the requests lib assumed its default encoding of 'ISO-8859-1' because 'text' was in the content type. Most XML files are encoded in UTF-8, so setting the encoding to 'ISO-8859-1' for 'text/xml' content is surely wrong, as discussed.

kennethreitz commented 9 years ago

Just because an RFC specifies something, doesn't mean we should do it. Especially if it makes the code crazy.

I believe that our current behavior is elegant and actually works quite effectively. Is this not the case?

As always, what does Chrome do?

Lukasa commented 9 years ago

Chrome introspects the HTML, a position we've always decided we don't want to take. We could optionally add support for hooks to do content-type-specific encoding heuristics if we wanted. We already kinda do that for JSON; it might not hurt to do it more generally for other content types.
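
For the record, the JSON heuristic lives in requests.utils; a tiny sketch with made-up payloads (Response.json() falls back to it when r.encoding is None):

from requests.utils import guess_json_utf

print(guess_json_utf(b'{"key": "value"}'))                   # 'utf-8'
print(guess_json_utf('{"key": "value"}'.encode("utf-16")))   # 'utf-16' (BOM present)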

kennethreitz commented 9 years ago

Grumble.

gsnedders commented 9 years ago

Note the HTML case is even worse than that, really, because the pre-scan in browsers just looks at the first 1024 bytes or thereabouts, and the parser itself can then change it.

kennethreitz commented 9 years ago

I still feel like our current behavior is a great implementation.

Lukasa commented 9 years ago

Ok. =)

How about I knock up a proposal for smarter encoding heuristics that take certain known content types and attempt to gently introspect them for their encodings? At least then we'll have something concrete to discuss.

est commented 8 years ago

@Lukasa @kennethreitz Hey guys, for the time being I don't think we have an obvious solution yet, but can we at least make this ISO-8859-1 fallback optional?

    if 'text' in content_type:
        return 'ISO-8859-1'

This looks way too brutal. A parameter like Session(auto_encoding=False) would be nice. What do you guys think?

Lukasa commented 8 years ago

@est ISO-8859-1 is optional, you can simply set response.encoding yourself. It's a one-line change. =)
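
Something like this, with example.com standing in for a real URL:

import requests

r = requests.get("https://example.com/")   # placeholder URL
r.encoding = "utf-8"                       # override the ISO-8859-1 fallback before touching r.text
print(r.text[:60])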

gsnedders commented 8 years ago

@Lukasa But you can't determine whether the initial response.encoding came from the Content-Type header or whether it's the ISO-8859-1 fallback, which means if you want to avoid the fallback you have to start parsing the Content-Type header yourself, and that's quite a lot of extra complexity all of a sudden.

Lukasa commented 8 years ago

@gsnedders Sure you can. if 'charset' in response.headers['Content-Type'].

gsnedders commented 8 years ago

While yes, that works under the assumption the other party is doing something sane, it doesn't work in the face of madness and something like text/html; foo=charset.

Lukasa commented 8 years ago

@gsnedders Try if 'charset=' in response.headers['Content-Type']. At this point we're out of 'crazy' and into 'invalid formatting'.

gsnedders commented 8 years ago

@Lukasa uh, I thought there was whitespace (or rather what 2616 had as implicit *LWS) allowed around the equals sign; apparently not.

The grammar appears to be:

     media-type = type "/" subtype *( OWS ";" OWS parameter )
     type       = token
     subtype    = token

   The type/subtype MAY be followed by parameters in the form of
   name=value pairs.

     parameter      = token "=" ( token / quoted-string )

So I guess the only issue here is something like text/html; foo="charset=bar".
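
For what it's worth, a proper Content-Type parser sidesteps that case; a rough illustration with the stdlib rather than substring matching:

import cgi

ctype, params = cgi.parse_header('text/html; foo="charset=bar"')
print(ctype, params.get("charset"))    # text/html None -- the quoted value doesn't register as a charset

ctype, params = cgi.parse_header("text/html; charset=ISO-8859-1")
print(ctype, params.get("charset"))    # text/html ISO-8859-1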

FWIW, html5lib's API changes are going to make the correct behaviour with requests require something like:

import html5lib
import requests

r = requests.get('https://api.github.com/events')
e = r.encoding if 'charset=' in r.headers['Content-Type'] else None
tree = html5lib.parse(r.content, transport_encoding=e)

Lukasa commented 8 years ago

That seems reasonable. =)

FWIW, in requests 3.0.0 we'll reconsider this, or at least consider adding some way of working out what decision we made.

dentarg commented 8 years ago

Interesting discussion.

I still feel like our current behavior is a great implementation.

If I may, a counterexample. I use requests to extract the <title> from a requested document, and here is facebook.com.

>>> import requests
>>> r = requests.get("https://www.facebook.com/mathiassundin/posts/10153748227675479")
>>> r.encoding
'ISO-8859-1'
>>> import re
>>> m = re.search('<title[^>]*>\s*(.+?)\s*<\/title>', r.text, re.IGNORECASE|re.MULTILINE)
>>> m.group(1)
u'Mathias Sundin - M\xc3\xa5nga roliga samtal mellan Tony Blair och... | Facebook'
>>> r.headers['Content-Type']
'text/html'

Apparently they don't specify the encoding in their headers. But they do in the HTML (<meta charset="utf-8" />, full example at https://gist.github.com/dentarg/f13ef7cc0ce55753faf6).

As mentioned in #2161, "requests aren't a HTML library", and I can understand that point of view. Perhaps I should be looking into parsing the HTML with some library that also can take the specified encoding into consideration.

Thank you for your work on requests. I'm looking forward to 3.0.0.

Lukasa commented 8 years ago

@dentarg That's not really a counter example: it's an example of why this system works.

The guidance from the IETF is that for all text/* encodings, one of the following things MUST be true:

Requests' default behaviour here works well: in the case of XML and HTML, ISO-8859-1 is guaranteed to safely decode the text well enough to let you see the <meta> tag. Anyone working with HTML really should go looking for that tag, because servers almost never correctly advertise the content type of the HTML they deliver, but the HTML itself is usually (though sadly not always) right.

The behaviour requests has right now is probably less prone to failure with HTML than the one proposed for 3.0.0, and part of me does wonder if we should try to solve this more by documentation than by code change.
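
For anyone who wants to follow that advice, a rough sketch using the helper in requests.utils (the URL is just a stand-in for any text/html response without a header charset):

import requests
from requests.utils import get_encodings_from_content

r = requests.get("https://www.facebook.com/")   # stand-in URL
if "charset=" not in r.headers.get("Content-Type", ""):
    # the ISO-8859-1 fallback decodes losslessly, so the ASCII <meta charset=...> tag is still findable
    found = get_encodings_from_content(r.text)
    if found:
        r.encoding = found[0]
print(r.encoding)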

dentarg commented 8 years ago

Yes, perhaps documentation is the way forward.

I can share my experience. I'm a very new user of requests and haven't read all of the documentation, but I have looked at some of it. The requests website has this in the code snippet on the front page:

>>> r.encoding
'utf-8'

That information, combined with me somehow (maybe from browsing issues here on GitHub) finding out that requests uses chardet, gave me the impression that requests would solve the charset/encoding problem for me "all the way".

Once I found the right documentation, it was straightforward to work around the limitations with another library. I can fully understand that requests just wants to be the HTTP parser, not the HTML parser.

All that said, let me share some short examples where I think the ISO-8859-1 fallback might cause some unexpected behavior.

>>> import requests
>>> r = requests.get("https://www.facebook.com/mathiassundin/posts/10153748227675479")
>>> from bs4 import BeautifulSoup

You can't use r.text:

>>> print BeautifulSoup(r.text, "html5lib", from_encoding="").title.text
Mathias Sundin - MÃ¥nga roliga samtal mellan Tony Blair och... | Facebook
>>> print BeautifulSoup(r.text, "html5lib", from_encoding=r.encoding).title.text
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/site-packages/bs4/__init__.py", line 215, in __init__
    self._feed()
  File "/usr/local/lib/python2.7/site-packages/bs4/__init__.py", line 239, in _feed
    self.builder.feed(self.markup)
  File "/usr/local/lib/python2.7/site-packages/bs4/builder/_html5lib.py", line 50, in feed
    doc = parser.parse(markup, encoding=self.user_specified_encoding)
  File "/usr/local/lib/python2.7/site-packages/html5lib/html5parser.py", line 236, in parse
    parseMeta=parseMeta, useChardet=useChardet)
  File "/usr/local/lib/python2.7/site-packages/html5lib/html5parser.py", line 89, in _parse
    parser=self, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/html5lib/tokenizer.py", line 40, in __init__
    self.stream = HTMLInputStream(stream, encoding, parseMeta, useChardet)
  File "/usr/local/lib/python2.7/site-packages/html5lib/inputstream.py", line 144, in HTMLInputStream
    raise TypeError("Cannot explicitly set an encoding with a unicode string")
TypeError: Cannot explicitly set an encoding with a unicode string

You need to use r.content, but if requests have fallen back to ISO-8859-1, r.encoding will cause trouble:

>>> print BeautifulSoup(r.content, "html5lib", from_encoding=r.encoding).title.text
Mathias Sundin - MÃ¥nga roliga samtal mellan Tony Blair och... | Facebook

Working as intended:

>>> print BeautifulSoup(r.content, "html5lib", from_encoding="").title.text
Mathias Sundin - Många roliga samtal mellan Tony Blair och... | Facebook
>>> print BeautifulSoup(r.content, "html5lib", from_encoding="utf-8").title.text
Mathias Sundin - Många roliga samtal mellan Tony Blair och... | Facebook

Lukasa commented 8 years ago

So here we hit a problem, which is that we can't really document the way Requests interacts with every possible tool you may want to use it with: there are just too many possibilities! So instead we try to provide general documentation.

The best advice I can give is that the more specific the tool, the more likely it can handle bytes as input and DTRT with them. BeautifulSoup is an HTML tool and so can almost certainly take a stream of arbitrary bytes and find the relevant meta tag (and indeed, it can), so you can just pass it r.content. The same would be true if you were passing it to an XML library, or to a JSON library, or to anything else like that.

gsnedders commented 8 years ago

Requests' default behaviour here works well: in the case of XML and HTML, ISO-8859-1 is guaranteed to safely decode the text well enough to let you see the <meta> tag.

Sadly, that isn't true, because of UTF-16. In both cases (HTML and XML), consumers should be able to handle BOMs. Furthermore, a conforming XML parser should be able to handle a UTF-16 encoded <?xml version="1.0" encoding="UTF-16"?> with no preceding BOM.

Lukasa commented 8 years ago

@gsnedders I'm not sure what your point is. ISO-8859-1 will decode incorrectly, but you can still search for the initial <meta> tag. It's not trivial but it is not enormously difficult either.

gsnedders commented 8 years ago

@Lukasa My point is that you're not necessarily just looking for a <meta> tag, you could also be wanting to see if the opening bytes are 00 3C 00 3F (which obviously you can do as they'll just be U+0000 U+003C U+0000 U+003F), and then start decoding as UTF-16 until you find the inline encoding declaration. There's a lot more complexity (esp. in the XML case) than just looking for one ASCII string. ISO-8859-1 isn't that bad insofar as it causes no data loss, but it doesn't really make searching any easier.
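
To make that concrete, a rough (and deliberately non-conforming) sketch of the extra checks being described; the function name is made up:

import codecs

def sniff_xml_encoding(data):
    # BOMs first, then the BOM-less UTF-16 '<?' patterns described above
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    if data.startswith(b"\x00<\x00?"):
        return "utf-16-be"   # 00 3C 00 3F: '<?' with no BOM, big-endian
    if data.startswith(b"<\x00?\x00"):
        return "utf-16-le"   # '<?' with no BOM, little-endian
    return None              # fall through to the XML declaration / other heuristics

print(sniff_xml_encoding('<?xml version="1.0" encoding="UTF-16"?>'.encode("utf-16-be")))   # utf-16-be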

Lukasa commented 8 years ago

So agreed that there is more complexity here, but I don't believe that requests can meaningfully take that complexity on. The spectrum of options in this situation is:

  1. Requests chooses ISO-8859-1 for .text, ensuring that accessing .text doesn't explode. This may well be wrong, but it's fast and ensures that the user can progress.
  2. Requests makes no decision, falling back on chardet. Chardet can be quite slow, which slows down the initial load of .text. Additionally, it can get it wrong, meaning that sometimes .text will randomly throw UnicodeDecodeError when accessed incautiously.
  3. Requests explicitly chooses no encoding and throws its own special exception when accessing .text.

I'd argue that of these three options, option 1 is the least user hostile. Essentially, we promise that accessing .text will never cause random exceptions for HTML or XML. We can argue over whether .text should exist at all, but it does, and in a world where it does I'd rather that everything doesn't randomly explode on users simply because they were incautious.

jvanasco commented 7 years ago

I've had more than a handful of problems from UTF-8 content being read as ISO-8859-1. I've handled this for now by setting requests.Session().hooks to adjust the response.encoding with a custom def that analyzes the response. It works, but I really dislike the hooks system, as it's not well documented and consequently feels "fragile".

Something that I would like to propose for 3.0.0 is a variation of the current behavior (#1) which can be overridden by end-users/consumers/developers with a dedicated hook into the session, like how SessionRedirectMixin handles this functionality with resolve_redirects. This would allow requests to stay purely HTTP and backwards compatible, but allow users to customize their usage (e.g. based on HTML, likelihood of a language on an IP block, etc.). [A much more complicated way would be allowing the session to have custom HTTPAdapters.]

Edit: I forgot to mention-

I also found it very beneficial (at least for debugging) to "tier" how the encoding is accessed. We basically use this logic:

r._encoding_fallback = 'ISO-8859-1'
# modified version, returns `None` if no charset available
r._encoding_headers = get_encoding_from_headers(r.headers)
r._encoding_content = None
if not r._encoding_headers and r.content:
    # html5 spec requires a meta-charset in the first 1024 bytes
    r._encoding_content = get_encodings_from_content(r.content[:1024])
if r._encoding_content:
    r.encoding = r._encoding_content[0]  # it's a list
else:
    r.encoding = r._encoding_headers or r._encoding_fallback

I think a similar approach would be good for requests. That would allow users to quickly inspect how the encoding was derived when troubleshooting.
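
For anyone wanting to try the same thing, a rough sketch of wiring similar logic into a response hook (the helper comes from requests.utils; "utf-8" as the final fallback is an arbitrary choice, and this assumes stream=False):

import requests
from requests.utils import get_encodings_from_content

def pick_encoding(response, **kwargs):
    if "charset=" in response.headers.get("Content-Type", ""):
        return response   # trust an explicit header charset
    # HTML5 expects a meta charset declaration within the first 1024 bytes
    head = response.content[:1024].decode("ascii", "replace")
    found = get_encodings_from_content(head)
    response.encoding = found[0] if found else "utf-8"
    return response

session = requests.Session()
session.hooks["response"].append(pick_encoding)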

candlerb commented 6 years ago

The spectrum of options in this situation is:

  1. Requests chooses ISO-8859-1 for .text, ensuring that accessing .text doesn't explode. This may well be wrong, but it's fast and ensures that the user can progress.
  2. Requests makes no decision, falling back on chardet. Chardet can be quite slow, which slows down the initial load of .text. Additionally, it can get it wrong, meaning that sometimes .text will randomly throw UnicodeDecodeError when accessed incautiously.
  3. Requests explicitly chooses no encoding and throws its own special exception when accessing .text.
  4. Requests chooses UTF-8 with surrogateescape (PEP 383), which also ensures that .text doesn't explode (and is reversible):

>>> b'123\xff'.decode("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 3: invalid start byte

>>> b'123\xff'.decode("utf-8", "surrogateescape")
'123\udcff'

>>> b'123\xff'.decode("utf-8", "surrogateescape").encode("utf-8", "surrogateescape")
b'123\xff'

According to the PEP it was introduced in Python 3.1, although it's not mentioned in the 3.1 whatsnew. The 3.2 whatsnew mentions in passing that tarfile uses it.

candlerb commented 6 years ago

Aside: this might be less important if the user had more control. At the moment, you can set encoding, but the errors value is hard-coded to "replace". Some possible ideas:

r = requests.post(...)
r.encoding = "utf-8"
r.errors = "surrogateescape"
print(r.text)

r = requests.post(...)
r.encoding = ("utf-8", "surrogateescape")
print(r.text)

r = requests.post(...)
print(r.text("utf-8", "surrogateescape"))

However, I suppose you can already just do r.content.decode("utf-8","surrogateescape")

jvanasco commented 6 years ago

@candlerb Your suggestions happen after the request has been processed, and detecting the correct (better) option is important to do as soon as possible, because there are scriptable elements of requests (like custom subclasses, hooks, and the kwargs hooks system) that let developers have much more control over the processing. Those all need to have the 'most correct' encoding calculated before they run, which would essentially be r = requests.post(<in here, somewhere>).

Jumping back to the discussion on Jan 9, 2016 between @dentarg and @Lukasa, I think the root issue is that ISO-8859-1 is "western latin", and while we're all using that to write English, HTML tags and HTTP headers, a sizeable (in fact larger) portion of the world doesn't use our alphabet.

When we're consuming Western content, this approach is fine, but when you're looking at truly global content it is a complete pain. Our indexers were getting tripped up a lot by websites in Asia and the Middle East that don't identify a charset in the headers, and opt for identifying it with a content tag instead.

We've had good luck in production with the ability to set a custom encoding_fallback per request/session, and then analyze everything within a custom hook (as above). If detection fails, we try to geolocate the IP and adjust the fallback charset. It has been working well. There are just certain times where you know the fallback will not be ISO-8859-1.

I'm 100% sure there is a 10x better method, but having pulled well over 1BN+ pages with requests, many of which are non-Western, I think it's the best approach on this page so far.