text/xml with no encoding rejected while 3 other feed validators accept it

GoogleCodeExporter commented 9 years ago

I'm having this issue with this feed and a few others:

http://www.thesource.uk.com/The_Source_Church/News/rss.xml

The feed is a valid feed and its XML starts with:

<?xml version="1.0" encoding="UTF-8"?>

However, when it's fetched, it returns this header (irrelevant headers omitted):

'content-type': 'text/xml'

So Feedparser rejects this feed as invalid because it treats it as if it's 
ASCII when in fact it's UTF-8. This behavior is by design, according to this 
comment in the source code:
"""
    if the HTTP Content-Type is text/xml, text/*+xml, or
    text/xml-external-parsed-entity, the encoding given in the XML prefix
    within the document is ALWAYS IGNORED and only the encoding given in
    the charset parameter of the HTTP Content-Type header should be
    respected, and it defaults to 'us-ascii' if not specified.

    Furthermore, discussion on the atom-syntax mailing list with the
    author of RFC 3023 leads me to the conclusion that any document
    served with a Content-Type of text/* and no charset parameter
    must be treated as us-ascii.  (We now do this.)  And also that it
    must always be flagged as non-well-formed.  (We now do this too.)
"""

I understand that this is by design, but I tested with three feed validators 
and although they gave a warning, they all accepted the feed as valid anyway:

http://validator.w3.org/feed/check.cgi?url=www.thesource.uk.com%2FThe_Source_Church%2FNews%2Frss.xml
http://feedvalidator.org/check.cgi?url=www.thesource.uk.com%2FThe_Source_Church%2FNews%2Frss.xml
http://www.rssboard.org/rss-validator/check.cgi?url=www.thesource.uk.com%2FThe_Source_Church%2FNews%2Frss.xml

So the issue here is that feedparser does not behave in line with these 3 feed 
validators. When I tell a customer that their feed is invalid, I get the 
"but it's valid according to the W3C validator" response. 

PROPOSED SOLUTION:
-------------------

If we relax the rules in the comment above so that the encoding in the XML is 
used when none is specified in the HTTP Content-Type, then that would 
solve the problem. Update as follows:

    if the HTTP Content-Type is text/xml, text/*+xml, or
    text/xml-external-parsed-entity, the encoding given in
    the charset parameter of the HTTP Content-Type header should be
    used, and if missing, then the encoding given in the XML prefix
    within the document is used, and if that's missing as well, then
    we default to 'us-ascii'.

    Furthermore, discussion on the atom-syntax mailing list with the
    author of RFC 3023 leads me to the conclusion that any document
    served with a Content-Type of text/* and no charset parameter
    must be treated as us-ascii (unless an encoding is specified inside
    the XML document). 

This way we continue to respect the rule that we assume ASCII when an encoding 
is not specified anywhere, but if an encoding is specified in HTTP headers or 
in the XML document, then we use that. 

PATCH:
------
The code fix for this is very simple. In this code (in _getCharacterEncoding()):

    elif (http_content_type in text_content_types) or \
         (http_content_type.startswith(u'text/')) and http_content_type.endswith(u'+xml'):
        acceptable_content_type = 1
        true_encoding = http_encoding or u'us-ascii'

update the last line to add [or xml_encoding] as follows:

        true_encoding = http_encoding or xml_encoding or u'us-ascii'
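
To make the difference concrete, here is a minimal standalone sketch of the two fallback orders. It is not feedparser's actual code: the function name and the `relaxed` switch are hypothetical, and it assumes both encodings arrive as strings that are empty when missing.

```python
def resolve_text_xml_encoding(http_encoding, xml_encoding, relaxed=False):
    """Pick the encoding for a response served as text/xml.

    http_encoding: charset from the HTTP Content-Type header ('' if absent).
    xml_encoding:  encoding from the XML declaration ('' if absent).
    relaxed:       hypothetical switch; False mirrors the current line,
                   True mirrors the proposed line.
    """
    if relaxed:
        # Proposed order: HTTP charset, then XML declaration, then us-ascii.
        return http_encoding or xml_encoding or 'us-ascii'
    # Current order: HTTP charset, then us-ascii; the XML declaration is ignored.
    return http_encoding or 'us-ascii'
```

With the feed above (no charset in the header, `UTF-8` in the XML declaration), the current order yields `us-ascii` and the proposed order yields `UTF-8`. When the header does carry a charset, both orders agree.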

Original issue reported on code.google.com by wal...@ninua.com on 30 Aug 2011 at 5:09

GoogleCodeExporter commented 9 years ago

I don't understand in what way feedparser is rejecting the feed. Do you mean 
that feedparser sets the `bozo` bit because of the character encoding override?

Original comment by kurtmckee on 30 Aug 2011 at 6:52

GoogleCodeExporter commented 9 years ago

Yes, it sets the bozo flag, and the bozo exception returned is:

CharacterEncodingOverride(u'document declared as us-ascii, but parsed as utf-8',)

Original comment by wal...@ninua.com on 30 Aug 2011 at 9:02

GoogleCodeExporter commented 9 years ago

That's the behavior I'm seeing as well. I also see that feedparser parses the 
feed.

CharacterEncodingOverride is a feedparser "exception" that simply serves as a 
way for developers to see what's going on with the feed they're parsing, 
particularly if they're seeing problems with the output. It's a subclass of 
`feedparser.ThingsNobodyCaresAboutButMe`, and you can ignore the "exception" in 
your code.
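
As a rough illustration of ignoring it, the sketch below treats a result as failed only when `bozo` is set for some other reason. The helper name `parse_failed` is made up for this example, and it compares class names only so that the snippet runs without importing feedparser; in real code `isinstance(exc, feedparser.CharacterEncodingOverride)` would be the more natural check.

```python
# Exception class names that are informational only (an assumption for this
# sketch; extend or shrink the tuple to suit your own tolerance).
IGNORABLE = ("CharacterEncodingOverride",)

def parse_failed(result):
    """Return True when bozo is set for a reason other than an ignorable one.

    `result` is expected to behave like the dict returned by
    feedparser.parse(), with 'bozo' and 'bozo_exception' keys.
    """
    if not result.get("bozo"):
        return False
    exc = result.get("bozo_exception")
    return type(exc).__name__ not in IGNORABLE
```

Typical use would be `parse_failed(feedparser.parse(url))`.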

As this is expected behavior that follows the specifications noted in the 
comment in `_getCharacterEncoding()`, I'm going to close this issue.

Original comment by kurtmckee on 30 Aug 2011 at 3:06

Changed state: WontFix

GoogleCodeExporter commented 9 years ago

Interesting. I always assumed that if the bozo flag is set, it meant that 
feedparser couldn't parse the feed. I don't recall any mention in the 
documentation of any other way to tell if the feed is parsed or not. Can you 
please clarify how you check that the feed is parsed?

Original comment by wal...@ninua.com on 30 Aug 2011 at 7:55

GoogleCodeExporter commented 9 years ago

I don't have a good recommendation; feedparser doesn't check to see if it's 
actually parsing a feed, it merely extracts data from the XML document it's 
given. As an example, a wellformed XHTML document will be parsed without 
errors, but the `feed` and `entries` attributes will be empty (assuming that 
there weren't any recognizable XML elements that feedparser was looking for).
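
One way to act on that, offered as a sketch and not a recommendation: treat the result as a feed only if either attribute came back non-empty. The helper name `has_feed_data` is hypothetical, and it assumes the result supports dict-style `.get()`, as the object returned by `feedparser.parse()` does.

```python
def has_feed_data(result):
    """Return True if the parse result carries any feed-level or entry data."""
    # An XHTML page with no recognizable feed elements leaves both empty.
    return bool(result.get("feed")) or bool(result.get("entries"))
```

Note that this says nothing about whether the data is complete or correct, only that feedparser extracted something.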

If you're trying to figure out if the URL a user entered is actually a feed, 
you might sniff the first 512 bytes (or some other arbitrary number), which is 
what Firefox did last time I checked.
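
A minimal sketch of that kind of sniffing follows. It is not Firefox's algorithm: the function name, the regular expression, and the set of root elements it looks for (`rss`, `feed`, `rdf:RDF`) are assumptions for illustration, and it only works on ASCII-compatible encodings such as UTF-8.

```python
import re

# Opening tag of a common feed root element, followed by whitespace or '>'
# so that names like <feedback> do not match.
FEED_ROOT = re.compile(rb"<(rss|feed|rdf:RDF)[\s>]", re.IGNORECASE)

def looks_like_feed(data, limit=512):
    """Return True if the first `limit` bytes contain a feed root element."""
    return FEED_ROOT.search(data[:limit]) is not None
```

A document with a long prolog or many comments before the root element would slip past the 512-byte window, so the limit is a trade-off, not a guarantee.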

Original comment by kurtmckee on 31 Aug 2011 at 5:00
