If content-type ends with /html, it's being treated as binary

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?
1.  Content tags usually have a content type that ends with /xml or /text. 
Feedparser handles those correctly. But if the content type ends with /html, 
feedparser considers it to be binary.

What is the expected output? What do you see instead?
Feedparser should handle the content tag correctly if its type ends with /html, 
just as it does when the type ends with /xml or /text. 

The bug is in this function:

    def _isBase64(self, attrsD, contentparams):
        if attrsD.get('mode', '') == 'base64':
            return 1
        if self.contentparams['type'].startswith(u'text/'):
            return 0
        if self.contentparams['type'].endswith(u'+xml'):
            return 0
        if self.contentparams['type'].endswith(u'/xml'):
            return 0
        return 1

Adding these two lines just before the final "return 1" fixes it (patch attached):

        if self.contentparams['type'].endswith(u'/html'):
            return 0

Original issue reported on code.google.com by wal...@ninua.com on 21 Jun 2011 at 1:28

Merged into: #284

Attachments:

0001-Fix-to-feedparser-if-the-type-attribute-ends-with-ht.patch

GoogleCodeExporter commented 9 years ago

The mimetype for HTML is typically "text/html", so I assume that you're seeing 
a different mimetype. Do you have a link to a live service or feed that's 
sending a mimetype that requires this change? Even if not, would you reply back 
with the exact mimetype that's triggering this bug? I'd like to try to 
accommodate your need while revamping the Base64 code to fix some other issues.

Original comment by kurtmckee on 5 Jul 2011 at 12:29

GoogleCodeExporter commented 9 years ago

Ah, that was a long time ago. I don't remember the feed or the mime type it 
used, but I remember that another tool parsed it (it was either the 
feedvalidator or SimplePie), so I had to match that or be blamed for being 
unable to handle feeds that others can. After digging into it, I added the fix 
above. 

I do remember, though, that when I looked at _isBase64() I thought it was odd 
that it treats everything as base64 unless it matches one of the listed 
conditions, rather than defaulting to text (which is more common) unless the 
content is specifically declared as base64. Maybe there is a good reason for 
that, so I made the minimal change that fixed my problem.
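
For comparison, a rough sketch of the opposite default described above: content is treated as text unless it is explicitly declared as base64. The function name is hypothetical, and this is not something feedparser does.

```python
def is_base64_text_default(attrs):
    """Hypothetical alternative: base64 only when mode="base64" is declared."""
    if attrs.get('mode', '') == 'base64':
        return 1
    # Everything else falls through to text, whatever its content type.
    return 0


print(is_base64_text_default({'mode': 'base64'}))
print(is_base64_text_default({'type': 'application/html'}))
print(is_base64_text_default({'type': 'image/png'}))
```

The trade-off in this sketch is that a binary type such as `image/png` with no mode attribute would also be treated as text, which may be why the existing code defaults the other way.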

Original comment by wal...@ninua.com on 5 Jul 2011 at 9:57

GoogleCodeExporter commented 9 years ago

Original comment by kurtmckee on 6 Sep 2011 at 3:19

Changed state: Duplicate
