libo26 / feedparser

Automatically exported from code.google.com/p/feedparser

'apos' character entity not handled properly #286

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
It looks like when Twitter generates an RSS feed, it double-escapes certain 
special characters in the <description /> field. For example, let's say I tweet:

    I can't parse this!

Which is actually

I can&apos;t parse this!

in HTML entities.

Then when you look at the bare XML from Twitter's RSS or Atom feed, it is 
rendered thusly:

I can&amp;apos;t parse this!

Universal Feed Parser appears to have some serious problems with this. When you 
parse out one of the entries and look at how it parses this, you end up with:

 I can&amp;apost parse this!

which renders on the screen as

I can&apost parse this!

Any ideas how I can get this to behave? When I open the feed in Firefox, the 
entities are handled correctly, so clearly it's possible to parse the string 
correctly.
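
The double escaping described above can be reproduced with the standard library alone. This is a minimal sketch, assuming Python 3 and `html.unescape` (the report itself concerns feedparser 5.0.1 on Python 2.7.1); the helper name `unescape_twice` is hypothetical and is not part of feedparser.

```python
import html


def unescape_twice(text):
    """Undo two layers of entity escaping.

    Hypothetical helper for illustration only; not feedparser's API.
    """
    once = html.unescape(text)    # first pass:  &amp;apos; -> &apos;
    twice = html.unescape(once)   # second pass: &apos;     -> '
    return once, twice


# The text as it appears in the bare XML of the feed.
raw = "I can&amp;apos;t parse this!"
once, twice = unescape_twice(raw)
print(once)
print(twice)
```

The first pass corresponds to what an XML parser does when it reads the element text; the second pass is what a browser does when it treats the result as HTML.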

Original issue reported on code.google.com by jordanth...@gmail.com on 17 Jun 2011 at 5:12

GoogleCodeExporter commented 9 years ago
Note that this is not a duplicate of Issue #66: in that case the data was 
enclosed in a CDATA section, whereas here it isn't.

Original comment by jordanth...@gmail.com on 17 Jun 2011 at 6:26

GoogleCodeExporter commented 9 years ago
I glanced at Twitter but I'm not seeing this behavior. Got a link?

Original comment by kurtmckee on 18 Jun 2011 at 10:02

GoogleCodeExporter commented 9 years ago
Here you go, made just for you!
http://search.twitter.com/search.rss?q=%23feedparser

And you can always do it by copying this

can't 

and pasting it into a tweet.

Thanks!

Original comment by jordanth...@gmail.com on 19 Jun 2011 at 6:33

GoogleCodeExporter commented 9 years ago
Looks like you can create any tweet containing an apostrophe, and feedparser 
behaves in this manner. It might very well be true for everything that Twitter 
escapes in this manner. feedparser knows that if it sees, say

&lt;a href="something"&gt;link&lt;/a&gt;

It should "unescape" it to 

<a href="something">link</a>

But it *doesn't* seem to understand that ampersands *also* need to be unescaped.

According to feedparser.__version__ this is 5.0.1 and I'm using this in Python 
2.7.1

Original comment by jordanth...@gmail.com on 19 Jun 2011 at 6:40

GoogleCodeExporter commented 9 years ago
Here is the code to replicate the error:
>>> import feedparser
>>> feed = feedparser.parse('http://search.twitter.com/search.rss?q=%23feedparser')
>>> entry = feed.entries[0]
>>> print entry.summary
Testing <a href="http://search.twitter.com/search?q=%23feedparser">#feedparser</a>; checking whether it can&apost parse this.

Should be

Testing <a href="http://search.twitter.com/search?q=%23feedparser">#feedparser</a>; checking whether it can't parse this.

Original comment by jordanth...@gmail.com on 19 Jun 2011 at 6:42

GoogleCodeExporter commented 9 years ago
Thanks for the link; user feeds apparently don't have this problem, so if you 
compare your tweet between your user feed and the search feed you'll see why I 
was scratching my head! :)

My first guess is that this may be stemming from Python's `htmlentitydefs` 
module not defining 'apos' as an entity, which might be why 'amp' is being 
re-escaped. I'll look into it further when I have an opportunity.
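
The guess about the missing entity definition can be checked directly. This is a rough sketch, assuming Python 3, where the `htmlentitydefs` module is named `html.entities`; it only inspects the standard entity tables and says nothing about how feedparser itself uses them.

```python
from html.entities import html5, name2codepoint

# name2codepoint covers the HTML 4 named entities. 'apos' comes from
# XML/XHTML rather than HTML 4, so it is absent from this table.
has_amp = "amp" in name2codepoint
has_apos = "apos" in name2codepoint

# The separate HTML5 table does define it; its keys include the
# trailing semicolon.
apos_html5 = html5.get("apos;")

print(has_amp, has_apos, repr(apos_html5))
```

Code that decides whether an entity reference is valid by looking it up in `name2codepoint` alone would therefore treat `&apos;` as unknown.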

Original comment by kurtmckee on 19 Jun 2011 at 9:03

GoogleCodeExporter commented 9 years ago
I should have uploaded the problem feed when I had the opportunity: the feed 
linked to above no longer has any entries. :( Would you please re-tweet to 
demonstrate the bug again? Thanks!

Original comment by kurtmckee on 2 Sep 2011 at 6:25

GoogleCodeExporter commented 9 years ago
There you go! Use it or lose it. :)

Original comment by jordanth...@gmail.com on 2 Sep 2011 at 3:09

GoogleCodeExporter commented 9 years ago
It appears Twitter updated their feed templates; I'm not seeing this behavior 
in the search feed linked above.

Original comment by kurtmckee on 2 Sep 2011 at 4:31

GoogleCodeExporter commented 9 years ago
Oh dear.

Original comment by jordanth...@gmail.com on 2 Sep 2011 at 6:38

GoogleCodeExporter commented 9 years ago
Fixed in r594.

Original comment by kurtmckee on 13 Sep 2011 at 5:29