libo26 / feedparser

Automatically exported from code.google.com/p/feedparser

Title tag is not correctly parsed for feeds from TechCrunch, NewTeeVee, etc. #219

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. import feedparser
f=feedparser.parse('http://feedproxy.google.com/TechCrunch')
for e in f.entries:
  print e.title
ya
jason
240px-No-fanboys.svg
...

The titles for all the entries are wrong, e.g. for the first entry it should be 
'Yammer 3.0 For iPhone: Now With 100% Fewer Crashes', not 'ya'. 

I've pasted the relevant feed entry below, obtained via 
urllib.urlopen(feed).read().  The problem seems to be that feedparser is 
mistaking the title attribute of an image within the entry for the entry title.

RELEVANT FEED ENTRY:

<item>\r\n\t\t<title>Yammer 3.0 For iPhone: Now With 100% Fewer 
Crashes</title>\r\n\t\t<link>http://feedproxy.google.com/~r/Techcrunch/~3/vwM866
Q1ZhQ/</link>\r\n\t\t<comments>http://techcrunch.com/2010/06/16/yammer-iphone/#c
omments</comments>\r\n\t\t<pubDate>Wed, 16 Jun 2010 20:58:54 
+0000</pubDate>\r\n\t\t<dc:creator>MG 
Siegler</dc:creator>\r\n\t\t\t\t<category><![CDATA[TC]]></category>\r\n\t\t<cate
gory><![CDATA[iPhone]]></category>\r\n\t\t<category><![CDATA[Yammer]]></category
>\r\n\r\n\t\t<guid 
isPermaLink="false">http://techcrunch.com/?p=190091</guid>\r\n\t\t<description><
![CDATA[<img class="alignright size-full wp-image-190095" title="ya" 
src="http://tctechcrunch.files.wordpress.com/2010/06/ya.png" alt="" />At the 
risk of pissing off our new office <a 
href="http://techcrunch.com/2010/06/03/goodbye-palo-alto-techcrunch-moves-to-san
-francisco/">neighbors</a>, I have a confession to make: I loathed the Yammer 
iPhone app. Don\'t get me wrong, I love <a href="http://yammer.com">Yammer</a>, 
and find it absolutely vital to our work. But the app was easily the least 
stable of the dozens of apps I have \xc2\xa0on my iPhone. It was so bad, in 
fact, that I\'ve been accessing Yammer through mobile Safari in recent weeks. 
But that\'s why I\'m happy to announce that today, with the launch of the 
latest version of the app, 3.0, my nightmare is over.\n\nAs they note in the 
App Store description, Yammer 3.0 for iPhone is a complete re-write of the app. 
It promises to fix "many crashes," load "much faster," and even work on the 
upcoming iPhone 4. A quick run through confirms all of those things. The app\'s 
UI has also been overhauled and is much more pleasing to look at now (and is 
actually simplified). This looks to be an all-around win.<img alt="" border="0" 
src="http://stats.wordpress.com/b.gif?host=techcrunch.com&blog=11718616&post=190
091&subd=tctechcrunch&ref=&feed=1" 
/>]]></description>\r\n\t\t\t<content:encoded><![CDATA[<p><img 
class="alignright size-full wp-image-190095" title="ya" 
src="http://tctechcrunch.files.wordpress.com/2010/06/ya.png?w=280&#038;h=420" 
alt="" width="280" height="420">At the risk of pissing off our new office <a 
href="http://techcrunch.com/2010/06/03/goodbye-palo-alto-techcrunch-moves-to-san
-francisco/">neighbors</a>, I have a confession to make: I loathed the Yammer 
iPhone app. Don&#8217;t get me wrong, I love <a 
href="http://yammer.com">Yammer</a>, and find it absolutely vital to our work. 
But the app was easily the least stable of the dozens of apps I have &nbsp;on 
my iPhone. It was so bad, in fact, that I&#8217;ve been accessing Yammer 
through mobile Safari in recent weeks. But that&#8217;s why I&#8217;m happy to 
announce that today, with the launch of the latest version of the app, 3.0, my 
nightmare is over.</p>\n<p>As they note in the App Store description, Yammer 
3.0 for iPhone is a complete re-write of the app. It promises to fix 
&#8220;many crashes,&#8221; load &#8220;much faster,&#8221; and even work on 
the upcoming iPhone 4. A quick run through confirms all of those things. The 
app&#8217;s UI has also been overhauled and is much more pleasing to look at 
now (and is actually simplified). This looks to be an all-around 
win.</p>\n<p>It also brings several smaller features such as: autocomplete for 
@replies in the app, full landscape support, and the ability to mail and call 
contacts right from within the app. This thing just made my job much easier. 
Nice job Yammer, you&#8217;ve earned your way back on to my main screen of 
apps.</p>\n<p><a 
href="http://itunes.apple.com/us/app/yammer/id289559439?mt=8">You can find the 
new Yammer app here</a>. It&#8217;s a free 
download.</p>\n<p><strong>Update</strong>: Fine, I&#8217;ll change the title to 
&#8220;fewer&#8221;.</p>\n<div class="cbw snap_nopreview"><div 
class="cbw_header"><script 
src="http://www.crunchbase.com/javascripts/widget.js" 
type="text/javascript"></script><div class="cbw_header_text"><a 
href="http://www.crunchbase.com/">CrunchBase Information</a></div></div><div 
class="cbw_content"><div class="cbw_subheader"><a 
href="http://www.crunchbase.com/company/yammer">Yammer</a></div><div 
class="cbw_subcontent"><script 
src="http://www.crunchbase.com/cbw/company/yammer.js" 
type="text/javascript"></script></div><div class="cbw_subheader"><a 
href="http://www.crunchbase.com/product/iphone">iPhone</a></div><div 
class="cbw_subcontent"><script 
src="http://www.crunchbase.com/cbw/product/iphone.js" 
type="text/javascript"></script></div><div class="cbw_footer">Information 
provided by <a 
href="http://www.crunchbase.com/">CrunchBase</a></div></div></div>\n<br />  <a 
rel="nofollow" 
href="http://feeds.wordpress.com/1.0/gocomments/tctechcrunch.wordpress.com/19009
1/"><img alt="" border="0" 
src="http://feeds.wordpress.com/1.0/comments/tctechcrunch.wordpress.com/190091/"
 /></a> <a rel="nofollow" 
href="http://feeds.wordpress.com/1.0/godelicious/tctechcrunch.wordpress.com/1900
91/"><img alt="" border="0" 
src="http://feeds.wordpress.com/1.0/delicious/tctechcrunch.wordpress.com/190091/
" /></a> <a rel="nofollow" 
href="http://feeds.wordpress.com/1.0/gostumble/tctechcrunch.wordpress.com/190091
/"><img alt="" border="0" 
src="http://feeds.wordpress.com/1.0/stumble/tctechcrunch.wordpress.com/190091/" 
/></a> <a rel="nofollow" 
href="http://feeds.wordpress.com/1.0/godigg/tctechcrunch.wordpress.com/190091/">
<img alt="" border="0" 
src="http://feeds.wordpress.com/1.0/digg/tctechcrunch.wordpress.com/190091/" 
/></a> <a rel="nofollow" 
href="http://feeds.wordpress.com/1.0/goreddit/tctechcrunch.wordpress.com/190091/
"><img alt="" border="0" 
src="http://feeds.wordpress.com/1.0/reddit/tctechcrunch.wordpress.com/190091/" 
/></a> <img alt="" border="0" 
src="http://stats.wordpress.com/b.gif?host=techcrunch.com&blog=11718616&post=190
091&subd=tctechcrunch&ref=&feed=1" /><p><a 
href="http://pro.tweetmeme.com/share?url=http://techcrunch.com/2010/06/16/yammer
-iphone/&style=compact&source=techcrunch&service=bit.ly&service_api=techcrunch:R
_0381170e330c42dda299f92709e0ef5c"><img 
src="http://pro.tweetmeme.com/imagebutton.gif?url=http://techcrunch.com/2010/06/
16/yammer-iphone/&style=compact&source=techcrunch&service=bit.ly" 
/></a></p>\n<p><a 
href="http://feedads.g.doubleclick.net/~at/g6tfdweNJCoQ2k3W2qcpyGGocHs/0/da"><im
g src="http://feedads.g.doubleclick.net/~at/g6tfdweNJCoQ2k3W2qcpyGGocHs/0/di" 
border="0" ismap="true"></img></a><br/>\n<a 
href="http://feedads.g.doubleclick.net/~at/g6tfdweNJCoQ2k3W2qcpyGGocHs/1/da"><im
g src="http://feedads.g.doubleclick.net/~at/g6tfdweNJCoQ2k3W2qcpyGGocHs/1/di" 
border="0" ismap="true"></img></a></p><div class="feedflare">\n<a 
href="http://feeds.feedburner.com/~ff/Techcrunch?a=vwM866Q1ZhQ:IdSWbW9Q1t8:2mJPE
YqXBVI"><img src="http://feeds.feedburner.com/~ff/Techcrunch?d=2mJPEYqXBVI" 
border="0"></img></a> <a 
href="http://feeds.feedburner.com/~ff/Techcrunch?a=vwM866Q1ZhQ:IdSWbW9Q1t8:7Q72W
NTAKBA"><img src="http://feeds.feedburner.com/~ff/Techcrunch?d=7Q72WNTAKBA" 
border="0"></img></a> <a 
href="http://feeds.feedburner.com/~ff/Techcrunch?a=vwM866Q1ZhQ:IdSWbW9Q1t8:yIl2A
UoC8zA"><img src="http://feeds.feedburner.com/~ff/Techcrunch?d=yIl2AUoC8zA" 
border="0"></img></a> <a 
href="http://feeds.feedburner.com/~ff/Techcrunch?a=vwM866Q1ZhQ:IdSWbW9Q1t8:-BTjW
OF_DHI"><img 
src="http://feeds.feedburner.com/~ff/Techcrunch?i=vwM866Q1ZhQ:IdSWbW9Q1t8:-BTjWO
F_DHI" border="0"></img></a> <a 
href="http://feeds.feedburner.com/~ff/Techcrunch?a=vwM866Q1ZhQ:IdSWbW9Q1t8:D7DqB
2pKExk"><img 
src="http://feeds.feedburner.com/~ff/Techcrunch?i=vwM866Q1ZhQ:IdSWbW9Q1t8:D7DqB2
pKExk" border="0"></img></a> <a 
href="http://feeds.feedburner.com/~ff/Techcrunch?a=vwM866Q1ZhQ:IdSWbW9Q1t8:qj6ID
K7rITs"><img src="http://feeds.feedburner.com/~ff/Techcrunch?d=qj6IDK7rITs" 
border="0"></img></a>\n</div><img 
src="http://feeds.feedburner.com/~r/Techcrunch/~4/vwM866Q1ZhQ" height="1" 
width="1"/>]]></content:encoded>\r\n\t\t\t<wfw:commentRss>http://techcrunch.com/
2010/06/16/yammer-iphone/feed/</wfw:commentRss>\r\n\t\t<slash:comments>0</slash:
comments>\r\n\t\r\n\t\t<media:content 
url="http://1.gravatar.com/avatar/710187cd963df0f92d11ddb31e6ae3db?s=96&amp;d=id
enticon&amp;r=G" medium="image">\r\n\t\t\t<media:title 
type="html">MG</media:title>\r\n\t\t</media:content>\r\n\r\n\t\t<media:content 
url="http://tctechcrunch.files.wordpress.com/2010/06/ya.png" 
medium="image">\r\n\t\t\t<media:title 
type="html">ya</media:title>\r\n\t\t</media:content>\r\n\t<feedburner:origLink>h
ttp://techcrunch.com/2010/06/16/yammer-iphone/</feedburner:origLink></item>
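
To make the suspicion above easier to check, here is a minimal sketch that uses only the standard library (not feedparser) to list every place in a trimmed copy of the pasted entry where a "title" appears. The function name `title_candidates`, the truncated description text, and the Media RSS namespace URI are assumptions for illustration; the thread does not show the feed's root element. Note that in the pasted entry the string "ya" occurs both as the `title` attribute of the `<img>` and as a `<media:title>` element, so this sketch only shows the candidates; it does not establish which one feedparser 4.1 picked up.

```python
import re
import xml.etree.ElementTree as ET

# Trimmed copy of the entry pasted above. The namespace URI is the usual
# Media RSS one, but it is an assumption: the thread omits the <rss> element.
ITEM = """<rss version="2.0" xmlns:media="http://search.yahoo.com/mrss/">
<channel><item>
<title>Yammer 3.0 For iPhone: Now With 100% Fewer Crashes</title>
<description><![CDATA[<img class="alignright size-full wp-image-190095" title="ya" src="http://tctechcrunch.files.wordpress.com/2010/06/ya.png" alt="" />At the risk ...]]></description>
<media:content url="http://tctechcrunch.files.wordpress.com/2010/06/ya.png" medium="image">
<media:title type="html">ya</media:title>
</media:content>
</item></channel></rss>"""

MEDIA_NS = "{http://search.yahoo.com/mrss/}"


def title_candidates(xml_text):
    """Collect every string in the first <item> that could be read as a title."""
    item = ET.fromstring(xml_text).find("channel/item")
    description = item.findtext("description") or ""
    return {
        # The un-namespaced <title> child: the value the reporter expects.
        "item_title": item.findtext("title"),
        # title="..." attributes on <img> tags inside the CDATA description.
        "img_title_attrs": re.findall(r'<img[^>]*\btitle="([^"]*)"', description),
        # <media:title> elements nested anywhere inside the item.
        "media_titles": [el.text for el in item.iter(MEDIA_NS + "title")],
    }


for key, value in sorted(title_candidates(ITEM).items()):
    print("%s: %r" % (key, value))
```

Comparing `e.title` from feedparser against these three values for a few entries should show which source the wrong titles come from.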

What version of the product are you using? On what operating system?
Python 2.6, Mac OS X 10.5.8, latest feedparser.py from trunk (initially was 
using whatever comes with python or whatever I'd downloaded from release)

Original issue reported on code.google.com by amar...@google.com on 16 Jun 2010 at 9:48

GoogleCodeExporter commented 9 years ago
This is what I get:

>>> import feedparser
>>> f=feedparser.parse('http://feedproxy.google.com/TechCrunch')
>>> for e in f.entries:
...     print e.title
... 
Muziic Has Streamed 250 Million Music Videos To Date, But Will It Last?
Guest Post: It’s Game On For Location Based Services
Facebook For iPhone Updated: No iOS 4 Support, No iPad Support, Broken UI
Facebook Movie Poster Announces 500 Million Facebook Users Before Facebook Does
The Best iOS 4-Ready Apps So Far
Foursquare CEO Crowley On Fundraising: “You Don’t Have To Rush Through 
It” (Video)
Appbistro Lands Wildfire For Its Facebook App Market
Scribd’s Decision To Dump Flash Pays Off, User Engagement Triples
The Poor, Pilloried, Tech IPO
A Guide To 3D Display Technology: Its Principles, Methods, And Dangers
Might Threaded Conversations Be Coming To Twitter?
Lijit Proves Search Company Really Means Ad Company – Takes $6 Million Series 
D
Use The iPhone 4′s Gyroscope Right Now — Without The iPhone 4 Or The 
Gyroscope
Pogoplug Updates Android App: Control Your Drives From Your EVO 4G
SGN’s 3D Shooter EXO-Planet Elite Comes To The iPhone
Foursquare Check-In Stickers Coming To A Store Window Near You (Video)
AdMob Deal Breakdown: $530 Million In Stock, $220 Million In Cash
Square Delays Mass Roll-Out, Admits They Began Before Things Were “Fully 
Baked”
SGN Takes Investment From Eric Schmidt’s Tomorrow Ventures
Twitter Tweaks The Fail Whale Based On TechCrunch Commenter Feedback
Lakers Victory Sets Twitter All-Time Record With 3,085 Tweets Per Second
Latest comScore Stats Show Twitter Growth Is Still Strong
OneRiot’s Realtime Search API Now Indexing Facebook Likes And Shared Content
Millennial Media: Apple OS Drops By 33 Percent In May But iPad Impressions Grow 
160 Percent
CrunchGear Reviews Toy Story 3

I get this result using the HEAD version of feedparser as well as feedparser 
4.1. Can you still reproduce this bug?

Original comment by a...@google.com on 20 Jun 2010 at 4:06

GoogleCodeExporter commented 9 years ago
Yes I can still reproduce this bug, see below.

Could the problem be stemming from a library required by feedparser, e.g. 
sgmllib, xml.sax?  For the latter I'm using version 0.8.4.  Not sure how to 
tell for the former.
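
One way to answer the "not sure how to tell" part is to ask each module where it was loaded from and whether it exposes a version. The sketch below is an illustration only: `module_info` is a hypothetical helper, and not every module defines `__version__`, so `None` in that slot just means the attribute is absent. `sgmllib` exists only in the Python 2 standard library, so on Python 3 it is reported as not importable.

```python
import sys


def module_info(name):
    """Return (file, version) for an importable module, or None if the import fails."""
    try:
        __import__(name)
    except ImportError:
        return None
    mod = sys.modules[name]
    # Built-in modules have no __file__; many modules have no __version__.
    return (getattr(mod, "__file__", "(built-in)"), getattr(mod, "__version__", None))


for name in ("feedparser", "sgmllib", "xml.sax"):
    print("%s: %r" % (name, module_info(name)))
```

The file path is usually the more useful half of the result, since it shows whether the module comes from the standard library or from a separately installed package.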

amarpai@amarpai:~$ py
Python 2.6 (r26:66714, Nov  3 2008, 10:57:42) 
[GCC 4.1.0 (SUSE Linux)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import feedparser
>>> feedparser.__version__
'4.1'
>>> f=feedparser.parse('http://feedproxy.google.com/TechCrunch')
>>> for e in f.entries:
...    print e.title
... 
john
jason
leena
gv
leena
leena
leena
robinw
erick
leena
robinw
leena
leena
robinw
steveohear
robinw
Screen shot 2010-06-21 at 10.42.37 PM
michael-arrington
erick
1b
mike-butcher
jason
devin
lo
Screen shot 2010-01-17 at 10.46

Original comment by amar...@google.com on 22 Jun 2010 at 5:34

GoogleCodeExporter commented 9 years ago
I had the same problem with version 4.1, but with version 4.2-pre-294-svn it 
works for me.

Original comment by maz...@gmail.com on 21 Jul 2010 at 3:07

GoogleCodeExporter commented 9 years ago
Please close this bug.

I've tested using svn trunk and the URL provided.

@amarpai: Please download the latest version of feedparser from svn trunk [1]. 
This should fix your problem. If not, you might double-check your environment 
and make sure that an old version of feedparser isn't being imported instead of 
the svn version. If you're having a problem, try running this code:

import feedparser
print feedparser.__file__

It's normal if the filename ends in .pyc instead of .py, but the path is what's 
important.
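
If the path printed above is not the svn copy, it can help to see every copy of feedparser that Python could pick up. The following is a rough sketch under stated assumptions: `find_module_copies` is a hypothetical helper, it only looks for plain files in the directories on `sys.path`, and it ignores zip/egg archives and custom import hooks. The first hit is normally the one that `import feedparser` will load.

```python
import os
import sys


def find_module_copies(module_name, search_paths=None):
    """List files on the search path that could satisfy `import module_name`, in import order."""
    if search_paths is None:
        search_paths = sys.path
    candidates = (
        module_name + ".py",
        module_name + ".pyc",
        os.path.join(module_name, "__init__.py"),
    )
    hits = []
    for directory in search_paths:
        # An empty string on sys.path stands for the current directory.
        directory = directory or os.getcwd()
        for candidate in candidates:
            path = os.path.join(directory, candidate)
            if os.path.isfile(path):
                hits.append(path)
    return hits


for path in find_module_copies("feedparser"):
    print(path)
```

More than one line of output suggests an older copy may be shadowing the one downloaded from trunk.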

[1]: https://feedparser.googlecode.com/svn/trunk/feedparser/feedparser.py

Original comment by kurtmckee on 4 Dec 2010 at 4:44

GoogleCodeExporter commented 9 years ago

Original comment by adewale on 4 Dec 2010 at 10:16