[ 1572566 ] Bug when parsing a feed with both title and dc:title

GoogleCodeExporter commented 9 years ago

There is a bug when parsing a feed that has both title and dc:title.
If dc:title appears after title in a feed, the value of dc:title
replaces the title's value. But the title's value is the information
we want, so when we call feed.get("title","") to get the title, we get
the value of dc:title instead.
For example, when parsing "http://ajaxcn.org/exec/rss?snip=start",
feed.get("title","") returns "start", but what we want is "Ajax中国":
<channel>
<title>Ajax中国</title>
<link>http://ajaxcn.org/space/start</link>
<description>Ajax lead the way!</description>
<dc:creator>dlee</dc:creator>
<dc:type>Text</dc:type>
<dc:title>start</dc:title>
<dc:identifier>http://ajaxcn.org/space/start</dc:identifier>
<dc:date>2006-08-26T14:41:05+08:00</dc:date>
<dc:language>zh</dc:language>
<!--
<blogChannel:changes>http://www.weblogs.com/rssUpdates/changes.xml</changes>
-->
<admin:generatorAgent rdf:resource="http://www.snipsnap.org/space/version-1.0b3-uttoxeter" />
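
For comparison, here is a minimal sketch of the values one would expect when the two elements are looked up by namespace rather than by local name. It uses only the Python standard library, not feedparser. The trimmed feed string and the name read_titles are made up for illustration, and the dc prefix is assumed to be bound to the usual Dublin Core namespace URI, since the excerpt above does not show the namespace declarations.

```python
import xml.etree.ElementTree as ET

# Assumption: dc is bound to the standard Dublin Core elements namespace.
DC_NS = "http://purl.org/dc/elements/1.1/"

# Trimmed, hypothetical version of the channel from the report.
FEED = """<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
<channel>
<title>Ajax中国</title>
<link>http://ajaxcn.org/space/start</link>
<dc:title>start</dc:title>
</channel>
</rss>"""


def read_titles(xml_text):
    """Return (plain title, dc:title) for the first channel in xml_text."""
    channel = ET.fromstring(xml_text).find("channel")
    # A bare tag name only matches elements that are in no namespace.
    plain = channel.findtext("title")
    # The {uri}name form matches the namespaced dc:title element.
    dublin_core = channel.findtext("{%s}title" % DC_NS)
    return plain, dublin_core


plain_title, dc_title = read_titles(FEED)
print(plain_title, dc_title)
```

The point is only that title and dc:title are distinct elements, so a parser can keep both values apart instead of letting the later one win.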

Original issue reported on code.google.com by pilg...@gmail.com on 19 Apr 2007 at 5:14

Merged into: #76

Attachments:

rss.xml

GoogleCodeExporter commented 9 years ago

I've come across a similar issue. A media:title value will overwrite the title value.
See attached bug.xml feed to reproduce.

FYI, issue #61 is also the same as this one.

I've attached a patch that sets the _start_dc_title and _start_media_title handlers
to a new method, _start_title_low_pri, which attempts to set the title only if it
hasn't been set yet.
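
The idea can be sketched outside feedparser as follows. This is not the attached patch and not feedparser's code: it is a standalone illustration built on the standard library's xml.sax, and the names TitleHandler, LOW_PRIORITY and parse_title are hypothetical. It also tracks a single title for the whole document, which is simpler than a real feed with a channel and items. The namespace URIs are the usual Dublin Core and Media RSS ones, which is an assumption about the feeds involved.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, feature_namespaces

# Assumed namespace URIs for the dc: and media: prefixes.
DC_NS = "http://purl.org/dc/elements/1.1/"
MEDIA_NS = "http://search.yahoo.com/mrss/"

# Titles that may only fill in a missing title, never replace one.
LOW_PRIORITY = {(DC_NS, "title"), (MEDIA_NS, "title")}


class TitleHandler(ContentHandler):
    def __init__(self):
        super().__init__()
        self.title = None
        self._current = None  # (namespace, localname) of the title being read
        self._buffer = []

    def startElementNS(self, name, qname, attrs):
        if name[1] == "title":
            self._current = name
            self._buffer = []

    def characters(self, content):
        if self._current is not None:
            self._buffer.append(content)

    def endElementNS(self, name, qname):
        if name != self._current:
            return
        value = "".join(self._buffer)
        # A plain title always wins; a low-priority one is used
        # only when no title has been stored yet.
        if name not in LOW_PRIORITY or self.title is None:
            self.title = value
        self._current = None


def parse_title(xml_text):
    handler = TitleHandler()
    parser = xml.sax.make_parser()
    parser.setFeature(feature_namespaces, True)
    parser.setContentHandler(handler)
    parser.parse(io.BytesIO(xml_text.encode("utf-8")))
    return handler.title
```

With this rule a dc:title that follows title leaves the stored title alone, while a feed that only has dc:title still ends up with a title.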

The patch works with the attached feed, fixes the original issue described in this
report, and also passes the feedparser-tests suite (well, except for the 2 encoding
tests that fail even with the included 4.1 release).

Thanks for your work, Mark!

Original comment by j...@codefork.com on 28 Oct 2007 at 7:42

Attachments:

GoogleCodeExporter commented 9 years ago

Here's a much smaller (non-validating) test case:

<rss version="2.0">
<channel>
<item>
<title>Test</title>
<media:title type="plain">Test</media:title>
</item>
</channel>
</rss>

Unpatched, the title ends up a garbled mess:

>>> import feedparser
>>> parsed = feedparser.parse('http://localhost/test.xml')
>>> print parsed['items'][0]['title']
Mï¿½-

The patch above fixes it, thanks!
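
A side note on the test case: it never declares the media prefix, so a strict XML parser will not accept it as written. The sketch below checks this with the standard library only. The helper name is_well_formed is made up, and the namespace URI used for the declared variant is assumed to be the usual Media RSS one. It says nothing about how feedparser itself handles the file.

```python
import xml.etree.ElementTree as ET

# Assumption: the namespace URI normally used for the media: prefix.
MEDIA_NS = "http://search.yahoo.com/mrss/"

TEST_CASE = """<rss version="2.0">
<channel>
<item>
<title>Test</title>
<media:title type="plain">Test</media:title>
</item>
</channel>
</rss>"""


def is_well_formed(xml_text):
    """Return True if a namespace-aware XML parser accepts xml_text."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


# Same document with the media prefix declared on the root element.
DECLARED = TEST_CASE.replace(
    '<rss version="2.0">',
    '<rss version="2.0" xmlns:media="%s">' % MEDIA_NS,
    1,
)

print(is_well_formed(TEST_CASE), is_well_formed(DECLARED))
```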

Original comment by feedsa...@gmail.com on 1 Nov 2007 at 11:56

GoogleCodeExporter commented 9 years ago

I have the same issue and can confirm that the patch fixed it. Thanks!

Original comment by dre...@gmail.com on 3 May 2008 at 9:29

GoogleCodeExporter commented 9 years ago

The patch no longer applies, but at least the media:title portion of this seems to be
fixed in SVN anyway (I'm testing at revision 287).

Original comment by mary-goo...@puzzling.org on 25 Jul 2008 at 9:04

GoogleCodeExporter commented 9 years ago

Hi all,

I am fairly new to Python and the processing of feeds. FeedParser is great and I am
glad it's around. However, I have been having great problems with a WordPress Atom feed
that just would not parse correctly. Happily, the patch works a treat and I can stop
peering at code thinking I had done something wrong!

Thanks everyone

Bob

Original comment by BobToo...@gmail.com on 21 May 2009 at 11:44

GoogleCodeExporter commented 9 years ago

Please verify your bug against the current version of the code in SVN and comment on
http://code.google.com/p/feedparser/issues/detail?id=76 instead.

Original comment by adewale on 16 Dec 2009 at 2:34

Changed state: Duplicate
