pombreda / feedparser

Automatically exported from code.google.com/p/feedparser

Wrong "description" key for the entry when itunes tag is present #242

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. feed = feedparser.parse('http://www.dailymotion.com/rss/fr/visited-today/1')
2. feed.entries[0].get("description")
u'Images du tournage de la nouvelle pub Gillette'

What is the expected output? What do you see instead?
Each entry has a <description> tag, but feed.entries[X].get("description") 
returns the itunes:summary instead of the CDATA of the description tag. 
The expected output would be the value of the "summary" key:

Out[20]: u'<a href="http://www.dailymotion.com/video/xfzr7n_jcvd-pete-les-plombs_shortfilms"><img align="right" height="90" src="http://ak2.static.dailymotion.com/static/video/764/268/26862467:jpeg_preview_medium.jpg?20101207165630" width="120" /></a><p>Images du tournage de la nouvelle pub Gillette</p><p>Author: <a href="http://www.dailymotion.com/jcvd2711">< ....... (too long)

What version of the product are you using? On what operating system?
I'm using the trunk version 12.01.2010 on Ubuntu lucid. Python 2.6.5.

Please provide any additional information below.
Apart from the wrong behavior of the description, why is the entries' content 
taken from the itunes tag instead of the description?

Original issue reported on code.google.com by carlos.m...@gmail.com on 9 Dec 2010 at 11:48

GoogleCodeExporter commented 9 years ago
I'm attaching a sample document that illustrates the lax handling of 
itunes:summary and description. The problem you're seeing stems from 
description and itunes:summary being handled differently based on which appears 
first in the document, and the sample I've attached illustrates that.

By the way, 'description' is an alias key; it doesn't actually exist. It first 
checks if there's a 'subtitle', and if not, it falls back on 'summary'. I 
recommend relying on 'subtitle' and 'summary' so that you know what you're 
going to get (...on the assumption that the order of the elements in the 
document doesn't affect the result!).
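
To make the alias lookup concrete, here is a minimal sketch of the idea. It is not feedparser's actual code: the `AliasDict` class and its `ALIASES` table are hypothetical names used only for illustration, and the lookup order simply follows the description above ('subtitle' first, then 'summary').

```python
class AliasDict(dict):
    """Hypothetical dict with one alias key (illustration only)."""

    # Assumption: order matters, and the first real key that is present wins.
    ALIASES = {"description": ("subtitle", "summary")}

    def __getitem__(self, key):
        # Try each real key behind the alias; plain keys map to themselves.
        for real_key in self.ALIASES.get(key, (key,)):
            if dict.__contains__(self, real_key):
                return dict.__getitem__(self, real_key)
        raise KeyError(key)

    def get(self, key, default=None):
        try:
            return self[key]
        except KeyError:
            return default


entry = AliasDict(subtitle="short text", summary="full HTML")
print(entry.get("description"))   # resolved through the alias
print("description" in entry)     # the alias is never stored as a real key
```

Reading 'subtitle' and 'summary' directly avoids the lookup order entirely, which is why I recommend it.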

Original comment by kurtmckee on 9 Dec 2010 at 6:16

GoogleCodeExporter commented 9 years ago
After reviewing the RSS 2.0 and Atom specifications, the iTunes spec, and the 
code, I've created and attached a patch that fixes the behavior I noted in the 
above comment. It will not, however, stop the `itunes:summary` from being 
loaded into the `summary` key (for which the `description` key is an alias). I 
consider this correct behavior if the publisher is purposefully adding elements 
from the itunes namespace.

Currently, when the code encounters a `description` or `summary` element it 
checks if there's already a `summary` key. If there is, it puts the data in the 
`content` key. This is obviously a purposeful design decision, and there are 
four unit tests that check this behavior. `itunes:summary` elements are treated 
the same as `summary` elements in the code, which is why they're affected by 
this design decision as well.
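
The data-shifting behavior described above can be sketched roughly as follows. This is a simplified illustration, not the parser's real code: `store_summary` and `parse_in_order` are hypothetical helpers, and `content` is modeled as a plain list of strings rather than whatever structure the library actually stores there.

```python
SUMMARY_LIKE = ("description", "summary", "itunes:summary")


def store_summary(entry, text):
    """First summary-like value fills 'summary'; later ones shift to 'content'."""
    if "summary" in entry:
        # Assumption: 'content' is simplified to a list of strings here.
        entry.setdefault("content", []).append(text)
    else:
        entry["summary"] = text


def parse_in_order(elements):
    """Process (element name, text) pairs in document order."""
    entry = {}
    for name, text in elements:
        if name in SUMMARY_LIKE:
            store_summary(entry, text)
    return entry


# The same two elements in a different order give different results.
first = parse_in_order([("description", "full HTML"),
                        ("itunes:summary", "short text")])
second = parse_in_order([("itunes:summary", "short text"),
                         ("description", "full HTML")])
print(first)
print(second)
```

In this sketch, whichever element appears first in the document ends up in `summary`, which is the order dependence reported in this issue.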

There are three options available: (1) create a dedicated method to deal with 
`itunes:summary` elements to guarantee they only appear in the `summary` key of 
the result dictionary, (2) remove the two function definition lines so 
`itunes:summary` is stored in the `itunes_summary` key, or (3) remove the 
data-shifting behavior. I've opted for the third based on the specs' 
description of the elements:

`description`: "The item synopsis." [1]
`summary`: "Conveys a short summary, abstract, or excerpt of the entry." [2]
`itunes:summary`: "This field can be up to 4000 characters. If <itunes:summary> 
is not included, the contents of the <description> tag are used." [3]

It seems to me that all three should be treated in the same manner and placed 
in the `summary` key, which is what the patch I'm attaching does. Here's the 
list of unit tests that can be removed with this patch:

  illformed/rss/item_description_and_summary.xml
  illformed/rss/item_summary_and_description.xml
  wellformed/rss/item_description_and_summary.xml
  wellformed/rss/item_summary_and_description.xml

Tested in Python 2.4 through 3.1, git branch at:
  https://github.com/kurtmckee/feedparser/tree/issue242

[1]: http://cyber.law.harvard.edu/rss/rss.html#hrelementsOfLtitemgt
[2]: http://www.atomenabled.org/developers/syndication/#recommendedEntryElements
[3]: http://www.apple.com/itunes/podcasts/specs.html#summary

Original comment by kurtmckee on 3 Jan 2011 at 8:50

GoogleCodeExporter commented 9 years ago
@Ade: I don't know if I expressed my opinion above, but I think that putting 
the `itunes:summary` element in the `summary` key is correct behavior here. If 
you agree, then aside from considering the patch above, this report can be 
closed as wontfix.

@Carlos: If this report gets closed as wontfix, don't despair! After the next 
release, I'm going to create an experimental git branch based on a blog entry I 
wrote [1]. The changes would allow you to customize how the `itunes:summary` 
element is handled. It would only be experimental, but it may be helpful to you.

[1]: http://kurtmckee.livejournal.com/32124.html

Original comment by kurtmckee on 3 Jan 2011 at 8:59

GoogleCodeExporter commented 9 years ago
I believe the code is doing the right thing here so I'm marking this as WontFix.
I think the data Carlos wants is available from: feed.entries[0].summary and 
that seems like a simpler solution.

Original comment by adewale on 4 Jan 2011 at 3:34