libo26 / feedparser

Automatically exported from code.google.com/p/feedparser

text of namespaced elements with attributes is discarded #256

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
Using 5.0 and this feed http://xurrency.com/gbp/feed I can't get the value of 
<dc:value>. This was working fine in 4.1.

f = feedparser.parse('http://xurrency.com/gbp/feed')
e = f['entries'][0]
print type(e['dc_value'])
> <type 'dict'>
print e['dc_value']
> {'decimal': u'4', 'frequency': u'daily'}
print e['dc_value'].value
> AttributeError: 'dict' object has no attribute 'value'

Do the docs need to be updated and you're now meant to do this differently or 
is something else wrong?

Original issue reported on code.google.com by kylemacfarlane@gmail.com on 20 Feb 2011 at 10:21

GoogleCodeExporter commented 9 years ago
I downloaded the stock 4.1 release and tried the code you listed above. 
'dc_value' is a string, not a dict, so while 4.1 gives you the content, you get 
none of the element's attributes. I then tried the same thing with svn trunk 
and found that the element attributes are available (exactly as you noted 
above) but the element content isn't.

feedparser 4.1 was released five years ago, and the namespace code changed over 
time. The documentation has to be updated, no question, but I'll review the 
current behavior and the old documentation and see if there's an obvious 
solution. In the meantime, you may be able to mitigate the problem by forcing 
feedparser's behavior to your liking using code similar to:

import feedparser

# this will override unknown element behavior
# include the 'self' parameter
def start_dc_value(self, attrsD):
    self.pushContent('dc_value', attrsD, 'text/plain', 1)

def end_dc_value(self):
    value = self.popContent('dc_value')
    context = self._getContext()
    context['dc_value'] = value

# insert the new functions to override current behavior
feedparser._FeedParserMixin._start_dc_value = start_dc_value
feedparser._FeedParserMixin._end_dc_value = end_dc_value

f = feedparser.parse('http://xurrency.com/gbp/feed')
e = f.entries[0]
print e['dc_value'] # prints a string like '1.1869'

Original comment by kurtmckee on 20 Feb 2011 at 11:32

GoogleCodeExporter commented 9 years ago
I have the same issue when parsing an equivalent RSS feed. I have something like

<tag attr1="foo" attr2="bar">baz</tag>

And I want to be able to read the attrs and the content (baz).

The fix you provided did not seem to change anything. I did a json dump of the 
return value of feedparser.parse both before and after overriding the behavior 
and diffed them, and there was no diff.

(I'm using 5.0.1)

Original comment by danj...@gmail.com on 5 Sep 2011 at 6:41

GoogleCodeExporter commented 9 years ago
Oh, so I was in blind copy-and-paste mode and didn't realize that that was 
actually particular to the name of his tag. Now that I did 
s/dc_value/my_tag_name/ it worked. But I have several tags like this I want to 
fix. Is there no general solution?

Original comment by danj...@gmail.com on 5 Sep 2011 at 6:47

GoogleCodeExporter commented 9 years ago
I haven't determined the best way to handle this yet, so there isn't yet a 
convenient general solution.
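
Until then, the per-tag workaround above can at least be repeated for several 
tags without copy-and-paste. This is only a minimal sketch: the helper name 
register_text_handlers is hypothetical, and it assumes the same 5.x internals 
the workaround already relies on (pushContent, popContent, _getContext). Like 
the workaround, it keeps only the element text and drops the attributes.

```python
def register_text_handlers(mixin_cls, tag_names):
    """Attach _start_<tag>/_end_<tag> handlers that keep the element text.

    mixin_cls is assumed to provide pushContent, popContent and
    _getContext, as feedparser._FeedParserMixin does in 5.x.
    """
    for name in tag_names:
        # Bind the tag name as a default argument so each handler keeps
        # its own value instead of the loop's last one.
        def start(self, attrsD, _name=name):
            self.pushContent(_name, attrsD, 'text/plain', 1)

        def end(self, _name=name):
            value = self.popContent(_name)
            self._getContext()[_name] = value

        setattr(mixin_cls, '_start_' + name, start)
        setattr(mixin_cls, '_end_' + name, end)

# Usage, assuming feedparser 5.x (tag names here are placeholders):
# import feedparser
# register_text_handlers(feedparser._FeedParserMixin,
#                        ['dc_value', 'my_tag_name'])
```

The handlers have to be registered before calling feedparser.parse.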

Original comment by kurtmckee on 6 Sep 2011 at 3:14

GoogleCodeExporter commented 9 years ago
Issue 301 has been merged into this issue.

Original comment by kurtmckee on 9 Sep 2011 at 2:06

GoogleCodeExporter commented 9 years ago

Original comment by kurtmckee on 9 Sep 2011 at 2:07

GoogleCodeExporter commented 9 years ago
I have the same issue.

http://itunes.apple.com/us/rss/topfreeapplications/limit=10/xml

A line like this

<im:artist href="http://itunes.apple.com/us/artist/fluik/id341885018?mt=8&uo=2">Fluik</im:artist>

is broken: I can't get "Fluik" from this XML element.
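
As a stopgap when both the attributes and the text are needed, the element can 
be read with the standard library's xml.etree.ElementTree instead of 
feedparser. This is a different technique, not a feedparser fix. A minimal 
sketch: the sample document, its namespace URI, and the helper name 
read_element are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up sample; the namespace URI is a placeholder, not the real feed's.
SAMPLE = '''<feed xmlns:im="http://example.com/ns/im">
  <entry>
    <im:artist href="http://example.com/artist/1">Fluik</im:artist>
  </entry>
</feed>'''

# Prefix-to-URI mapping used to resolve 'im:' in the search path.
NS = {'im': 'http://example.com/ns/im'}

def read_element(xml_text, path, namespaces):
    """Return the attributes and text of the first element matching path."""
    root = ET.fromstring(xml_text)
    node = root.find(path, namespaces)
    if node is None:
        return None
    return {'attrs': dict(node.attrib), 'text': node.text}

artist = read_element(SAMPLE, './/im:artist', NS)
```

Here artist['text'] holds the element content and artist['attrs'] holds the 
href attribute.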

Original comment by electron...@gmail.com on 26 Sep 2011 at 6:53

GoogleCodeExporter commented 9 years ago
Issue 341 has been merged into this issue.

Original comment by kurtmckee on 9 Apr 2012 at 4:27

GoogleCodeExporter commented 9 years ago
I'm having issues with this as well. 
Would like a fix for this.

Original comment by kwh...@gmail.com on 12 Dec 2012 at 8:02

GoogleCodeExporter commented 9 years ago
Issue 420 has been merged into this issue.

Original comment by kurtmckee on 10 Jul 2014 at 4:42