Open miigotu opened 8 years ago
Same with non-namespaced nodes. With http://lolo.sickbeard.com/api?t=caps I only get back category 8000, but I do seem to get all of the subcats for the category it does return, even if there is more than one.
Having the same problem with newznab api
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:newznab="http://www.newznab.com/DTD/2010/feeds/attributes/">
<channel>
<title>example.com</title>
<description>example.com API results</description>
<!--
More RSS content
-->
<!-- offset is the current offset of the response
total is the total number of items found by the query
-->
<newznab:response offset="0" total="1234"/>
<item>
<!-- Standard RSS 2.0 data -->
<title>A.Public.Domain.Album.Name</title>
<guid isPermaLink="true">http://servername.com/rss/viewnzb/e9c515e02346086e3a477a5436d7bc8c</guid>
<link>http://servername.com/rss/nzb/e9c515e02346086e3a477a5436d7bc8c&amp;i=1&amp;r=18cf9f0a736041465e3bd521d00a90b9</link>
<comments>http://servername.com/rss/viewnzb/e9c515e02346086e3a477a5436d7bc8c#comments</comments>
<pubDate>Sun, 06 Jun 2010 17:29:23 +0100</pubDate>
<category>Music > MP3</category>
<description>Some music</description>
<enclosure url="http://servername.com/rss/nzb/e9c515e02346086e3a477a5436d7bc8c&amp;i=1&amp;r=18cf9f0a736041465e3bd521d00a90b9" length="154653309" type="application/x-nzb" />
<!-- Additional attributes -->
<newznab:attr name="category" value="3000" />
<newznab:attr name="category" value="3010" />
<newznab:attr name="size" value="144967295" />
<newznab:attr name="artist" value="Bob Smith" />
<newznab:attr name="album" value="Groovy Tunes" />
<newznab:attr name="publisher" value="Epic Music" />
<newznab:attr name="year" value="2011" />
<newznab:attr name="tracks" value="track one|track two|track three" />
<newznab:attr name="coverurl" value="http://servername.com/covers/music/12345.jpg" />
<newznab:attr name="review" value="This album is great" />
</item>
</channel>
</rss>
All I get is the 'newznab' namespace populated with 'attr' and a single 'category' node. I checked the debug object and all other tags are simply lost by feedparser. In contrast, xml.dom.minidom gives me a list of all nodes if I call 'dom.getElementsByTagNameNS'. From a quick look, the formatting is within the W3C specs.
Given the issue is over a year old, I assume feedparser development has been halted?
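For reference, the minidom behaviour described above can be reproduced with a short script. This is only a sketch: SAMPLE is a trimmed-down copy of the feed above (an assumption, not real API output), and NEWZNAB_NS is just a local name for the namespace URI.

```python
from xml.dom import minidom

NEWZNAB_NS = "http://www.newznab.com/DTD/2010/feeds/attributes/"

# Assumption: a shortened version of the sample feed posted above.
SAMPLE = """<rss version="2.0" xmlns:newznab="http://www.newznab.com/DTD/2010/feeds/attributes/">
<channel>
<item>
<title>A.Public.Domain.Album.Name</title>
<newznab:attr name="category" value="3000" />
<newznab:attr name="category" value="3010" />
<newznab:attr name="size" value="144967295" />
</item>
</channel>
</rss>"""

dom = minidom.parseString(SAMPLE)

# getElementsByTagNameNS matches on namespace URI plus local name,
# so every newznab:attr element is returned, including repeated names.
attrs = [
    (node.getAttribute("name"), node.getAttribute("value"))
    for node in dom.getElementsByTagNameNS(NEWZNAB_NS, "attr")
]
print(attrs)
```

Both category values come back here, which is the part that goes missing in the feedparser result.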
Yeah, same here. I also want to parse newznab attributes, but feedparser won't let me.
I guess I will have to create my own parser and all the pain that comes with it.
@Andy2244 might want to check out Python's xml.etree.cElementTree, it's blazing fast at parsing XML.
For example:
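from urllib.request import urlopen  # Python 3; on Python 2 this was urllib2
import xml.etree.ElementTree as ET  # Python 3 picks the C accelerator automatically (cElementTree was removed in 3.9)

rss_data = urlopen("https://api.nzbgeek.info/rss?t=2000&dl=1&num=200&r=xx")
tree = ET.parse(rss_data)
root = tree.getroot()

# Need to define namespaces
ns = {'newznab': 'http://www.newznab.com/DTD/2010/feeds/attributes/',
      'nZEDb': 'http://www.newznab.com/DTD/2010/feeds/attributes/'}

# The root is <rss>, so the items live under ./channel/item
for item in root.findall('./channel/item'):
    b = item.find("newznab:attr[@name='size']", ns)
    if b is None:  # test against None: an Element without children is falsy, so `or` would discard a match
        b = item.find("nZEDb:attr[@name='size']", ns)
    if b is not None:
        print(b.get('value'))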
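The same approach can be extended to collect every attr of an item instead of just size. The following is a minimal sketch, assuming a trimmed-down copy of the sample feed from earlier in the thread; collect_attrs, SAMPLE and NS are hypothetical names chosen for illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

NS = {"newznab": "http://www.newznab.com/DTD/2010/feeds/attributes/"}

# Assumption: a shortened version of the sample feed posted earlier.
SAMPLE = """<rss version="2.0" xmlns:newznab="http://www.newznab.com/DTD/2010/feeds/attributes/">
<channel>
<item>
<title>A.Public.Domain.Album.Name</title>
<newznab:attr name="category" value="3000" />
<newznab:attr name="category" value="3010" />
<newznab:attr name="size" value="144967295" />
</item>
</channel>
</rss>"""


def collect_attrs(item):
    """Return {name: [value, ...]} for every newznab:attr child of an <item>."""
    collected = defaultdict(list)
    for attr in item.findall("newznab:attr", NS):
        collected[attr.get("name")].append(attr.get("value"))
    return dict(collected)


root = ET.fromstring(SAMPLE)
results = [
    (item.findtext("title"), collect_attrs(item))
    for item in root.findall("./channel/item")
]
print(results)
```

Storing every value in a list keeps repeated names such as category intact, at the cost of single-valued attributes also being wrapped in a list.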
Here is what I'm doing as a quick fix, since I generally like the ease of use of feedparser.
from xml.dom import minidom

NAMESPACE_NAME = 'newznab'
NAMESPACE_URL = 'http://www.newznab.com/DTD/2010/feeds/attributes/'
NAMESPACE_TAGNAME = 'attr'

# feedparser can't handle namespaced elements that share a tag name, so rename those nodes.
# This is a method excerpt: `self` and `log` are presumably provided by the surrounding application.
def make_feedparser_friendly(self, data):
    try:
        dom = minidom.parseString(data)
        items_ns = dom.getElementsByTagNameNS(NAMESPACE_URL, NAMESPACE_TAGNAME)
        if items_ns:
            for node in items_ns:
                if node.attributes and 'name' in node.attributes and 'value' in node.attributes:
                    # toxml() serialises using tagName, so this changes <newznab:attr> to <newznab:NAME>
                    node.tagName = NAMESPACE_NAME + ':%s' % node.attributes['name'].value
                    node.name = node.attributes['name'].value
                    node.value = node.attributes['value'].value
    except Exception as ex:
        log.trace('Unable to rename nodes in XML: %s' % ex)
        return None
    return dom.toxml()
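To see what the renaming does to the XML, here is a standalone sketch of the same idea run on a tiny inline feed. rename_attr_nodes and SAMPLE are hypothetical names, the sample is an assumption rather than real API output, and feedparser itself is deliberately left out.

```python
from xml.dom import minidom

NAMESPACE_NAME = "newznab"
NAMESPACE_URL = "http://www.newznab.com/DTD/2010/feeds/attributes/"

# Assumption: a tiny hand-written feed, not real API output.
SAMPLE = (
    '<rss version="2.0" xmlns:newznab="%s"><channel><item>'
    '<title>A.Public.Domain.Album.Name</title>'
    '<newznab:attr name="size" value="144967295"/>'
    '<newznab:attr name="year" value="2011"/>'
    '</item></channel></rss>' % NAMESPACE_URL
)


def rename_attr_nodes(data):
    """Rename each newznab:attr element after its own name attribute."""
    dom = minidom.parseString(data)
    for node in dom.getElementsByTagNameNS(NAMESPACE_URL, "attr"):
        if node.hasAttribute("name") and node.hasAttribute("value"):
            # minidom writes elements out using tagName, so the rename shows up in toxml()
            node.tagName = "%s:%s" % (NAMESPACE_NAME, node.getAttribute("name"))
    return dom.toxml()


renamed = rename_attr_nodes(SAMPLE)
print(renamed)
```

Each attr element ends up with a distinct tag such as newznab:size. Names that repeat within one item (category, for instance) would still produce identically named elements, so this alone may not be enough for those.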
I ditched feedparser altogether due to this, because that suited my application. Now I parse both XML and HTML pages using bs4.
Yeah, I checked BS, but the XML part depends on an external, working lxml, which is a pain to install on Windows. That's why I picked minidom and just hotfix the XML namespace attributes.
You don't need lxml; I use html5lib as the parser for everything.
This is pretty hacky, but it lets me use feedparser without having to parse the data twice. I'm sure it can be improved and made generic, so that values are automatically converted to lists when an overwrite would occur, but this is good enough for me for now:
diff --git a/lib/feedparser/api.py b/lib/feedparser/api.py
index 614bd2d..12eafd2 100644
--- a/lib/feedparser/api.py
+++ b/lib/feedparser/api.py
@@ -60,6 +60,7 @@ from .sanitizer import replace_doctype
 from .sgml import *
 from .urls import _convert_to_idn, _makeSafeAbsoluteURI
 from .util import FeedParserDict
+from . import USER_AGENT

 bytes_ = type(b'')
 unicode_ = type('')
diff --git a/lib/feedparser/util.py b/lib/feedparser/util.py
index f7c02c0..df36b3e 100644
--- a/lib/feedparser/util.py
+++ b/lib/feedparser/util.py
@@ -122,9 +122,23 @@ class FeedParserDict(dict):
     def __setitem__(self, key, value):
         key = self.keymap.get(key, key)
-        if isinstance(key, list):
-            key = key[0]
-        return dict.__setitem__(self, key, value)
+        if key == 'newznab_attr':
+            if isinstance(value, dict) and value.keys() == ['name', 'value']:
+                key = value['name']
+                value = value['value']
+
+            if not dict.__contains__(self, 'categories'):
+                dict.__setitem__(self, 'categories', [])
+
+            if key == 'category':
+                self['categories'].append(value)
+            else:
+                dict.__setitem__(self, key, value)
+        else:
+            if isinstance(key, list):
+                key = key[0]
+
+            return dict.__setitem__(self, key, value)

     def setdefault(self, key, value):
         if key not in self:
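The generic variant mentioned above (turning a value into a list when an overwrite would occur) could look roughly like this. It is a minimal sketch on a plain dict subclass: MultiValueDict is a hypothetical name, and none of this is part of feedparser or of the patch.

```python
class MultiValueDict(dict):
    """Hypothetical dict that keeps every value assigned to a key.

    The first assignment stores the value as-is; later assignments to the
    same key turn the entry into a list and append to it.
    """

    def __setitem__(self, key, value):
        if key in self:
            existing = dict.__getitem__(self, key)
            if isinstance(existing, list):
                existing.append(value)
            else:
                dict.__setitem__(self, key, [existing, value])
        else:
            dict.__setitem__(self, key, value)


entry = MultiValueDict()
for name, value in [("category", "3000"), ("category", "3010"), ("size", "144967295")]:
    entry[name] = value
print(entry)
```

One caveat with this design: a value that is itself a list cannot be told apart from an accumulated list, so a real implementation would probably need a dedicated wrapper type or an explicit set of multi-valued keys.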
I have an issue I can't seem to work out on my own; I'm not sure whether it is a bug or whether I'm just failing to find a workaround. Maybe you can point me in the right direction.
In the following XML, the parsed output only includes the last of the torznab:attr elements. The torznab xmlns does not resolve, which I don't think matters from looking at the code, since external xmlns resolution seems to be disabled by default. Is this a bug, or is there a way for me to get the value of all 4 torznab:attr elements somehow?