kurtmckee / feedparser

Parse feeds in Python
https://feedparser.readthedocs.io

Namespace attrs overwritten by dupekeys? #39

Open · miigotu opened this issue 8 years ago

miigotu commented 8 years ago

I have an issue I can't seem to work out on my own, and I'm not sure whether it's a bug or my failure to find a workaround. Maybe you can point me in the right direction.

In the following XML, the parsed output only includes the last of the torznab:attr elements. The torznab xmlns does not resolve, but from looking at the code I don't think that matters, since external xmlns resolution seems to be disabled by default.

Is this a bug, or is there a way for me to get the values of all four torznab:attr elements?

<?xml version="1.0" encoding="UTF-8"?>
<rss version="1.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:torznab="http://torznab.com/schemas/2015/feed">
  <channel>
    <atom:link href="http://127.0.0.1:9117/" rel="self" type="application/rss+xml" />
    <title>TORZNAB</title>
    <description>TORZNAB</description>
    <link>https://torznab.org/</link>
    <language>en-us</language>
    <category>search</category>
    <image>
      <url>http://127.0.0.1:9117/logos/TORZNAB.png</url>
      <title>TORZNAB</title>
      <link>https://torznab.org/</link>
      <description>TORZNAB</description>
    </image>
    <item>
      <title>Ubuntu.14.10.Desktop.64bit.ISO</title>
      <guid>https://torznab.org/B415C913643E5FF49FE37D304BBB5E6E11AD5101/comments</guid>
      <comments>https://torznab.org/B415C913643E5FF49FE37D304BBB5E6E11AD5101/comments</comments>
      <pubDate>Sat, 06 Jul 2013 03:57:49 -0700</pubDate>
      <size>1159641169</size>
      <description>Ubuntu.14.10.Desktop.64bit.ISO</description>
      <link>magnet:?xt=urn:btih:B415C913643E5FF49FE37D304BBB5E6E11AD5101&amp;dn=ubuntu+14+10+desktop+64bit+iso&amp;tr=udp%3A%2F%2Ftracker.publicbt.com%2Fannounce&amp;tr=udp%3A%2F%2Fopen.demonii.com%3A1337</link>
      <category>4020</category>
      <enclosure url="magnet:?xt=urn:btih:B415C913643E5FF49FE37D304BBB5E6E11AD5101&amp;dn=ubuntu+14+10+desktop+64bit+iso&amp;tr=udp%3A%2F%2Ftracker.publicbt.com%2Fannounce&amp;tr=udp%3A%2F%2Fopen.demonii.com%3A1337" length="253217700" type="application/x-bittorrent" />
      <torznab:attr name="magneturl" value="magnet:?xt=urn:btih:B415C913643E5FF49FE37D304BBB5E6E11AD5101&amp;dn=ubuntu+14+10+desktop+64bit+iso&amp;tr=udp%3A%2F%2Ftracker.publicbt.com%2Fannounce&amp;tr=udp%3A%2F%2Fopen.demonii.com%3A1337" />
      <torznab:attr name="seeders" value="115" />
      <torznab:attr name="peers" value="8" />
      <torznab:attr name="infohash" value="B415C913643E5FF49FE37D304BBB5E6E11AD5101" />
    </item>
  </channel>
</rss>
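
For reference, a minimal reproduction of what I'm seeing (assuming the XML above is saved as torznab.xml; feedparser flattens the prefixed element into a single torznab_attr key on the entry):

import feedparser

d = feedparser.parse('torznab.xml')
entry = d.entries[0]

# only the last <torznab:attr> survives, because every one of them is
# written to the same 'torznab_attr' key and overwrites the previous value
print(entry.get('torznab_attr'))
# roughly: {'name': 'infohash', 'value': 'B415C913643E5FF49FE37D304BBB5E6E11AD5101'}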
miigotu commented 8 years ago

Same with non-namespaced nodes: with http://lolo.sickbeard.com/api?t=caps I only get back category 8000, but I do seem to get all of the subcats for the category it does return, even when there is more than one.

Andy2244 commented 7 years ago

Having the same problem with the newznab API.

<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:newznab="http://www.newznab.com/DTD/2010/feeds/attributes/">
<channel>
    <title>example.com</title>
    <description>example.com API results</description>
    <!--
      More RSS content
    -->

    <!-- offset is the current offset of the response
         total is the total number of items found by the query
    -->
    <newznab:response offset="0" total="1234"/>

    <item>
      <!-- Standard RSS 2.0 data -->
      <title>A.Public.Domain.Album.Name</title>
      <guid isPermaLink="true">http://servername.com/rss/viewnzb/e9c515e02346086e3a477a5436d7bc8c</guid>
      <link>http://servername.com/rss/nzb/e9c515e02346086e3a477a5436d7bc8c&amp;i=1&amp;r=18cf9f0a736041465e3bd521d00a90b9</link>
      <comments>http://servername.com/rss/viewnzb/e9c515e02346086e3a477a5436d7bc8c#comments</comments>
      <pubDate>Sun, 06 Jun 2010 17:29:23 +0100</pubDate>
      <category>Music > MP3</category>
      <description>Some music</description>
      <enclosure url="http://servername.com/rss/nzb/e9c515e02346086e3a477a5436d7bc8c&amp;i=1&amp;r=18cf9f0a736041465e3bd521d00a90b9" length="154653309" type="application/x-nzb" />

      <!-- Additional attributes -->
      <newznab:attr name="category" value="3000" />
      <newznab:attr name="category" value="3010" />
      <newznab:attr name="size"     value="144967295" />
      <newznab:attr name="artist"   value="Bob Smith" />
      <newznab:attr name="album"    value="Groovy Tunes" />
      <newznab:attr name="publisher" value="Epic Music" />
      <newznab:attr name="year"     value="2011" />
      <newznab:attr name="tracks"   value="track one|track two|track three" />
      <newznab:attr name="coverurl" value="http://servername.com/covers/music/12345.jpg" />
      <newznab:attr name="review"   value="This album is great" />
    </item>

</channel>
</rss>

All I get is the 'newznab' namespace populated with 'attr' and a single 'category' node. I checked the debug object and all the other tags are simply lost by feedparser. In contrast, xml.dom.minidom will give me a list of all the nodes if I call 'dom.getElementsByTagNameNS'. From a quick look, the formatting is within the W3C specs.
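
For comparison, a quick sketch of that minidom lookup (assuming the feed above is saved as newznab.xml):

from xml.dom import minidom

dom = minidom.parse('newznab.xml')
# returns every <newznab:attr> node, not just the last one
attrs = dom.getElementsByTagNameNS(
    'http://www.newznab.com/DTD/2010/feeds/attributes/', 'attr')
for node in attrs:
    print('%s = %s' % (node.getAttribute('name'), node.getAttribute('value')))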

Given that the issue is over a year old, I assume feedparser development has been halted?

Safihre commented 7 years ago

Yeah, same here, I also want to parse newznab attributes but feedparser won't let me. I guess I will have to create my own parser, with all the pain that comes with it. @Andy2244 might want to check out Python's xml.etree.cElementTree, it's blazing fast at parsing XML.

For example:

from urllib2 import urlopen
import xml.etree.cElementTree as ET

rss_data = urlopen("https://api.nzbgeek.info/rss?t=2000&dl=1&num=200&r=xx")

tree = ET.parse(rss_data)
root = tree.getroot()

# Need to define namespaces
ns = {'newznab': 'http://www.newznab.com/DTD/2010/feeds/attributes/',
      'nZEDb': 'http://www.newznab.com/DTD/2010/feeds/attributes/'}

for item in root.findall('./channel/item', ns):
    # compare against None explicitly: an Element with no children is falsy
    b = item.find("newznab:attr[@name='size']", ns)
    if b is None:
        b = item.find("nZEDb:attr[@name='size']", ns)
    if b is not None:
        print b.get('value')
Andy2244 commented 7 years ago

Here is what I'm doing as a quick fix, since I generally like the ease of use of feedparser.

from xml.dom import minidom

NAMESPACE_NAME = 'newznab'
NAMESPACE_URL = 'http://www.newznab.com/DTD/2010/feeds/attributes/'
NAMESPACE_TAGNAME = 'attr'

    # feedparser can't handle namespaced elements that share a tag name, so rename those nodes.
    def make_feedparser_friendly(self, data):
        try:
            dom = minidom.parseString(data)
            items_ns = dom.getElementsByTagNameNS(NAMESPACE_URL, NAMESPACE_TAGNAME)
            for node in items_ns:
                if node.hasAttribute('name') and node.hasAttribute('value'):
                    # e.g. <newznab:attr name="seeders"/> becomes <newznab:seeders name="seeders"/>
                    node.tagName = NAMESPACE_NAME + ':%s' % node.attributes['name'].value
        except Exception as ex:
            log.trace('Unable to rename nodes in XML: %s' % ex)
            return None
        return dom.toxml()
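
The rewritten XML then goes straight back into feedparser, roughly like this (raw_xml being the response body that was already fetched elsewhere):

fixed_xml = self.make_feedparser_friendly(raw_xml)
if fixed_xml:
    feed = feedparser.parse(fixed_xml)
    # the renamed nodes now show up as separate keys,
    # e.g. feed.entries[0].get('newznab_seeders')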
miigotu commented 7 years ago

I ditched feedparser altogether because of this, since that suited my application. Now I parse both XML and HTML pages using bs4.

Andy2244 commented 7 years ago

Yeah, I checked BS, but the XML part depends on an external, working lxml, which is a pain to install on Windows. That's why I picked minidom and just hotfix the XML namespace attributes.

miigotu commented 7 years ago

You don't need lxml, I use html5lib as the parser for everything.
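
Something like this (a sketch, assuming bs4 and html5lib are installed; html5lib parses the document as HTML, so tag names are lowercased but the prefixed names like torznab:attr stay searchable):

from bs4 import BeautifulSoup

with open('torznab.xml') as handle:
    soup = BeautifulSoup(handle, 'html5lib')

# every <torznab:attr> element comes back, nothing gets overwritten
for attr in soup.find_all('torznab:attr'):
    print('%s = %s' % (attr.get('name'), attr.get('value')))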

miigotu commented 7 years ago

This is pretty hacky, but it makes it possible to use feedparser without having to double-parse the data. I'm sure it can be improved and made generic, so that values are automatically converted to lists whenever an overwrite would occur, but this is good enough for me for now:

diff --git a/lib/feedparser/api.py b/lib/feedparser/api.py
index 614bd2d..12eafd2 100644
--- a/lib/feedparser/api.py
+++ b/lib/feedparser/api.py
@@ -60,6 +60,7 @@ from .sanitizer import replace_doctype
 from .sgml import *
 from .urls import _convert_to_idn, _makeSafeAbsoluteURI
 from .util import FeedParserDict
+from . import USER_AGENT

 bytes_ = type(b'')
 unicode_ = type('')
diff --git a/lib/feedparser/util.py b/lib/feedparser/util.py
index f7c02c0..df36b3e 100644
--- a/lib/feedparser/util.py
+++ b/lib/feedparser/util.py
@@ -122,9 +122,23 @@ class FeedParserDict(dict):

     def __setitem__(self, key, value):
         key = self.keymap.get(key, key)
-        if isinstance(key, list):
-            key = key[0]
-        return dict.__setitem__(self, key, value)
+        if key == 'newznab_attr':
+            if isinstance(value, dict) and sorted(value.keys()) == ['name', 'value']:
+                key = value['name']
+                value = value['value']
+
+            if not dict.__contains__(self, 'categories'):
+                dict.__setitem__(self, 'categories', [])
+
+            if key == 'category':
+                self['categories'].append(value)
+            else:
+                dict.__setitem__(self, key, value)
+        else:
+            if isinstance(key, list):
+                key = key[0]
+
+            return dict.__setitem__(self, key, value)

     def setdefault(self, key, value):
         if key not in self:
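
With that patch applied, the intent is that each renamed attribute ends up as its own entry key (a rough sketch, with newznab_xml holding the newznab sample from earlier in the thread):

import feedparser

d = feedparser.parse(newznab_xml)
entry = d.entries[0]

# each <newznab:attr name="..." value="..."/> becomes a separate key ...
print(entry.get('artist'))   # 'Bob Smith'
print(entry.get('size'))     # '144967295'
# ... while the duplicate 'category' attrs are meant to accumulate in the
# 'categories' list instead of overwriting one another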