kurtmckee / feedparser

Parse feeds in Python
https://feedparser.readthedocs.io

Debian's UDD feeds freak out feedparser #112

Open anarcat opened 7 years ago

anarcat commented 7 years ago

My personal UDD todo list feed breaks feedparser. If you add the feed as a test file in the "illformed" directory, tox says:

GLOB sdist-make: /home/anarcat/dist/feedparser/setup.py
py27 create: /home/anarcat/dist/feedparser/.tox/py27
py27 inst: /home/anarcat/dist/feedparser/.tox/dist/feedparser-5.2.1.zip
py27 installed: feedparser==5.2.1,pkg-resources==0.0.0
py27 runtests: PYTHONHASHSEED='1353716627'
py27 runtests: commands[0] | /home/anarcat/dist/feedparser/.tox/py27/bin/python tests/runtests.py
Traceback (most recent call last):
  File "tests/runtests.py", line 835, in <module>
    runtests()
  File "tests/runtests.py", line 789, in runtests
    description, evalString, skipUnless = getDescription(xmlfile, data)
  File "tests/runtests.py", line 740, in getDescription
    raise RuntimeError("can't parse %s" % xmlfile)
RuntimeError: can't parse ./tests/illformed/udd.xml
ERROR: InvocationError: '/home/anarcat/dist/feedparser/.tox/py27/bin/python tests/runtests.py'
py35 create: /home/anarcat/dist/feedparser/.tox/py35
py35 inst: /home/anarcat/dist/feedparser/.tox/dist/feedparser-5.2.1.zip
py35 installed: feedparser==5.2.1,pkg-resources==0.0.0,sgmllib3k==1.0.0
py35 runtests: PYTHONHASHSEED='1353716627'
py35 runtests: commands[0] | /home/anarcat/dist/feedparser/.tox/py35/bin/python tests/runtests.py
Traceback (most recent call last):
  File "tests/runtests.py", line 835, in <module>
    runtests()
  File "tests/runtests.py", line 789, in runtests
    description, evalString, skipUnless = getDescription(xmlfile, data)
  File "tests/runtests.py", line 740, in getDescription
    raise RuntimeError("can't parse %s" % xmlfile)
RuntimeError: can't parse ./tests/illformed/udd.xml
ERROR: InvocationError: '/home/anarcat/dist/feedparser/.tox/py35/bin/python tests/runtests.py'
_______________________________________________________________________________ summary ________________________________________________________________________________
ERROR:   py27: commands failed
ERROR:   py35: commands failed

the problem seems to be that some entries have no guid field and an empty link field, which breaks (reasonable) expectations in feedparser...
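To make the suspected cause concrete, here is a minimal sketch that flags RSS items with neither a guid nor a non-empty link. The sample feed is hypothetical (it is not the actual UDD output), and `items_missing_identity` is an illustrative helper built on the standard library's `xml.etree.ElementTree`, not part of feedparser.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample shaped like the problem described above:
# the second item has no <guid> and an empty <link>.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>example todo list</title>
    <item>
      <title>first task</title>
      <link>https://example.org/1</link>
      <guid>https://example.org/1</guid>
    </item>
    <item>
      <title>second task</title>
      <link></link>
    </item>
  </channel>
</rss>
"""


def items_missing_identity(rss_text):
    """Return titles of <item> elements with neither a guid nor a non-empty link."""
    root = ET.fromstring(rss_text)
    missing = []
    for item in root.iter("item"):
        # findtext() returns None for a missing element and "" for an empty one.
        guid = (item.findtext("guid") or "").strip()
        link = (item.findtext("link") or "").strip()
        if not guid and not link:
            missing.append(item.findtext("title"))
    return missing


print(items_missing_identity(SAMPLE_RSS))
```

Running the same check against a saved copy of a real feed would show whether it contains entries of this shape.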

twm commented 6 years ago

What behavior do you expect from feedparser in this case? Should the invalid entries be silently ignored? Should feedparser produce entries without a link?

Maybe UDD should be fixed? That feed is not valid.

anarcat commented 6 years ago

it should:

  1. not crash

  2. make an educated guess at a UID

I do this in feed2exec:

        if not item.get('id'):
            item['id'] = item.get('title')

it's just a dumb heuristic, but it works better than crashing on an arbitrary feed.

at the very least, i would want feedparser to be robust (i.e. not crash) on bad content. delivering a non-empty feed is extra...

twm commented 6 years ago

Hmm, that heuristic would work in this particular case but in the wild repeated entry titles are pretty common (e.g., http://www.pusheen.com/rss) so I wouldn't want it built into feedparser except on an opt-in basis. As a feedparser user I'd rather have no ID than a heuristic that I can't fix.

My first inclination for a heuristic would have been to use the item date as a final fall-back, but that doesn't work for this feed either. :-/ So maybe skipping 'id' or making it the empty string is best in this case. Then you can add heuristics on top (e.g., a more robust one would be to hash all the item fields in cases like this).
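A minimal sketch of the "hash all the item fields" idea, meant to live in the calling application rather than in feedparser. `fallback_id` is a hypothetical helper, the sample entries are made up, and the code assumes each entry behaves like a plain dict.

```python
import hashlib
import json


def fallback_id(item):
    """Return the entry's own 'id' if present, else a hash of all its fields."""
    if item.get("id"):
        return item["id"]
    # Serialize deterministically so the same entry always hashes the same;
    # default=str covers values json cannot encode natively.
    payload = json.dumps(item, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Two made-up entries that share a title but differ in content.
entries = [
    {"title": "same title", "summary": "first body", "link": ""},
    {"title": "same title", "summary": "second body", "link": ""},
]
ids = [fallback_id(entry) for entry in entries]
```

Unlike a title-only fallback, entries that share a title still get distinct IDs here. The trade-off is that the ID changes whenever any field of the entry is edited upstream.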

anarcat commented 6 years ago

yep, i don't mind rolling my own heuristics here... i guess what i need here is for feedparser to ... er... not crash. :)

kurtmckee commented 6 years ago

@anarcat, are you still seeing this behavior? If so, I'll jump in on this and work to get feedparser to quit crashing.

Re: GUID heuristics, feedparser won't be updated to inject GUIDs, but you're right, feedparser shouldn't be crashing!! =)

anarcat commented 6 years ago

i still get the same error as originally reported. should i send a PR to get the failing unit test in place?

to reproduce, you simply need to do this:

wget -O tests/illformed/udd.xml 'https://udd.debian.org/dmd/?email1=anarcat%40debian.org&email2=&email3=&packages=&ignpackages=photofloat&nosponsor1=on&format=rss#todo'

and run the test suite.

kurtmckee commented 6 years ago

Perfect, I'll try to get this fixed.

buhtz commented 5 years ago

FYI: There is also another problem with Debian-related feeds. https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=926074

Please open a bug report in Debian against the tracker.debian.org package and post the link here. Thanks.