Open anarcat opened 7 years ago
What behavior do you expect from feedparser in this case? Should the invalid entries be silently ignored? Should feedparser produce entries without a link?
Maybe UDD should be fixed? That feed is not valid.
it should:
not crash
make an educated guess at a UID
I do this in feed2exec:
if not item.get('id'):
    item['id'] = item.get('title')
it's just a dumb heuristic, but it works better than crashing on an arbitrary feed.
at the very least, i would want feedparser to be robust (i.e. not crash) on bad content. delivering a non-empty feed is extra...
Hmm, that heuristic would work in this particular case but in the wild repeated entry titles are pretty common (e.g., http://www.pusheen.com/rss) so I wouldn't want it built into feedparser except on an opt-in basis. As a feedparser user I'd rather have no ID than a heuristic that I can't fix.
My first inclination for a heuristic would have been to use the item date as a final fall-back, but that doesn't work for this feed either. :-/ So maybe skipping 'id' or making it the empty string is best in this case. Then you can add heuristics on top (e.g., a more robust one would be to hash all the item fields in cases like this).
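For illustration, the "hash all the item fields" idea could look roughly like the sketch below. It is not part of feedparser: `fallback_id` is a hypothetical helper name, and it assumes each entry behaves like a plain dict whose values can be serialized (anything JSON can't handle is passed through `str()`).

```python
import hashlib
import json


def fallback_id(item):
    """Return the entry's own 'id' if it has one, else a hash of all fields.

    Hypothetical helper: assumes `item` is dict-like, as parsed entries are.
    """
    existing = item.get('id')
    if existing:
        return existing
    # Serialize deterministically: sorted keys, non-JSON values via str().
    payload = json.dumps(item, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode('utf-8')).hexdigest()


# Two entries with the same title but different content get different IDs,
# which a title-only heuristic cannot do.
a = {'title': 'Same title', 'summary': 'first post'}
b = {'title': 'Same title', 'summary': 'second post'}
print(fallback_id(a) != fallback_id(b))
```

The trade-off is that any change to any field changes the ID, so an edited entry would look like a new one. That is one more reason to keep this kind of heuristic in the calling application rather than in the parser.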
yep, i don't mind rolling my own heuristics here... i guess what i need here is for feedparser to ... er... not crash. :)
@anarcat, are you still seeing this behavior? If so, I'll jump in on this and work to get feedparser to quit crashing.
Re: GUID heuristics, feedparser won't be updated to inject GUIDs but you're right, feedparser shouldn't be crashing!! =)
i still get the same error as originally reported. should i send a PR to get the failing unit test in place?
to reproduce, you simply need to do this:
wget -O tests/illformed/udd.xml 'https://udd.debian.org/dmd/?email1=anarcat%40debian.org&email2=&email3=&packages=&ignpackages=photofloat&nosponsor1=on&format=rss#todo'
and run the test suite.
Perfect, I'll try to get this fixed.
FYI: There is also another problem with Debian-related feeds. https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=926074
Please open a bug report in Debian against the tracker.debian.org package and post the link here. Thanks.
My personal UDD todo list breaks feedparser. If you add the tests to the "illformed" directory, tox says:
the problem seems to be there is no guid field and an empty link field on some entries, which breaks (reasonable) expectations from feedparser...
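To check whether a given feed has entries of that shape before handing it to a parser, something along these lines might help. It is only a sketch using the standard library: the sample document is made up (it is not the real UDD output), and `find_suspect_items` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Made-up RSS sample: the second item has no <guid> and an empty <link>.
SAMPLE = """<rss version="2.0"><channel>
<title>todo</title>
<item><title>ok</title><link>https://example.org/1</link><guid>https://example.org/1</guid></item>
<item><title>broken</title><link></link></item>
</channel></rss>"""


def find_suspect_items(xml_text):
    """Return the titles of items with no guid and an empty or missing link."""
    root = ET.fromstring(xml_text)
    suspects = []
    for item in root.iter('item'):
        # findtext() returns None for a missing element, '' for an empty one.
        guid = (item.findtext('guid') or '').strip()
        link = (item.findtext('link') or '').strip()
        if not guid and not link:
            suspects.append(item.findtext('title', default=''))
    return suspects


print(find_suspect_items(SAMPLE))
```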