kurtmckee / feedparser

Parse feeds in Python
https://feedparser.readthedocs.io

Problems with heise newsfeed #189

Closed buhtz closed 5 years ago

buhtz commented 5 years ago

This is not really a bug report against feedparser. It is more a question to the community about how to handle this especially strange newsfeed. I contacted the admins and also another person at Heise, but without any reaction. I am surprised that Heise ignores something like this.

The modified date of the feed does not behave as expected. After waiting some minutes the date has no effect anymore and the server responds with the full set of entries, which are duplicates in this case. To illustrate this, please see this Python session.

Python 3.7.3 (default, Apr 3 2019, 05:39:12)
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import feedparser as fp
>>> url = 'https://www.heise.de/rss/heise-atom.xml'
>>> import datetime
>>> str(datetime.datetime.now())
'2019-09-21 21:42:55.583391'
>>>
>>> feed_a = fp.parse(url)
>>> len(feed_a.entries)
68
>>> feed_b = fp.parse(url, modified=feed_a.modified)
>>> len(feed_b.entries)
0
>>> str(datetime.datetime.now())
'2019-09-21 21:43:41.353641'

As expected: 68 entries after the first request to the feed, and 0 from the second request when the modified date from the first request is used. Fine.

But after waiting roughly 10 minutes:

>>> str(datetime.datetime.now())
'2019-09-21 21:53:16.371892'
>>> feed_c = fp.parse(url, modified=feed_b.modified)
>>> len(feed_c.entries)
68
>>> str(datetime.datetime.now())
'2019-09-21 21:53:36.938793'

The modified date from the second request is effectively ignored and the feed responds again with the full 68 entries. These are duplicates of the first 68 entries.

While developing my newsreader I found a lot of special feeds, so I needed to implement some workarounds and special handling routines. But here I do not know how to handle this. What would be a good workaround?

Dropping ETags and modified dates for all feeds and going back to fetching all entries every time and comparing them is not an option.

DBeath commented 5 years ago

The way I get around this is to not use feedparser for fetching the feeds. I use Requests for fetching (passing Last-Modified and ETag headers if available), and then pass the response body to feedparser for the actual parsing. This way you can do whatever processing you'd like on the response before parsing the feed with feedparser.
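A minimal sketch of that fetch-then-parse split, assuming hypothetical names (fetch_and_parse and its arguments are illustrative, not taken from DBeath's actual code):

import feedparser
import requests

def fetch_and_parse(url, last_modified=None, etag=None):
    # Fetch the feed with conditional headers, then hand the body to feedparser.
    headers = {}
    if last_modified:
        headers['If-Modified-Since'] = last_modified
    if etag:
        headers['If-None-Match'] = etag

    response = requests.get(url, headers=headers, timeout=30)
    if response.status_code == 304:
        return None  # Server says nothing changed; skip parsing entirely.

    # Remember these for the next conditional request.
    new_last_modified = response.headers.get('Last-Modified')
    new_etag = response.headers.get('ETag')

    feed = feedparser.parse(response.text)
    return feed, new_last_modified, new_etag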

To deal with this specific issue of receiving duplicates despite the Last-Modified header, I always hash the contents of the received feed (the response body) and save it to the database. On every fetch I compare the hash of the received feed against the saved hash, and only continue parsing the feed if the hash has changed. I then do the same for each entry, by hashing the link, title, and content; only saving or updating the entry if the hash is new.
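A rough sketch of that hashing scheme, with the database replaced by in-memory stand-ins (saved_feed_hash, seen_entry_hashes); the entry summary stands in for the content field here, and all names are illustrative:

import hashlib

def sha256(text):
    return hashlib.sha256(text.encode('utf-8')).hexdigest()

def new_entries(response_body, parsed_feed, saved_feed_hash, seen_entry_hashes):
    # Skip the whole feed if the response body hashes the same as last time.
    feed_hash = sha256(response_body)
    if feed_hash == saved_feed_hash:
        return [], feed_hash

    # Otherwise keep only entries whose link/title/summary hash is new.
    fresh = []
    for entry in parsed_feed.entries:
        entry_hash = sha256(
            entry.get('link', '') + entry.get('title', '') + entry.get('summary', '')
        )
        if entry_hash not in seen_entry_hashes:
            seen_entry_hashes.add(entry_hash)
            fresh.append(entry)
    return fresh, feed_hash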

buhtz commented 5 years ago

My real-life solution uses aiohttp (asynchronous). This works fine but did not resolve the problem. It does not matter whether I use the ETag/modified date with feedparser or aiohttp. My problem is not about parsing (which happens on the client side) but about fetching. I do not want to fetch old entries.

The solution with the hashes is exactly what I do not want to use, because it is disrespectful to the resources of the user and all other internet components (servers, etc.). It is the fault of the newsfeed hoster/creator, so they should feel the pain, not me as a developer nor my users.

But again: the main question is how do I identify such misbehaving feeds? I already implemented some check/test routines that identify some special weird* feeds. So for some feeds I really do the hash thing, but not for all. ~90 % of thousands of tested feeds do it right with ETag/modified date. For the others I have to fetch the complete content every time and compare it (e.g. via hash) on the client side.

I know that all other big feedreader solutions work fine with such feeds. That is because they ignore the ETag/modified feature, fetch everything every time, and do the entry comparison on the client side, wasting resources. I am strict on that point!

How do I know that the Heise feeds are misbehaving? My test is simple: an initial fetch (68 entries), then after some milliseconds a second fetch with the modified date of the initial one, resulting in 0 entries. Fine, it works! I do not want to test again after 10~15 minutes. ;)

A spontaneous idea: I do these tests for new feeds and repeat them every 3 months. An idea could be to do an entry-to-entry comparison (e.g. via hash) for the first e.g. 10 fetches, to check whether my initial test (which concluded that the modified date works fine) is valid or not.
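A sketch of that re-validation idea, with all names illustrative: keep hashing entries for the first few conditional fetches and flag the feed if a non-empty response contains only entries that were already seen.

import hashlib

def entry_hash(entry):
    raw = entry.get('link', '') + entry.get('title', '')
    return hashlib.sha256(raw.encode('utf-8')).hexdigest()

def conditional_fetch_ok(parsed_feed, seen_hashes):
    # Return True if the conditional fetch behaved, False if it looks broken.
    if not parsed_feed.entries:
        return True  # Empty response to a conditional request: as expected.

    hashes = {entry_hash(e) for e in parsed_feed.entries}
    if hashes <= seen_hashes:
        # Non-empty response, but every entry is a duplicate: the server is
        # ignoring the modified date, like the Heise feed in this report.
        return False

    seen_hashes.update(hashes)
    return True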

*weird feeds: My test routine for newly added newsfeeds tries to find out whether they support ETag, modified date, or nothing of the kind (resulting in an entry comparison on the client side). One weird thing is that some feeds offer an ETag or a modified date, but it still does not work. So just checking for the existence of an ETag is not enough! While testing thousands of feeds I found some other really weird behaviours, but I can handle them. But this Heise feed freaks me out. I have no good idea how to identify such behaviour.

DBeath commented 5 years ago

Given all that, it sounds like there isn't really a solution that you're going to be happy with. You can implement all the fancy detection algorithms you like, but you will never get 100% coverage for every wrong thing that's possible to do with feeds. That's why as you said most large feedreader solutions just read everything on a schedule. If the website has a problem with the load from lots of readers then it's up to them to deal with it, by fixing their last-modified and etag handling, or using a cache, or something else.

It sounds like you're developing something that's used by individual users, not a centralised service. If that's the case, then the only thing I can really suggest for this kind of situation is that if you detect that a feed is poorly performing in this way, then have the clients subscribe to the feed using Websub. If the feed doesn't support Websub, then maybe you'll just have to poll the feed from a central service yourself, and then provide your own hub, or use a service that already does this such as Superfeedr.

As for how to detect if a feed is performing in this weird way, what you suggested might just work.

buhtz commented 5 years ago

Given all that, it sounds like there isn't really a solution that you're going to be happy with. You can implement all the fancy detection algorithms you like, but you will never get 100% coverage for every wrong thing that's possible to do with feeds. That's why as you said most large feedreader solutions just read everything on a schedule.

The goal of my approach is not to fix or work around "every wrong thing that's possible". The point is that the symptoms are not that numerous. There are 1000 things that can go wrong, but the symptoms (e.g. duplicate feed entries) are only about 10.

The approach is to detect the symptoms, not the technical problems.

And if they are detected, I need to feed that back to the user: "This feed is broken. Contact the owner." etc. The user needs to understand that the feed owner did something wrong, not the feedreader itself.

You could call this user empowerment.

If the website has a problem with the load from lots of readers

That is just one of the resource problems. Imagine the CO2 footprint of lazy RSS solutions. It is my responsibility as a developer (and in many other roles) to stand against things like that.

buhtz commented 5 years ago

Heise.de (the feed creator) reacted to my mails and apologized for the late reply.

The team is working on the problem right now. I am also trying to establish a dialog about the root of the problem. The main question is whether this can happen to other feeds, too.

DBeath commented 5 years ago

The approach is to detect the symptoms, not the technical problems. And if they are detected, I need to feed that back to the user: "This feed is broken. Contact the owner." etc. The user needs to understand that the feed owner did something wrong, not the feedreader itself.

This is not what you seemed to ask for in the original post. I interpreted your original post as asking for a workaround for the feed returning duplicates because it ignored the Last-Modified header after 10 or so minutes.

I'm not unsympathetic to your goal of reducing CO2 because of polling, but again that was not one of the criteria from your original post. My suggestion for that is to use Websub as much as possible, to have only a single source polling the feed.

Again, to detect if the feed is behaving poorly in this instance, the algorithm you suggested seems to make sense. One other thing you might try, to avoid downloading the response, is to send an HTTP HEAD request to the feed. If the server is well-behaved, then it should send the Content-Length header back. The chance of the Content-Length header being the same when the content has changed is very low. Of course, if the server isn't respecting the Last-Modified header, then there's a good chance it won't properly handle a HEAD request either.
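A small sketch of that HEAD idea, assuming the server answers HEAD requests correctly; the function name and the stored previous_length are illustrative:

import requests

def content_length_changed(url, previous_length):
    # Compare Content-Length from a HEAD request against the last known value.
    response = requests.head(url, timeout=30)
    length = response.headers.get('Content-Length')
    if length is None:
        return True  # No header: fall back to a full fetch to be safe.
    return int(length) != previous_length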

Heise.de (the feed creator) reacted to my mails and apologized for the late reply.

That's good to hear.

buhtz commented 5 years ago

You are absolutely right. My initial post was not well thought out. It was more of a quick-and-dirty thinking-out-loud posting. My apologies for this!

I did not fully understand your HTTP HEAD approach but it sounds attractive. I will do some experiments with this.

kurtmckee commented 5 years ago

@Codeberg-AsGithubAlternative-buhtz I appreciate your desire to reduce CO2 emissions. Continue working on that!

As you noted, this is not a feedparser-specific question. Please do not open issue reports to ask general questions like this. Thanks!