kurtmckee / feedparser

Parse feeds in Python
https://feedparser.readthedocs.io

feedparser.parse() does not return, causing my PTB job to be stuck #263

Closed furiousxk closed 3 years ago

furiousxk commented 3 years ago

Hi,

I have a small Python bot which scans RSS feeds on an interval. Every N seconds a job is triggered to iterate over the feeds saved in a sqlite3 database and fetch each feed; it then checks whether the DB already has the feed message and, if not, broadcasts it over Telegram.

For quite some time now I've had to reboot the bot at irregular intervals: after a while feedparser.parse() seems to stop returning, causing the job to be forever pending.

It took me quite some time to figure out that it's feedparser that's not returning. At first I thought it was some I/O issue related to sqlite3, and since the bot also runs in a Docker container I assumed it could be related to that, but it's neither.

Please see the code snippet from jobs.py below. In the snippet, db.get_all_feeds() returns a list of tuples where tuple[0] == feed_name and tuple[1] == feed_url.

def rss_monitor(context):
    feeds = db.get_all_feeds()
    for feed in feeds:
        preview = db.get_preview(feed[0])
        ...  # Here we check whether the feed requires a cookie; if so, append it to headers
        rss = feedparser.parse(feed[1], request_headers=headers)  # <- THIS LINE DOES NOT RETURN AFTER N ITERATIONS
        if rss.status == 200:
            # Process feed, check if message exists in database and if not, broadcast it over telegram.
            ...
        else:
            logger.error('Could not fetch feed: ' + feed[1])
            logger.error('Feed HTTP response_code: ' + str(rss.status))

N is dynamic; I cannot reproduce this for a given number. Sometimes the job fails after 10h, sometimes after 15h, and sometimes it works fine for 24h.

I am using feedparser==6.0.2, which as far as I know is the latest version. Is there anything I can do to make feedparser throw an error, or any hint as to why it is no longer returning? If any additional information is required I will gladly supply it.

kurtmckee commented 3 years ago

Hi @furiousxk! This currently doesn't look like a bug in feedparser, but let's double-check.

every N seconds the job is triggered

This line catches my eye. Many popular websites do not allow their servers to be hammered repeatedly with requests. For example, LiveJournal's policy in the past was that only a certain number of requests could be issued per day.

        if rss.status == 200:
            # Process feed, check if message exists in database and if not, broadcast it over telegram.
        else:
            logger.error('Could not fetch feed: ' + feed[1])
            logger.error('Feed HTTP response_code: ' + str(rss.status))

What is the HTTP status that is returned by the various websites?

furiousxk commented 3 years ago

Hello Kurt,

While I agree that most popular sites don't allow themselves to be hammered with requests, I would then expect the response to carry a non-200 status code, which does not appear to be happening. Right now the implementation is running at logging.INFO, so I don't have my debug logs at hand; as soon as I do I will share them here.

Furthermore, I understand you say this does not appear to be a feedparser issue, yet the people from python-telegram-bot claim the same thing, and I'm stuck with the problem. As for the other question: it's not some absurd interval. Every 60 seconds the bot fetches all of its feeds (around 25) and iterates over them. At some point feedparser just does not return, and I do not understand why. I would expect it to either return or raise an exception.

As soon as I can provide more information I will. Is there anything I can do in the meantime to get more logs out of feedparser?

Thanks in advance. Kind regards

kurtmckee commented 3 years ago

feedparser just does not return

I misunderstood. You're saying that the parse() call isn't returning at all, and your software hangs indefinitely?

Feedparser doesn't support a timeout for historical reasons, and I'm intending to rip out the HTTP client code soon because it was all written 20 years ago when the world of URL fetching in Python was less feature-ful. Now that libraries like requests exist, I generally recommend that developers use requests to fetch their feed documents and pass the contents in to feedparser.

Please try using requests and passing the feed contents to feedparser. That may help resolve this issue.
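
Something along these lines (just a sketch, not a drop-in: feed_url, headers, and the 10-second timeout are placeholders for your own values):

import feedparser
import requests

# Fetch the document ourselves with an explicit timeout so a stalled
# server cannot hang the job, then let feedparser do the parsing only.
response = requests.get(feed_url, headers=headers, timeout=10)
response.raise_for_status()
result = feedparser.parse(response.content)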

furiousxk commented 3 years ago

Hi Kurt,

You're saying that the parse() call isn't returning at all, and your software hangs indefinitely?

Yes, that's exactly what's happening: at some point, on a non-static timeframe, feedparser.parse() just does not return, causing the job to hang indefinitely. The internal mechanism of python-telegram-bot does allow me to set a maximum number of instances of the job that can run, but changing this from the default (1) to 2/3/4 does not change the issue; it just runs until it hits the maximum and then hangs indefinitely again.
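
For reference, setting that cap with plain APScheduler (which python-telegram-bot uses internally) looks roughly like this; rss_monitor and the 60-second interval are from my setup above, the rest is a sketch:

from apscheduler.schedulers.background import BackgroundScheduler

scheduler = BackgroundScheduler()
# max_instances caps how many runs may overlap; once every allowed
# instance is stuck inside feedparser.parse(), no new runs start.
scheduler.add_job(rss_monitor, 'interval', seconds=60, max_instances=2)
scheduler.start()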

I found some debug logs, I will provide them here later during the day.

In the meantime I will look into using requests and passing the content to feedparser; I will update soon. Thanks in advance, kind regards

furiousxk commented 3 years ago

Hi Kurt,

Here to provide a small update. I haven't been able to resolve the issue so far. I've taken your recommendation to use requests and now check the HTTP status code before passing the content to feedparser.parse(), but still nothing. Please see the code below (the sample is stripped of any logger entries):

feeds = db.get_all_feeds()
for feed in feeds:
    response = requests.get(feed[1])
    if response.status_code == 200:
        if response.content is not None:
            rss = feedparser.parse(response.content)  # <- THIS LINE DOES NOT RETURN AFTER N ITERATIONS
            for message in rss.entries:
                exists = db.check_if_feed_message_exists(feed[0], message['link'])
                if not exists:
                    db.insert_feed_message(feed[0], message['link'])
                    ...  # process message to send to telegram
        else:
            ...  # log that response.content is empty
    else:
        ...  # log that response.status_code != 200

I am pretty confused about this whole issue and manual intervention to restart the telegram bot is starting to get tedious.

python-telegram-bot uses APScheduler internally, which lets me add a max_instances kwarg to the job_queue, but this does not help either. At a given point in time feedparser.parse() does not return, a second instance is scheduled and executed until feedparser.parse() does not return again, which by now requires manual intervention (or requires me to program a periodic scheduler check that detects hanging jobs and terminates/reschedules them, which sounds like a bad idea since this should work just fine).

I would like to resolve this issue once and for all. Wrapping feedparser.parse() in a try/except block does not help. I have my logger set to DEBUG, but literally nothing is happening: the line before feedparser.parse() gets logged and execution just never continues. Is there anything else I can do to debug this issue further?
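
One idea I may try in the meantime, using the standard library's faulthandler to see where the process is actually stuck (a sketch; the 300-second delay is arbitrary):

import faulthandler
import sys

# Periodically dump every thread's stack trace to stderr; if the job
# hangs, the last dump shows exactly which call never returned.
faulthandler.dump_traceback_later(300, repeat=True, file=sys.stderr)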

Thanks in advance, Kind regards.

kurtmckee commented 3 years ago

Okay, it seems this is unrelated to feedparser's HTTP client code.

Would you dump the following information to a file immediately before calling feedparser.parse():

* the URL being fetched
* the HTTP status code and response headers
* the raw response content

It will be very helpful to understand what the server is responding with, as well as the content that feedparser is choking on.
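
Something like this right before the parse() call would capture it (a rough sketch reusing the names from your latest snippet; the filename is arbitrary):

# Rough sketch: record exactly what the server returned before parsing.
with open('feed-debug.log', 'ab') as f:
    f.write(f'URL: {feed[1]}\n'.encode())
    f.write(f'Status: {response.status_code}\n'.encode())
    f.write(f'Headers: {dict(response.headers)}\n'.encode())
    f.write(response.content + b'\n\n')
rss = feedparser.parse(response.content)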

Thank you for your help to investigate this!

kurtmckee commented 3 years ago

@furiousxk it's possible that this is related to the regular expressions that are used in sgmllib for parsing broken XML. Perhaps there's a pathological backtracking issue in there.
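
For illustration, here's the classic shape of the problem (a generic example, not the actual expression sgmllib uses):

import re
import time

# Nested quantifiers force the engine to try exponentially many ways to
# split the input before concluding there is no match.
pattern = re.compile(r'(a+)+$')

for n in (15, 20, 25):
    start = time.perf_counter()
    pattern.match('a' * n + 'b')  # never matches; runtime roughly doubles per extra character
    print(f'n={n}: {time.perf_counter() - start:.3f}s')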

If you're able to dump the requested info before calling feedparser.parse() it may help identify the root cause of this issue.

kurtmckee commented 3 years ago

@furiousxk, please post the requested information when you have an opportunity. This may help with figuring out what's going on here. Thanks!

jiamo commented 2 years ago

Hi. I got a job hang, and pressing Ctrl-C produced this traceback:

  File "/home/ubuntu/.cache/pypoetry/virtualenvs/news-Zy8XHzni-py3.10/lib/python3.10/site-packages/feedparser/api.py", line 216, in parse
    data = _open_resource(url_file_stream_or_string, etag, modified, agent, referrer, handlers, request_headers, result)
  File "/home/ubuntu/.cache/pypoetry/virtualenvs/news-Zy8XHzni-py3.10/lib/python3.10/site-packages/feedparser/api.py", line 115, in _open_resource
    return http.get(url_file_stream_or_string, etag, modified, agent, referrer, handlers, request_headers, result)
  File "/home/ubuntu/.cache/pypoetry/virtualenvs/news-Zy8XHzni-py3.10/lib/python3.10/site-packages/feedparser/http.py", line 171, in get
    f = opener.open(request)
  File "/home/ubuntu/.pyenv/versions/3.10.0/lib/python3.10/urllib/request.py", line 519, in open
    response = self._open(req, data)
  File "/home/ubuntu/.pyenv/versions/3.10.0/lib/python3.10/urllib/request.py", line 536, in _open
    result = self._call_chain(self.handle_open, protocol, protocol +
  File "/home/ubuntu/.pyenv/versions/3.10.0/lib/python3.10/urllib/request.py", line 496, in _call_chain
    result = func(*args)
  File "/home/ubuntu/.pyenv/versions/3.10.0/lib/python3.10/urllib/request.py", line 1377, in http_open
    return self.do_open(http.client.HTTPConnection, req)
  File "/home/ubuntu/.pyenv/versions/3.10.0/lib/python3.10/urllib/request.py", line 1352, in do_open
    r = h.getresponse()
  File "/home/ubuntu/.pyenv/versions/3.10.0/lib/python3.10/http/client.py", line 1368, in getresponse
    response.begin()
  File "/home/ubuntu/.pyenv/versions/3.10.0/lib/python3.10/http/client.py", line 317, in begin
    version, status, reason = self._read_status()
  File "/home/ubuntu/.pyenv/versions/3.10.0/lib/python3.10/http/client.py", line 278, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/home/ubuntu/.pyenv/versions/3.10.0/lib/python3.10/socket.py", line 705, in readinto
    return self._sock.recv_into(b)
KeyboardInterrupt

Do we need a timeout arg for parse()?
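
One possible stopgap (an assumption on my part, untested): the traceback shows the hang in a socket read inside urllib, and connections created by urllib inherit Python's process-wide default socket timeout. A minimal sketch:

import socket

import feedparser

# Every socket created afterwards inherits this timeout, including the
# ones opened by feedparser's urllib-based HTTP client.
socket.setdefaulttimeout(30)

result = feedparser.parse('https://example.com/feed.xml')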

dsaltyfrere commented 2 years ago

Hi @kurtmckee,

My apologies for only providing an update now. In the meantime, I've switched to raw requests and use xmltodict to parse what I need; it works like a charm and there are no hangs anymore.
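
A simplified sketch of what that looks like now (feed_url is a placeholder, and the rss/channel/item path assumes a standard RSS 2.0 document):

import requests
import xmltodict

response = requests.get(feed_url, timeout=10)
response.raise_for_status()

# In RSS 2.0 the entries live under rss -> channel -> item.
doc = xmltodict.parse(response.content)
for item in doc['rss']['channel']['item']:
    print(item['title'], item['link'])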

If needed, I can fork my existing implementation and revert to feedparser; I have no idea how often this issue occurs for others.

To summarize: I have a python-telegram-bot application using APScheduler to schedule a repeating job which fetches hundreds of RSS feeds. After some amount of time, feedparser.parse(rss_endpoint) does not return, causing APScheduler not to repeat the job, as it sees an already running instance. The only solution is to restart the container/application.

It's been a while since I've reproduced this, but at the time I couldn't pin it on python-telegram-bot or its internal scheduler, APScheduler.

Please reiterate the information you need from my side to properly debug this; once I have forked my existing implementation and reintroduced feedparser, I will share an MVP which reproduces this.

edit: I've commented with a different account than the one I created the thread with; just confirming here that I am indeed the OP. If needed I can provide proof.

kurtmckee commented 2 years ago

@jiamo No timeout arguments will be added to feedparser. Its HTTP client code will be replaced with the requests module in the future as an optional dependency.

kurtmckee commented 2 years ago

@dsaltyfrere Thanks for the summary. Any chance that you're using a timeout on the requests that you're sending?

furiousxk commented 2 years ago

Hi Kurt,

I can't immediately recall whether I had a requests timeout configured then (although this may be more down to the elapsed time than to my recollection). Back then the app wasn't properly managed under source control, and I only found parts of my original feedparser implementation. In the meantime, I've built the small POC below. The crash/hang hasn't happened yet; as it's hard to reproduce, I'm just letting it run until it crashes. Any hints or additional things you'd like to see in the output?

import feedparser, requests, logging, time

logger = logging.getLogger()
logging.basicConfig(format='%(levelname)s - %(asctime)s - %(message)s',
                    level=logging.INFO)
feeds = [
    "news",
    "politics",
    "music",
    "funny",
    "comedy",
    "games",
    "movies"
    "python"
]
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux i686; rv:95.0) Gecko/20100101 Firefox/95.0'
}

while True:
    for feed in feeds:
        response = requests.get(f"https://www.reddit.com/r/{feed}/.rss", headers=headers)
        logger.info(f"response.status_code: {response.status_code}")
        if response.status_code == 200:
            logger.info(f"Parsing {feed}")
            try:
                rss = feedparser.parse(response.content)

                for message in rss.entries:
                    logger.info(f"{message['title']} | {message['link']}")
            except Exception as exception:
                logger.error(exception, exc_info=True)
    time.sleep(30)

edit: Added try/except block.

gety9 commented 1 year ago

@furiousxk

if rss.status == 200:
    # Process feed, check if message exists in database and if not, broadcast it over telegram.

Could you please tell me how you decide whether an entry is already in the database (i.e. already read)? Do you simply look at the URL?

(not related to the issue, just trying to learn)

furiousxk commented 1 year ago

@furiousxk

if rss.status == 200:
    # Process feed, check if message exists in database and if not, broadcast it over telegram.

Could you please tell me how you decide whether an entry is already in the database (i.e. already read)? Do you simply look at the URL?

(not related to the issue, just trying to learn)

Hi,

This project is no longer under active development from my side. The issue itself I mitigated by parsing the XML myself.

To answer your question: this project used peewee with SQLite, via the SqliteQueueDatabase provided by playhouse.sqliteq. It used a simple lookup with peewee's get_or_none, based on the RSS entry's link and title.

...
e = Entry.get_or_none(Entry.link == message['link'], Entry.title == message['title'])
if e is None:
    ...

edit: I should mention that checking the link is probably sufficient; my client had multiple RSS feeds defined which sometimes returned the same article from different sources, hence the additional check on title.
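
For completeness, a minimal sketch of what such a model could look like (the field names match the lookup above; everything else is illustrative):

from peewee import Model, TextField
from playhouse.sqliteq import SqliteQueueDatabase

# SqliteQueueDatabase serializes writes through a background thread,
# which keeps sqlite3 happy when jobs run concurrently.
db = SqliteQueueDatabase('feeds.db')

class Entry(Model):
    link = TextField(index=True)
    title = TextField()

    class Meta:
        database = db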

gety9 commented 1 year ago

@furiousxk thank you