Closed: furiousxk closed this issue 3 years ago.
Hi @furiousxk! This currently doesn't look like a bug in feedparser, but let's double-check.
every N seconds the job is triggered
This line catches my eye. Many popular websites do not allow their sites to be hammered repeatedly with requests. For example, LiveJournal's policy in the past was that only X number of requests could be issued per day.
if rss.status == 200:
    # Process feed, check if message exists in database and if not, broadcast it over telegram.
else:
    logger.error('Could not fetch feed: ' + feed[1])
    logger.error('Feed HTTP response_code: ' + str(rss.status))
What is the HTTP status that is returned by the various websites?
Hello Kurt,
While I agree that many popular sites don't allow themselves to be hammered with requests, I would expect the response to contain a non-200 status code in that case, which does not appear to be happening. Right now the implementation is running at logging.INFO, so I don't have my debug logs at hand; as soon as I do, I will share them here.
Furthermore, I understand you say this does not appear to be a feedparser issue, yet the people from python-telegram-bot claim the same and I'm stuck with the problem. As for the other question: it's not some absurd interval. Every 60 seconds the bot fetches all of its feeds, around 25 of them, and iterates over them. At some point feedparser just does not return, and I do not understand why; I would expect either an exception or a non-200 status code.
As soon as I can provide more information I will. Is there anything I can do in the meantime to get more logs out of feedparser?
Thanks in advance. Kind regards
feedparser just does not return
I misunderstood. You're saying that the parse() call isn't returning at all, and your software hangs indefinitely?
Feedparser doesn't support a timeout for historical reasons, and I'm intending to rip out the HTTP client code soon because it was all written 20 years ago when the world of URL fetching in Python was less feature-ful. Now that libraries like requests exist, I generally recommend that developers use requests to fetch their feed documents and pass the contents in to feedparser.
Please try using requests and passing the feed contents to feedparser. That may help resolve this issue.
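A minimal sketch of that approach (the URL and timeout value are placeholders):

import feedparser
import requests

# Fetch with an explicit timeout so a stalled server can't hang the job,
# then hand only the document body to feedparser for parsing.
try:
    response = requests.get("https://example.com/feed.xml", timeout=10)
    response.raise_for_status()
except requests.RequestException as exc:
    print(f"fetch failed: {exc}")
else:
    rss = feedparser.parse(response.content)
    for entry in rss.entries:
        print(entry.get("title"), entry.get("link"))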
Hi Kurt,
You're saying that the parse() call isn't returning at all, and your software hangs indefinitely?
Yes, that's exactly what's happening: at some point in time, on a non-static interval, feedparser.parse() just does not return, causing the job to hang indefinitely. The internal mechanism of python-telegram-bot does allow me to set a maximum number of times the job can run concurrently, but changing this from the default (1) to 2/3/4 does not change the issue; it just runs until it hits the maximum and then hangs indefinitely again.
I found some debug logs; I will provide them here later today.
In the meantime I will look into using requests and passing the content to feedparser. I will update soon. Thanks in advance, kind regards.
Hi Kurt,
Here's a small update.
I haven't been able to resolve the issue so far. I've taken your recommendation to use requests and now check the HTTP status code before passing the content to feedparser.parse(), but still nothing.
Please see the code below (the sample is stripped of any logger entries):
feeds = db.get_all_feeds()
for feed in feeds:
    response = requests.get(feed[1])
    if response.status_code == 200:
        if response.content is not None:
            rss = feedparser.parse(response.content)  # <- THIS LINE DOES NOT RETURN AFTER N ITERATIONS
            for message in rss.entries:
                exists = db.check_if_feed_message_exists(feed[0], message['link'])
                if not exists:
                    db.insert_feed_message(feed[0], message['link'])
                    # process message to send to telegram
        else:
            pass  # log that response.content is empty
    else:
        pass  # log that response.status_code != 200
I am pretty confused about this whole issue and manual intervention to restart the telegram bot is starting to get tedious.
python-telegram-bot uses APScheduler internally, which enables me to add a max_instances kwarg to the job_queue, but this does not help either: at a given point in time feedparser.parse() does not return, a second instance is scheduled and executed until feedparser.parse() does not return again, which by then requires manual intervention (or me to program a periodic scheduler check that detects hanging jobs and terminates/reschedules them, which sounds like a bad idea since this should work just fine).
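Roughly, the scheduling looks like this (a sketch with a placeholder job function, not my actual bot code):

from apscheduler.schedulers.background import BackgroundScheduler

def fetch_feeds():
    ...  # stand-in for the feed-fetching job described in this thread

# With the default max_instances=1, one hanging run blocks every later run;
# raising it only allows a few more hanging copies before scheduling stops again.
scheduler = BackgroundScheduler()
scheduler.add_job(fetch_feeds, 'interval', seconds=60, max_instances=3)
scheduler.start()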
I would like to resolve this issue once and for all. Wrapping feedparser.parse() in a try/except block does not help. I have my logger set to DEBUG but literally nothing is happening: the line before feedparser.parse() gets logged and it just never continues.
I would like to ask whether there is anything else I can do to debug this issue further.
Thanks in advance, Kind regards.
Okay, it seems this is unrelated to feedparser's HTTP client code.
Would you dump the following information to a file immediately before calling feedparser.parse():
It will be very helpful to understand what the server is responding with, as well as the content that feedparser is choking on.
Thank you for your help to investigate this!
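If it hangs again before you can capture that, the stdlib faulthandler module can show where a live process is stuck; a minimal sketch (the signal choice is arbitrary):

import faulthandler
import signal

# After this call, `kill -USR1 <pid>` makes the process print every thread's
# current Python stack to stderr without terminating it.
faulthandler.register(signal.SIGUSR1)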
@furiousxk it's possible that this is related to the regular expressions that are used in sgmllib for parsing broken XML. Perhaps there's a pathological backtracking issue in there.
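For illustration only (this is not sgmllib's actual expression), nested quantifiers show how that kind of backtracking can stall a parse:

import re
import time

# A classic catastrophic-backtracking pattern: each extra 'a' in the
# almost-matching input roughly doubles the time spent backtracking.
pattern = re.compile(r'(a+)+$')
start = time.monotonic()
pattern.match('a' * 28 + 'b')  # never matches; re backtracks exponentially
print(f"elapsed: {time.monotonic() - start:.1f}s")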
If you're able to dump the requested info before calling feedparser.parse(), it may help identify the root cause of this issue.
@furiousxk, please post the requested information when you have an opportunity. This may help with figuring out what's going on here. Thanks!
Hi. I got a job hang and pressed Ctrl-C, which produced:
File "/home/ubuntu/.cache/pypoetry/virtualenvs/news-Zy8XHzni-py3.10/lib/python3.10/site-packages/feedparser/api.py", line 216, in parse
data = _open_resource(url_file_stream_or_string, etag, modified, agent, referrer, handlers, request_headers, result)
File "/home/ubuntu/.cache/pypoetry/virtualenvs/news-Zy8XHzni-py3.10/lib/python3.10/site-packages/feedparser/api.py", line 115, in _open_resource
return http.get(url_file_stream_or_string, etag, modified, agent, referrer, handlers, request_headers, result)
File "/home/ubuntu/.cache/pypoetry/virtualenvs/news-Zy8XHzni-py3.10/lib/python3.10/site-packages/feedparser/http.py", line 171, in get
f = opener.open(request)
File "/home/ubuntu/.pyenv/versions/3.10.0/lib/python3.10/urllib/request.py", line 519, in open
response = self._open(req, data)
File "/home/ubuntu/.pyenv/versions/3.10.0/lib/python3.10/urllib/request.py", line 536, in _open
result = self._call_chain(self.handle_open, protocol, protocol +
File "/home/ubuntu/.pyenv/versions/3.10.0/lib/python3.10/urllib/request.py", line 496, in _call_chain
result = func(*args)
File "/home/ubuntu/.pyenv/versions/3.10.0/lib/python3.10/urllib/request.py", line 1377, in http_open
return self.do_open(http.client.HTTPConnection, req)
File "/home/ubuntu/.pyenv/versions/3.10.0/lib/python3.10/urllib/request.py", line 1352, in do_open
r = h.getresponse()
File "/home/ubuntu/.pyenv/versions/3.10.0/lib/python3.10/http/client.py", line 1368, in getresponse
response.begin()
File "/home/ubuntu/.pyenv/versions/3.10.0/lib/python3.10/http/client.py", line 317, in begin
version, status, reason = self._read_status()
File "/home/ubuntu/.pyenv/versions/3.10.0/lib/python3.10/http/client.py", line 278, in _read_status
line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
File "/home/ubuntu/.pyenv/versions/3.10.0/lib/python3.10/socket.py", line 705, in readinto
return self._sock.recv_into(b)
KeyboardInterrupt
Do we need a timeout arg for parse()?
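In the meantime, a process-wide default socket timeout looks like a possible stopgap, since the hang is inside the stdlib socket read that feedparser's urllib-based fetch goes through (the 10-second value here is arbitrary):

import socket
import feedparser

# Applies to every socket created afterwards, not just feedparser's;
# a stalled read then raises a timeout error instead of blocking forever.
socket.setdefaulttimeout(10)

rss = feedparser.parse("https://example.com/feed.xml")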
Hi @kurtmckee,
My apologies for only providing an update now.
In the meantime, I've switched to raw requests and use xmltodict to parse what I need; it works like a charm and there are no hangs anymore. If needed I can fork my existing implementation and revert to feedparser; I have no idea how often this issue occurs for others.
To summarize:
I have a python-telegram-bot using APScheduler to schedule a repeating job which fetches hundreds of RSS feeds. After some amount of time, feedparser.parse(rss_endpoint) does not return, causing APScheduler not to repeat the job because it sees an already-running instance. The only solution is to restart the container/application.
It's been a while since I've reproduced this, but at the time I couldn't pin it on python-telegram-bot or its internal scheduler, APScheduler.
Please reiterate the information you need from my side to properly debug this. Once I have forked my existing implementation and reintroduced feedparser, I will share an MVP that reproduces this.
edit: I've commented with a different account than the one I made the thread with; just confirming here that I am indeed OP. If needed I can provide proof.
@jiamo No timeout arguments will be added to feedparser. Its HTTP client code will be replaced with the requests module in the future as an optional dependency.
@dsaltyfrere Thanks for the summary. Any chance that you're using a timeout on the requests that you're sending?
Hi Kurt,
I can't immediately recall whether I had a requests timeout configured then (although this may be more related to the date than to my recollection of it). Back then the app wasn't properly managed under source control, and I could only find parts of my original feedparser implementation. In the meantime, I've built a small POC, below. The crash/hang hasn't happened yet; as it's hard to reproduce, I'm just letting it run until it crashes. Any hints or additional things you'd like to see in the output?
import feedparser, requests, logging, time

logger = logging.getLogger()
logging.basicConfig(format='%(levelname)s - %(asctime)s - %(message)s',
                    level=logging.INFO)

feeds = [
    "news",
    "politics",
    "music",
    "funny",
    "comedy",
    "games",
    "movies",
    "python"
]

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux i686; rv:95.0) Gecko/20100101 Firefox/95.0'
}

while True:
    for feed in feeds:
        response = requests.get(f"https://www.reddit.com/r/{feed}/.rss", headers=headers)
        logger.info(f"response.status_code: {response.status_code}")
        if response.status_code == 200:
            logger.info(f"Parsing {feed}")
            try:
                rss = feedparser.parse(response.content)
                for message in rss.entries:
                    logger.info(f"{message['title']} | {message['link']}")
            except Exception as exception:
                logger.error(exception, exc_info=True)
    time.sleep(30)
edit: Added try/except block.
@furiousxk

if rss.status == 200:
    # Process feed, check if message exists in database and if not, broadcast it over telegram.

Could you please tell me how you decide whether an entry is already in the database (already read)? Do you simply look at the URL?
(Not related to the issue, just trying to learn.)
Hi,
This project is no longer under active development from my side. I mitigated the issue itself by parsing the XML myself.
To answer your question: this project used SQLite3 with peewee and its SqliteQueueDatabase, provided by playhouse.sqliteq. It used a simple lookup with peewee's get_or_none, based on the RSS entry's link and title.
...
e = Entry.get_or_none(Entry.link == message['link'], Entry.title == message['title'])
if e is None:
    ...
edit: I should mention that checking the link is probably sufficient; my client had multiple RSS feeds defined which sometimes returned the same article from a different source, hence the additional check on the title.
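For completeness, a minimal model sketch that would support the lookup above (only link and title appear in the query; the database filename and field types are assumptions):

from peewee import Model, TextField
from playhouse.sqliteq import SqliteQueueDatabase

db = SqliteQueueDatabase('bot.db')  # hypothetical filename

class Entry(Model):
    link = TextField()
    title = TextField()

    class Meta:
        database = db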
@furiousxk thank you
Hi,
I have a small Python bot which scans RSS feeds on an interval: every N seconds a job is triggered to iterate over the feeds saved in a sqlite3 database and fetch each feed. It then checks whether the DB already has the feed message and, if not, broadcasts it over Telegram.
For quite some time now I've had to reboot the bot at irregular intervals: after a while it seems that feedparser.parse() no longer returns, causing the job to be forever pending.
It took me quite some time to figure out that it's feedparser that's not returning. At first I thought it was some I/O issue related to sqlite3; the bot also runs in a Docker container and I assumed it could be related to that, but it's neither.
Please see the code snippet of jobs.py below. In the snippet, db.get_all_feeds() returns a list of tuples where tuple[0] == feed_name and tuple[1] == feed_url. N is dynamic; I cannot reproduce this for a given number. Sometimes the job fails after 10h, sometimes after 15h, and sometimes it works fine for 24h.
I am using feedparser==6.0.2, which is as far as I know the latest version of feedparser. Is there anything else I can do to make feedparser throw an error, or perhaps hint at why it is no longer returning? If any additional information is required, I will gladly supply it.