Open philbudne opened 1 month ago
I wonder if source/collection rescrape could be made to run as a manage.py
command (calling the task code) for easier debugging?
https://github.com/mediacloud/web-search/pull/817/files passes connect and read timeouts to the HTTP GET
operation, so hangs should be less likely.
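For reference, a minimal sketch of what passing both timeouts looks like with requests. This is not the code from the PR: the function name fetch, the injectable get argument, and the two timeout values are all hypothetical. The only library behavior relied on is that requests accepts timeout as a (connect, read) tuple.

```python
# Hypothetical values; the numbers actually used in the PR are not stated here.
CONNECT_TIMEOUT_SECONDS = 10
READ_TIMEOUT_SECONDS = 30


def fetch(url, get=None):
    """GET url with separate connect and read timeouts.

    `get` is a hypothetical injection point so one function could be shared
    by different callers (and faked in tests); it defaults to requests.get.
    """
    if get is None:
        import requests  # imported lazily so the sketch loads without requests installed
        get = requests.get
    # requests takes timeout=(connect, read). The read timeout bounds each
    # wait for data from the server, not the total time of the download.
    return get(url, timeout=(CONNECT_TIMEOUT_SECONDS, READ_TIMEOUT_SECONDS))
```

Note that the read timeout is not an overall deadline, so a server that keeps trickling data can still hold a request open for much longer than the configured value.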
In a test, scrapes of Nigeria State & National & North Carolina finished, but I didn't get an email for Nigeria, so it seems there may still be issues to chase! I ran the Nigeria scrape from the command line using manage.py
(took 8 hours!) and got an email, so it may be hard to reproduce!!
I noticed this in the logs of the worker container (note: current date is 2024-09-18):
Which suggests rescrape is hanging?
It also looks like each connection (re)try is taking 11 minutes?!
mcweb's call to feed_seeker is:
new_feed_generator = feed_seeker.generate_feed_urls(homepage, max_time=SCRAPE_TIMEOUT_SECONDS)
and SCRAPE_TIMEOUT_SECONDS defaults to 30.
I would also like to have feed scraping, rss-fetching, and article-fetching all use the same requests settings (see https://github.com/mediacloud/metadata-lib/issues/88)
feed_seeker DOES allow supplying a fetcher function when creating a FeedSeeker object, but generate_feed_urls does not.
BUT it seems like feed_seeker is a Media Cloud project! (I did not know that!!)
It seems like there is a known issue with the timeout parameter: https://github.com/mediacloud/feed_seeker/issues/2
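Until that is sorted out, one possible caller-side stopgap is to put our own deadline around the generator. A rough sketch, assuming nothing about feed_seeker internals: take_until_deadline and its clock argument are hypothetical names, not part of feed_seeker or mcweb.

```python
import time


def take_until_deadline(items, max_seconds, clock=time.monotonic):
    """Yield from `items` until `max_seconds` have elapsed.

    The deadline is only checked between items, so this cannot interrupt
    a fetch that is blocked inside the wrapped generator.
    """
    deadline = clock() + max_seconds
    for item in items:
        yield item
        if clock() >= deadline:
            # Stop asking the wrapped generator for more work.
            break
```

In mcweb this could wrap the generator returned by feed_seeker.generate_feed_urls(homepage, max_time=SCRAPE_TIMEOUT_SECONDS), with SCRAPE_TIMEOUT_SECONDS as max_seconds. Since it only stops between yielded URLs, it complements per-request timeouts rather than replacing them.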