mediacloud / web-search

Code that drives the public web-based tools for the Media Cloud Online News Archive and Directory.
https://search.mediacloud.org
Apache License 2.0

more (re)scrape work needed? #791

Open philbudne opened 1 month ago

philbudne commented 1 month ago

I noticed this in the logs of the worker container (note: current date is 2024-09-18):

2024-09-16T21:44:29.615724832Z ==== starting _scrape_source(42118, https://www.itv.com/news/topic/nigeria, itv.com)
2024-09-16T21:44:29.621624096Z add_line: Scraped source 42118 (itv.com), https://www.itv.com/news/topic/nigeria
2024-09-16T21:44:29.622858689Z Starting new HTTPS connection (1): www.itv.com:443
2024-09-16T21:44:59.622339377Z Incremented Retry for (url='/news/topic/nigeria'): Retry(total=4, connect=None, read=None, redirect=None, status=None)
2024-09-16T21:44:59.622713877Z Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ReadTimeoutError("HTTPSConnectionPool(host='www.itv.com', port=443): Read timed out. (read timeout=None)")': /news/topic/nigeria
2024-09-16T21:44:59.623125050Z Starting new HTTPS connection (2): www.itv.com:443
2024-09-16T21:56:42.969479886Z Incremented Retry for (url='/news/topic/nigeria'): Retry(total=3, connect=None, read=None, redirect=None, status=None)
2024-09-16T21:56:43.170047325Z Retrying (Retry(total=3, connect=None, read=None, redirect=None, status=None)) after connection broken by 'RemoteDisconnected('Remote end closed connection without response')': /news/topic/nigeria
2024-09-16T21:56:43.170361434Z Starting new HTTPS connection (3): www.itv.com:443
2024-09-16T22:08:25.437442087Z Incremented Retry for (url='/news/topic/nigeria'): Retry(total=2, connect=None, read=None, redirect=None, status=None)
2024-09-16T22:08:25.838354746Z Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'RemoteDisconnected('Remote end closed connection without response')': /news/topic/nigeria
2024-09-16T22:08:25.838934554Z Starting new HTTPS connection (4): www.itv.com:443

Which suggests rescrape is hanging?

It also looks like each connection retry after the first is taking almost 12 minutes?!
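To put numbers on that, here is a small sketch that computes how long each connection attempt lasted, using only timestamps copied from the log above. The helper names are mine, not anything in mcweb. By this arithmetic the first attempt gave up after about 30 seconds, while the next two each ran for roughly 11 minutes 43 seconds.

```python
from datetime import datetime


def parse_ts(ts: str) -> datetime:
    """Parse a Docker log timestamp, truncating nanoseconds to microseconds."""
    # The first 26 characters are YYYY-MM-DDTHH:MM:SS.ffffff
    return datetime.strptime(ts[:26], "%Y-%m-%dT%H:%M:%S.%f")


# (connection started, retry logged) pairs copied from the log excerpt
ATTEMPTS = [
    ("2024-09-16T21:44:29.622858689Z", "2024-09-16T21:44:59.622339377Z"),
    ("2024-09-16T21:44:59.623125050Z", "2024-09-16T21:56:42.969479886Z"),
    ("2024-09-16T21:56:43.170361434Z", "2024-09-16T22:08:25.437442087Z"),
]


def attempt_durations(attempts=ATTEMPTS):
    """Return the duration of each connection attempt in seconds."""
    return [
        (parse_ts(end) - parse_ts(start)).total_seconds()
        for start, end in attempts
    ]


if __name__ == "__main__":
    for n, secs in enumerate(attempt_durations(), start=1):
        print(f"connection {n}: {secs:.1f}s ({secs / 60:.1f} min)")
```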

mcweb's call to feed_seeker is: new_feed_generator = feed_seeker.generate_feed_urls(homepage, max_time=SCRAPE_TIMEOUT_SECONDS) and SCRAPE_TIMEOUT_SECONDS defaults to 30.

I would also like to have feed scraping, rss-fetching, and article-fetching all use the same requests settings (see https://github.com/mediacloud/metadata-lib/issues/88)

feed_seeker DOES allow supplying a fetcher function when creating a FeedSeeker object, but generate_feed_urls does not.

BUT it seems like feed_seeker is a Media Cloud project! (I did not know that!!)

It seems like there is a known issue with the timeout parameter: https://github.com/mediacloud/feed_seeker/issues/2

philbudne commented 1 month ago

I wonder if source/collection rescrape could be made to run as a manage.py command (calling the task code) for easier debugging?

philbudne commented 6 days ago

https://github.com/mediacloud/web-search/pull/817/files passes connect and read timeouts to the HTTP GET operation, so hangs should be less likely.
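For anyone following along, a minimal sketch of the general technique with requests: a (connect, read) timeout tuple applied by default to every request on a session, plus a bounded retry count. This is not the code from the PR; the constant values and the TimeoutAdapter and make_session names are hypothetical. Note that the read timeout in requests limits the wait between bytes received, not the total time for a request, so a slow-dripping server can still take longer overall.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Hypothetical values for illustration; not taken from mcweb settings
CONNECT_TIMEOUT_SECONDS = 5
READ_TIMEOUT_SECONDS = 30


class TimeoutAdapter(HTTPAdapter):
    """Transport adapter that applies a default timeout when none is given."""

    def __init__(self, *args, timeout=(CONNECT_TIMEOUT_SECONDS,
                                       READ_TIMEOUT_SECONDS), **kwargs):
        self.timeout = timeout
        super().__init__(*args, **kwargs)

    def send(self, request, **kwargs):
        # requests passes timeout=None unless the caller set one explicitly
        if kwargs.get("timeout") is None:
            kwargs["timeout"] = self.timeout
        return super().send(request, **kwargs)


def make_session(retries: int = 2) -> requests.Session:
    """Build a session with default timeouts and a capped retry count."""
    session = requests.Session()
    adapter = TimeoutAdapter(
        max_retries=Retry(total=retries, backoff_factor=0.5)
    )
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    return session
```

Sharing one such session factory between feed scraping, rss-fetching, and article-fetching would be one way to keep the requests settings identical across all three.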

In a test, scrapes of Nigeria State & National & North Carolina finished, but I didn't get an email for Nigeria, so it seems there may still be issues to chase! I then ran the Nigeria scrape from the command line using manage.py (took 8 hours!) and got an email, so it may be hard to reproduce!!