shadowmoose / RedditDownloader

Scrapes Reddit to download media of your choice.

Downloads hang after a while + a couple errors #156

Open parasiteoflife opened 4 years ago

parasiteoflife commented 4 years ago

Describe the bug

After RMD has been downloading for a while and has printed some errors (ytdl related) to the console, it stops downloading and scanning for posts. Every thread in the queue stays on Finished (I currently have only 3 enabled) and it does nothing: no I/O and no Internet activity.

Environment Info

Screenshots/Information

[Screenshot: opera_2020-07-06_12-25-34]

Additional context

This is what I got on the console:

YTDL: ERROR: giving up after 0 retries [https://redgifs.com/watch/delightfulincrediblehellbender]
YTDL: ERROR: unable to download video data: The read operation timed out [https://redgifs.com/watch/identicalfemininebarbet]
YTDL: ERROR: giving up after 0 retries [https://redgifs.com/watch/sizzlingslushycrossbill]
HTTPSConnectionPool(host='thcf1.redgifs.com', port=443): Read timed out.
HTTPSConnectionPool(host='thcf4.redgifs.com', port=443): Read timed out.
YTDL: ERROR: giving up after 0 retries [https://redgifs.com/watch/accuratebadballoonfish]
YTDL: ERROR: giving up after 0 retries [https://redgifs.com/watch/hardmadeupdiamondbackrattlesnake]
("Connection broken: ConnectionResetError(10054, 'Se ha forzado la interrupción de una conexión existente por el host remoto', None, 10054, None)", ConnectionResetError(10054, 'Se ha forzado la interrupción de una conexión existente por el host remoto', None, 10054, None))
YTDL: ERROR: giving up after 0 retries [https://redgifs.com/watch/hastyyearlykitten]
YTDL: ERROR: giving up after 0 retries [https://redgifs.com/watch/imperfectoddballchital]
YTDL: ERROR: giving up after 0 retries [https://redgifs.com/watch/ecstaticsmartbrownbear]
("Connection broken: ConnectionResetError(10054, 'Se ha forzado la interrupción de una conexión existente por el host remoto', None, 10054, None)", ConnectionResetError(10054, 'Se ha forzado la interrupción de una conexión existente por el host remoto', None, 10054, None))
('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
YTDL: ERROR: Unable to download JSON metadata: HTTP Error 404: Not Found (caused by <HTTPError 404: 'Not Found'>); please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; type  youtube-dl -U  to update. Be sure to call youtube-dl with the --verbose flag and include its complete output. [https://gfycat.com/OnlyIncompleteIcelandichorse]
HTTPSConnectionPool(host='mega.nz', port=443): Read timed out. (read timeout=10)
YTDL: ERROR: Unable to download JSON metadata: HTTP Error 404: Not Found (caused by <HTTPError 404: 'Not Found'>); please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; type  youtube-dl -U  to update. Be sure to call youtube-dl with the --verbose flag and include its complete output. [https://gfycat.com/powerlessgrotesqueatlanticspadefish]
YTDL: ERROR: Unable to download JSON metadata: HTTP Error 404: Not Found (caused by <HTTPError 404: 'Not Found'>); please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; type  youtube-dl -U  to update. Be sure to call youtube-dl with the --verbose flag and include its complete output. [https://gfycat.com/greedysomecondor]
HTTPConnectionPool(host='wp-content', port=80): Max retries exceeded with url: /uploads/fbrfg/apple-touch-icon.png (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x000001806784EDC8>: Failed to establish a new connection: [Errno 11001] getaddrinfo failed'))

The only links RMD (or ytdl) has problems with are the ones that redirect to gifdeliverynetwork or redgifs; every other image/video downloads fine, so I guess that's what makes RMD hang? If I close and reopen RMD, the same thing happens again: after a while everything stops.

shadowmoose commented 4 years ago

Interesting bug - failure to download shouldn't hang RMD like that. I'll look into it. Thanks for the detailed report.

parasiteoflife commented 4 years ago

Hey, any update on this? I can't download as it always stops at some point (seconds) after starting.

shadowmoose commented 4 years ago

Hey, I can't replicate this issue - none of these errors should interfere with the shutdown process. It's possible that RMD is simply taking a long time to finish re-scanning all the Sources it has.

My plan is to add some more UI feedback to help indicate which part of the process RMD may be sticking on.

parasiteoflife commented 4 years ago

Thanks, I'll wait for the next update. Hopefully that will help solve this problem.

EDIT: Btw, I have left RMD open for close to 12 hours straight with the screen looking like the pic posted before, and so far nothing has changed. No I/O or Internet activity, and the RMD console hasn't changed either.

NotCompsky commented 4 years ago

I get the same issue, however it is not youtube-dl related.

Loaded Source:  subreddits
Started downloader.
WARNI [prawcore] Retrying due to ReadTimeout(ReadTimeoutError("HTTPSConnectionPool(host='oauth.reddit.com', port=443): Read timed out. (read timeout=16)")) status: GET https://oauth.reddit.com/r/ImaginaryWitcher/new
Process RedditElementLoader:
Traceback (most recent call last):
File "multiprocessing/process.py", line 297, in _bootstrap
File "redditdownloader/processing/redditloader.py", line 30, in run
File "redditdownloader/processing/redditloader.py", line 51, in load
File "redditdownloader/processing/redditloader.py", line 65, in _scan_sources
File "redditdownloader/sources/subreddit_posts_source.py", line 16, in get_elements
File "redditdownloader/static/praw_wrapper.py", line 130, in subreddit_posts
File "redditdownloader/static/praw_wrapper.py", line 218, in _praw_apply_filter
File "site-packages/praw/models/listing/generator.py", line 62, in __next__
File "site-packages/praw/models/listing/generator.py", line 72, in _next_batch
File "site-packages/praw/reddit.py", line 497, in get
File "site-packages/praw/reddit.py", line 584, in _objectify_request
File "site-packages/praw/reddit.py", line 765, in request
File "site-packages/prawcore/sessions.py", line 339, in request
File "site-packages/prawcore/sessions.py", line 265, in _request_with_retries
prawcore.exceptions.Forbidden: received 403 HTTP response

It doesn't seem to be handling exceptions.
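To illustrate what I mean, here is a minimal sketch of a loader that survives a failing source. It is not RMD's actual code: `scan_sources`, `ForbiddenError`, `DONE` and the two demo sources are hypothetical names, and I use a thread instead of a separate process to keep it short. The assumption (unconfirmed) is that the downloaders wait on a queue fed by the loader, so if the loader dies on an uncaught exception they would wait forever.

```python
import queue
import threading

DONE = object()  # hypothetical sentinel: tells consumers no more items will arrive


class ForbiddenError(Exception):
    """Hypothetical stand-in for an HTTP 403 raised while listing posts."""


def scan_sources(sources, out_queue, errors):
    """Push every element from every source onto out_queue.

    A failure in one source is recorded and skipped, and the sentinel
    is always sent, even if something unexpected goes wrong.
    """
    try:
        for name, get_elements in sources:
            try:
                for element in get_elements():
                    out_queue.put(element)
            except Exception as exc:  # broad on purpose: one bad source must not kill the loader
                errors.append((name, repr(exc)))
    finally:
        out_queue.put(DONE)


def good_source():
    yield from ["post-1", "post-2"]


def failing_source():
    yield "post-3"
    raise ForbiddenError("received 403 HTTP response")


def run_demo():
    out_queue = queue.Queue()
    errors = []
    sources = [("bad", failing_source), ("good", good_source)]
    loader = threading.Thread(target=scan_sources, args=(sources, out_queue, errors))
    loader.start()
    received = []
    while True:
        # The timeout means a missing sentinel raises queue.Empty instead of blocking forever.
        item = out_queue.get(timeout=5)
        if item is DONE:
            break
        received.append(item)
    loader.join()
    return received, errors
```

With this shape, the 403 from one subreddit would be reported and the remaining sources would still be scanned, instead of the whole loader going down with the traceback above.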

I'm not sure OP's issue is youtube-dl related either, because their last exception is HTTPConnectionPool ... getaddrinfo failed')). It might also be difficult to reproduce, because a getaddrinfo failure is rare on most networks.

The scraper manages to download usually between 100 and 200 images each time before this error. I can simply restart the server and re-run the scraper to continue downloading.

Getting a 403 is a bit odd in the first place - it happens even if I restrict scraping to a single thread, but given that it gets results in the first place it can't really be an authorisation issue. I suspect it might be because there isn't enough time between downloads - the scraper may be assuming that few links will be from the i.redd.it domain, and consequently just assuming that rate limiting isn't necessary.
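If the lack of a pause between requests really is the cause (that is only my guess), a small throttle in front of each download might be enough. A rough sketch, standard library only; `MinIntervalThrottle` and its parameters are names I made up, not anything in RMD:

```python
import time


class MinIntervalThrottle:
    """Ensure consecutive calls to wait() are at least `interval` seconds apart."""

    def __init__(self, interval, clock=time.monotonic, sleep=time.sleep):
        self.interval = interval
        self._clock = clock    # injectable so the class can be tested without real delays
        self._sleep = sleep
        self._last = None      # time of the previous call, None before the first one

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Each download thread would call `throttle.wait()` before making a request. If several threads shared one throttle, `wait()` would also need a lock around it, which this sketch leaves out.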