maxjo020418 / BAScraper

A little (async) API wrapper for PullPush.io Reddit API with convenience
MIT License

Crashing when receiving some JSON #2

lemixtape opened this issue 1 week ago

lemixtape commented 1 week ago

It seems the API sometimes returns malformed JSON, and this is a huge problem for large subreddits: we have to restart the data collection and may never finish, since it keeps crashing along the way. Would it be possible to add better error handling so that the script keeps going when it hits these responses, and keeps a record of them in the logs?

Here are two examples.

Exception Group Traceback (most recent call last):
  File "C:\Users\admin_local\Desktop\mmm\bsa.py", line 9, in <module>
    result2 = asyncio.run(ppa.get_submissions(subreddit='researchchemicals',
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\admin_local\AppData\Local\Programs\Python\Python312\Lib\asyncio\runners.py", line 194, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "C:\Users\admin_local\AppData\Local\Programs\Python\Python312\Lib\asyncio\runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\admin_local\AppData\Local\Programs\Python\Python312\Lib\asyncio\base_events.py", line 687, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "C:\Users\admin_local\AppData\Local\Programs\Python\Python312\Lib\site-packages\BAScraper\BAScraper_async.py", line 181, in get_submissions
    comments = await self._get_link_ids_comments(submission_ids)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\admin_local\AppData\Local\Programs\Python\Python312\Lib\site-packages\BAScraper\BAScraper_async.py", line 325, in _get_link_ids_comments
    raise err
  File "C:\Users\admin_local\AppData\Local\Programs\Python\Python312\Lib\site-packages\BAScraper\BAScraper_async.py", line 310, in _get_link_ids_comments
    async with asyncio.TaskGroup() as tg:
    ^^^^^^^^^^^^^^^^^^^
  File "C:\Users\admin_local\AppData\Local\Programs\Python\Python312\Lib\asyncio\taskgroups.py", line 145, in __aexit__
    raise me from None
ExceptionGroup: unhandled errors in a TaskGroup (1 sub-exception)
+-+---------------- 1 ----------------
  Traceback (most recent call last):
    File "C:\Users\admin_local\AppData\Local\Programs\Python\Python312\Lib\site-packages\BAScraper\utils.py", line 153, in make_request
      result = await response.json()
               ^^^^^^^^^^^^^^^^^^^^^
    File "C:\Users\admin_local\AppData\Local\Programs\Python\Python312\Lib\site-packages\aiohttp\client_reqrep.py", line 1199, in json
      raise ContentTypeError(
  aiohttp.client_exceptions.ContentTypeError: 502, message='Attempt to decode JSON with unexpected mimetype: text/html', url='https://api.pullpush.io/reddit/search/comment/?link_id=10qlmx7'

  During handling of the above exception, another exception occurred:

  Traceback (most recent call last):
    File "C:\Users\admin_local\AppData\Local\Programs\Python\Python312\Lib\site-packages\BAScraper\BAScraper_async.py", line 305, in link_id_worker
      res.append(await make_request(self, 'comments', link_id=link_id))
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "C:\Users\admin_local\AppData\Local\Programs\Python\Python312\Lib\site-packages\BAScraper\utils.py", line 198, in make_request
      raise Exception(f'{coro_name} unexpected error: \n{err}')
  Exception: coro-0 unexpected error:
  502, message='Attempt to decode JSON with unexpected mimetype: text/html', url='https://api.pullpush.io/reddit/search/comment/?link_id=10qlmx7'

Here is another example:

maxjo020418 commented 1 week ago

I thought it would create log files and also a dump file (a temp file containing whatever it had fetched up to the crash) when it crashes. Did you check that?

Regardless, I did push a change to the repo to handle that error. It's not on PyPI yet, so you'll need to install it manually.

Please report back on how that goes :)

Also, pullpush.io has been having server trouble recently; maybe that has something to do with it.
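For context, here is a rough sketch of the kind of handling being discussed: catch aiohttp.ContentTypeError around response.json(), log the offending URL, and return an empty result so the run keeps going. The function and parameter names are illustrative, not BAScraper's actual internals.

```python
# Minimal sketch (not BAScraper's actual code) of tolerating HTML/502 responses.
import logging

import aiohttp

logger = logging.getLogger("scraper")


async def fetch_json(session: aiohttp.ClientSession, url: str, params: dict) -> list:
    """Return the decoded JSON 'data' list, or [] if the API returned non-JSON."""
    async with session.get(url, params=params) as response:
        try:
            payload = await response.json()
        except aiohttp.ContentTypeError:
            # PullPush occasionally answers with an HTML error page (e.g. a 502),
            # which aiohttp refuses to decode as JSON. Log it and move on.
            logger.warning("non-JSON response (status %s) from %s - skipping",
                           response.status, response.url)
            return []
        return payload.get("data", [])
```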

lemixtape commented 1 week ago

Even if it dumps what it has so far, is there a way to resume from where it last stopped? Thank you for the quick update!

maxjo020418 commented 1 week ago

Ah, thinking about it, that was an oversight on my part. I thought it could easily be resumed, but because it splits the requests up by time range, resuming gets messy...

Sorry, I don't think there's an easy way for now, but it's certainly not impossible.

If it uses 3 tasks (the default), it splits the time range into 3 segments, so it looks something like this (green: complete, red: failed): [image]
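To illustrate the splitting behaviour (a rough sketch with made-up names, not BAScraper's actual code): the requested epoch range is cut into one contiguous slice per task, so a crash in one task loses only its slice while neighbouring slices may already be complete.

```python
# Rough sketch of splitting a time range across N worker tasks (illustrative only).
def split_range(after: int, before: int, tasks: int = 3) -> list[tuple[int, int]]:
    """Split the epoch range [after, before) into `tasks` contiguous segments."""
    step = (before - after) // tasks
    segments = []
    for i in range(tasks):
        seg_start = after + i * step
        # The last segment absorbs any remainder so the whole range is covered.
        seg_end = before if i == tasks - 1 else seg_start + step
        segments.append((seg_start, seg_end))
    return segments


# e.g. 3 tasks over one window -> three slices; if the middle task crashes,
# slices 1 and 3 may already be done (green) while slice 2 is lost (red).
```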

I might have to find a way to make recoveries easier later

lemixtape commented 1 week ago

Thank you. I think it would be very useful, as some subreddits have over 100,000 submissions to collect and take weeks to complete. It really hurts when it crashes near the end and can't be resumed. Perhaps write to a file in the directory what its last request was; on resume, the script would load the data it collected before and pick up where it left off. Resuming would imply continuing with the same number of tasks.

I was also thinking it may be useful to write to the JSON file after each request, rather than dumping everything at once. I am not sure how much memory would be needed to hold all 100,000 submissions, but it may be more than what is available on most computers.
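One possible shape for this, sketched with made-up file names (not an existing BAScraper feature): append each batch to a JSON Lines file as soon as it arrives, and record the last completed epoch per task in a small checkpoint file so a restart can reload the checkpoint and continue with the same task layout.

```python
# Illustrative sketch of incremental writes plus a resume checkpoint
# (hypothetical file layout, not something BAScraper currently does).
import json
from pathlib import Path

RESULTS = Path("submissions.jsonl")     # one JSON object per line, appended as we go
CHECKPOINT = Path("checkpoint.json")    # last completed epoch per task


def save_batch(task_id: int, batch: list[dict], last_epoch: int) -> None:
    # Append the batch immediately so memory stays flat even for 100k+ submissions.
    with RESULTS.open("a", encoding="utf-8") as f:
        for item in batch:
            f.write(json.dumps(item) + "\n")

    # Record progress so a restart can pick up from where this task stopped.
    state = json.loads(CHECKPOINT.read_text()) if CHECKPOINT.exists() else {}
    state[str(task_id)] = last_epoch
    CHECKPOINT.write_text(json.dumps(state))


def load_checkpoint() -> dict[str, int]:
    # On restart: same number of tasks, each resuming from its recorded epoch.
    return json.loads(CHECKPOINT.read_text()) if CHECKPOINT.exists() else {}
```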

maxjo020418 commented 1 week ago

Yeah, I might make it dump some sort of config file when it fails, so that it can read it back and pick up from where it failed.

Also, for that much data, I think it's better to use Arctic Shift's dumps, which are updated every month. The files are very big, but at least there's no risk of failure partway through. BAScraper is meant for fetching small-to-moderate amounts of data.

However, I am planning to implement Arctic Shift's API once things are finalized there and if I have time, since their API's search options and stability look better.