lemixtape opened 1 week ago
I thought it would create log files and also a dump file (a temp file containing whatever it had fetched before crashing) when it crashes. Did you check that?
Regardless, I did push a change to the repo to handle that error. It's not on PyPI yet, so you'll need to install it manually.
plz update on that :)
also, pullpush.io has been having server trouble recently, so maybe that has something to do with it
Even if it dumps what it has so far, is there a way to resume from where it last stopped? Thank you for the quick update!
Ah, thinking about it, that was an oversight on my part. I thought it could easily be resumed, but because it splits requests in a timeline-based way, the partial results are going to be all messed up...
sorry, but I don't think there's an easy way for now. It's certainly not impossible, though.
if it used 3 tasks (the default), it'll split the time range into 3, so it'll look something like this: (green: complete, red: failed)
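The split described above can be sketched roughly like this (the function name and signature are illustrative, not the tool's actual code):

```python
from datetime import datetime

def split_time_range(start, end, tasks=3):
    """Split [start, end] into `tasks` equal sub-ranges, one per worker.

    If one worker's request fails, only its slice of the timeline is
    incomplete, which is why a crash leaves scattered gaps rather than
    a single clean resume point.
    """
    step = (end - start) / tasks
    return [(start + i * step, start + (i + 1) * step) for i in range(tasks)]

# e.g. three workers each covering 30 days of a 90-day window
ranges = split_time_range(datetime(2023, 1, 1), datetime(2023, 4, 1), tasks=3)
```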
I might have to find a way to make recoveries easier later
Thank you. I think it would be very useful, as some subreddits have over 100,000 submissions to collect and take weeks to complete. It really hurts when it crashes near the end and can't resume. Perhaps write to a file in the directory what its last request was; when you resume, the script would load all the data it collected before and continue from where it left off. Resuming would imply continuing with the same number of tasks.
I was also thinking that it may be useful to write to the JSON file after each request, rather than dumping everything at the end. I'm not sure how much memory is needed to hold all 100,000 submissions, but it may be more than what is available on most computers.
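One common way to do that (a sketch of the general technique, not the tool's actual code) is to append each response as one line of JSON, so memory use stays flat no matter how large the subreddit is:

```python
import json

def append_batch(path, submissions):
    """Append each submission as one JSON line (JSON Lines format).

    Nothing accumulates in memory between requests, and a crash only
    loses the batch currently being written.
    """
    with open(path, "a", encoding="utf-8") as f:
        for sub in submissions:
            f.write(json.dumps(sub) + "\n")

def load_all(path):
    """Rebuild the full list later by streaming the file line by line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]
```

Calling `append_batch` after every request replaces the single end-of-run dump, at the cost of the output being JSON Lines rather than one big JSON array.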
yeah, I might make it dump some sort of config file when it fails, so that it can read it back and pick up from where it failed.
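A failure-time config dump like that could look roughly like this (the field names and file structure here are my guesses for illustration, not a real format):

```python
import json

def dump_checkpoint(path, tasks):
    """On failure, record each task's time range and completion status."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"version": 1, "tasks": tasks}, f, indent=2)

def load_checkpoint(path):
    """On restart, return only the tasks that still need (re)fetching."""
    with open(path, encoding="utf-8") as f:
        state = json.load(f)
    return [t for t in state["tasks"] if t["status"] != "complete"]

# hypothetical task records: epoch-second ranges plus a status flag
tasks = [
    {"start": 1672531200, "end": 1675123200, "status": "complete"},
    {"start": 1675123200, "end": 1677715200, "status": "failed"},
]
```

Restarting with the same task split (as suggested above) then amounts to re-running only the ranges returned by `load_checkpoint`.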
Also, for that much data, I think it's better to use Arctic Shift's dumps, which are updated every month. The files are very big, but at least there's no fear of failing partway through. This tool is meant to fetch small-to-moderate amounts of data.
However, I am planning to implement Arctic Shift's API once things are finalized there, if I have time, since their API's search options and stability look better.
It seems that the API is returning malformed JSON, and this is a huge problem for large subreddits: we have to restart the data collection and may never finish, since it keeps crashing along the way. Would it be possible to add better error handling so the script keeps going when it hits these responses, and keeps a record of them in the logs?
Here are two examples.

```
502, message='Attempt to decode JSON with unexpected mimetype: text/html', url='https://api.pullpush.io/reddit/search/comment/?link_id=10qlmx7'
```

Here is another example:

```
502, message='Attempt to decode JSON with unexpected mimetype: text/html', url='https://api.pullpush.io/reddit/search/comment/?link_id=14na6nc'
```
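That error text looks like aiohttp's `ContentTypeError`: the 502 gateway error comes back as an HTML page, so the JSON decode fails. The handling being asked for could look roughly like this synchronous stdlib sketch (the function name, parameters, and injectable `opener` are illustrative, not the tool's actual code):

```python
import json
import logging
import time
import urllib.error
import urllib.request

def fetch_json(url, retries=3, backoff=5, opener=urllib.request.urlopen):
    """Fetch a URL, tolerating HTML error pages where JSON was expected.

    Retries with a growing delay; on final failure, logs the URL and
    returns None so the caller can record the gap instead of crashing.
    """
    for attempt in range(1, retries + 1):
        try:
            with opener(url) as resp:
                return json.loads(resp.read())
        except (urllib.error.URLError, json.JSONDecodeError) as e:
            # 502s from pullpush arrive as text/html, not JSON
            logging.warning("bad response from %s (attempt %d/%d): %s",
                            url, attempt, retries, e)
            time.sleep(backoff * attempt)
    logging.error("giving up on %s; recording the gap and moving on", url)
    return None
```

In the tool itself the equivalent would be catching `aiohttp.ContentTypeError` around `resp.json()` and logging the failed `link_id` instead of letting the exception propagate.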