karlicoss / stexport

Export and access your Stackexchange data
MIT License

RFE: store exported data even if blocked due to `throttle violation` #1

Closed ankostis closed 3 years ago

ankostis commented 3 years ago

When authenticated with a Stack Exchange app and downloading all sites, Stack Exchange blocked the export with a throttle_violation (502) error, and the exporter crashed without storing any of the downloaded data.

The expected behavior would be to save the data collected so far, e.g. in a try-finally clause.
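For illustration, the idea is something along these lines (a rough sketch with hypothetical names, not the actual export.py code):

import json

def export_all(exporter, sites, output_path):
    """Export each site, but dump whatever was collected even if one site fails."""
    all_data = {}
    try:
        for site in sites:
            all_data[site] = exporter.export_site(site=site)
    finally:
        # on a StackAPIError the partial result still gets written out
        with open(output_path, 'w') as fo:
            json.dump(all_data, fo, ensure_ascii=False, indent=1)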

./stexport.git $ python -m stexport.export \
        --user_id ... --key ... --access_token ... \
        --all-sites \
        stackexchange-$(date +%Y%m%d).json
[I 210308 10:16:14 export:149] exporting ['3dprinting', '3dprinting.meta', 'academia', 'academia.meta', 'ai', 'ai.meta', 'alcohol', 'alcohol.meta', 'android', 'android.meta', 'anime', 'anime.meta', 'apple', 'apple.meta', 'arduino', 'arduino.meta', 'askubuntu', 'astronomy', 'astronomy.meta', 'aviation', 'aviation.meta', 'bicycles', 'bicycles.meta', 'bioinformatics', 'bioinformatics.meta', 'biology', 'biology.meta', 'bitcoin', 'bitcoin.meta', 'blender', 'blender.meta', 'boardgames', 'boardgames.meta', 'bricks', 'bricks.meta', 'buddhism', 'buddhism.meta', 'chemistry', 'chemistry.meta', 'chess', 'chess.meta', 'chinese', 'chinese.meta', 'christianity', 'christianity.meta', 'civicrm', 'civicrm.meta', 'codegolf', 'codegolf.meta', 'codereview', 'codereview.meta', 'coffee', 'coffee.meta', 'communitybuilding', 'communitybuilding.meta', 'computergraphics', 'computergraphics.meta', 'conlang', 'conlang.meta', 'cooking', 'cooking.meta', 'craftcms', 'craftcms.meta', 'crafts', 'crafts.meta', 'crypto', 'crypto.meta', 'cs', 'cs.meta', 'cs50', 'cs50.meta', 'cseducators', 'cseducators.meta', 'cstheory', 'cstheory.meta', 'datascience', 'datascience.meta', 'dba', 'dba.meta', 'devops', 'devops.meta', 'diy', 'diy.meta', 'drones', 'drones.meta', 'drupal', 'drupal.meta', 'dsp', 'dsp.meta', 'earthscience', 'earthscience.meta', 'ebooks', 'ebooks.meta', 'economics', 'economics.meta', 'electronics', 'electronics.meta', 'elementaryos', 'elementaryos.meta', 'ell', 'ell.meta', 'emacs', 'emacs.meta', 'engineering', 'engineering.meta', 'english', 'english.meta', 'eosio', 'eosio.meta', 'es.meta.stackoverflow', 'es.stackoverflow', 'esperanto', 'esperanto.meta', 'ethereum', 'ethereum.meta', 'expatriates', 'expatriates.meta', 'expressionengine', 'expressionengine.meta', 'fitness', 'fitness.meta', 'freelancing', 'freelancing.meta', 'french', 'french.meta', 'gamedev', 'gamedev.meta', 'gaming', 'gaming.meta', 'gardening', 'gardening.meta', 'genealogy', 'genealogy.meta', 'german', 'german.meta', 'gis', 'gis.meta', 'graphicdesign', 'graphicdesign.meta', 'ham', 'ham.meta', 'hardwarerecs', 'hardwarerecs.meta', 'hermeneutics', 'hermeneutics.meta', 'hinduism', 'hinduism.meta', 'history', 'history.meta', 'homebrew', 'homebrew.meta', 'hsm', 'hsm.meta', 'interpersonal', 'interpersonal.meta', 'iot', 'iot.meta', 'iota', 'iota.meta', 'islam', 'islam.meta', 'italian', 'italian.meta', 'ja.meta.stackoverflow', 'ja.stackoverflow', 'japanese', 'japanese.meta', 'joomla', 'joomla.meta', 'judaism', 'judaism.meta', 'korean', 'korean.meta', 'languagelearning', 'languagelearning.meta', 'latin', 'latin.meta', 'law', 'law.meta', 'lifehacks', 'lifehacks.meta', 'linguistics', 'linguistics.meta', 'literature', 'literature.meta', 'magento', 'magento.meta', 'martialarts', 'martialarts.meta', 'math', 'math.meta', 'matheducators', 'matheducators.meta', 'mathematica', 'mathematica.meta', 'mathoverflow.net', 'mattermodeling', 'mattermodeling.meta', 'mechanics', 'mechanics.meta', 'medicalsciences', 'medicalsciences.meta', 'meta', 'meta.askubuntu', 'meta.mathoverflow.net', 'meta.serverfault', 'meta.stackoverflow', 'meta.superuser', 'monero', 'monero.meta', 'money', 'money.meta', 'movies', 'movies.meta', 'music', 'music.meta', 'musicfans', 'musicfans.meta', 'mythology', 'mythology.meta', 'networkengineering', 'networkengineering.meta', 'opendata', 'opendata.meta', 'opensource', 'opensource.meta', 'or', 'or.meta', 'outdoors', 'outdoors.meta', 'parenting', 'parenting.meta', 'patents', 'patents.meta', 'pets', 'pets.meta', 'philosophy', 'philosophy.meta', 'photo', 
'photo.meta', 'physics', 'physics.meta', 'pm', 'pm.meta', 'poker', 'poker.meta', 'politics', 'politics.meta', 'portuguese', 'portuguese.meta', 'psychology', 'psychology.meta', 'pt.meta.stackoverflow', 'pt.stackoverflow', 'puzzling', 'puzzling.meta', 'quant', 'quant.meta', 'quantumcomputing', 'quantumcomputing.meta', 'raspberrypi', 'raspberrypi.meta', 'retrocomputing', 'retrocomputing.meta', 'reverseengineering', 'reverseengineering.meta', 'robotics', 'robotics.meta', 'rpg', 'rpg.meta', 'ru.meta.stackoverflow', 'ru.stackoverflow', 'rus', 'rus.meta', 'russian', 'russian.meta', 'salesforce', 'salesforce.meta', 'scicomp', 'scicomp.meta', 'scifi', 'scifi.meta', 'security', 'security.meta', 'serverfault', 'sharepoint', 'sharepoint.meta', 'sitecore', 'sitecore.meta', 'skeptics', 'skeptics.meta', 'softwareengineering', 'softwareengineering.meta', 'softwarerecs', 'softwarerecs.meta', 'sound', 'sound.meta', 'space', 'space.meta', 'spanish', 'spanish.meta', 'sports', 'sports.meta', 'sqa', 'sqa.meta', 'stackapps', 'stackoverflow', 'stats', 'stats.meta', 'stellar', 'stellar.meta', 'superuser', 'sustainability', 'sustainability.meta', 'tex', 'tex.meta', 'tezos', 'tezos.meta', 'tor', 'tor.meta', 'travel', 'travel.meta', 'tridion', 'tridion.meta', 'ukrainian', 'ukrainian.meta', 'unix', 'unix.meta', 'ux', 'ux.meta', 'vegetarianism', 'vegetarianism.meta', 'vi', 'vi.meta', 'video', 'video.meta', 'webapps', 'webapps.meta', 'webmasters', 'webmasters.meta', 'windowsphone', 'windowsphone.meta', 'woodworking', 'woodworking.meta', 'wordpress', 'wordpress.meta', 'workplace', 'workplace.meta', 'worldbuilding', 'worldbuilding.meta', 'writing', 'writing.meta']
...
[I 210308 10:21:55 export:132] exporting askubuntu: users/{ids}/reputation
[I 210308 10:21:56 export:132] exporting askubuntu: users/{ids}/reputation-history
[I 210308 10:21:57 export:132] exporting askubuntu: users/{ids}/suggested-edits
[E 210308 10:21:58 _common:101] Giving up fetch_backoff(...) after 1 tries (stackapi.stackapi.StackAPIError: ('https://api.stackexchange.com/2.2/users/19769/suggested-edits/?pagesize=100&page=1&filter=%21LVBj2%28M0Wr1s_VedzkH%28VG&site=askubuntu', 502, 'throttle_violation', 'too many requests from this IP, more requests available in 83579 seconds'))
Traceback (most recent call last):
  File "/usr/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "./stexport.git/src/stexport/export.py", line 188, in <module>
    main()
  File "./stexport.git/src/stexport/export.py", line 181, in main
    j = exporter.export_json(sites=sites)
  File "./stexport.git/src/stexport/export.py", line 153, in export_json
    all_data[site] = self.export_site(site=site)
  File "./stexport.git/src/stexport/export.py", line 134, in export_site
    data[ep] = fetch_backoff(
  File ".venv/lib/python3.9/site-packages/backoff/_sync.py", line 94, in retry
    ret = target(*args, **kwargs)
  File "./stexport.git/src/stexport/export.py", line 106, in fetch_backoff
    return api.fetch(*args, **kwargs)
  File ".venv/lib/python3.9/site-packages/stackapi/stackapi.py", line 198, in fetch
    raise StackAPIError(self._previous_call, error, code, message)
stackapi.stackapi.StackAPIError: ('https://api.stackexchange.com/2.2/users/19769/suggested-edits/?pagesize=100&page=1&filter=%21LVBj2%28M0Wr1s_VedzkH%28VG&site=askubuntu', 502, 'throttle_violation', 'too many requests from this IP, more requests available in 83579 seconds')

Reported against e93ec39 (Dec 3 2021)

RFE-1

Add a --version option so that reports like this one can state which version of the tool each issue refers to.
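For example, something like this (a sketch assuming an argparse-based CLI and installed package metadata; the actual wiring in export.py may differ):

import argparse
from importlib.metadata import version, PackageNotFoundError  # Python 3.8+

def make_parser() -> argparse.ArgumentParser:
    try:
        pkg_version = version('stexport')
    except PackageNotFoundError:
        pkg_version = 'unknown'  # e.g. when running from a source checkout
    parser = argparse.ArgumentParser(prog='stexport.export')
    # prints the version and exits, so bug reports can include it verbatim
    parser.add_argument('--version', action='version', version=f'%(prog)s {pkg_version}')
    return parser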

ankostis commented 3 years ago

RFE-2

I suggest adding a warning to the --help message about --all-sites and the IP ban. The message should recommend using --site instead, for each Stack Exchange site the user has actually contributed posts, answers & comments to (unfortunately, the user's profile does not list the sites where you have only cast votes).

ankostis commented 3 years ago

RFE-3

An even better facility would be for the exporter to download the list of SE sites the user has registered on, and export only those sites (hoping that does not trigger the ban).

RFE-4

Is there some index on each site describing which types of content the user has submitted, so that only selective download requests are made and the ban limit is never reached?

karlicoss commented 3 years ago

Ah yes indeed, it's pretty annoying. And yeah, that's why I added the --site option -- I ended up only running it for a few sites (instead of the whole network). But I had no idea it banned for 24h! Maybe it's recent, but anyway it would be good to add to the readme, yeah.

I think there are a few options, although I need to think through all the pros and cons before implementing any of them (because it might be a fair amount of work):

karlicoss commented 3 years ago

An even better facility would be for the exporter to download the list of SE sites the user has registered on, and export only those sites (hoping that does not trigger the ban).

Yeah, I think that makes a lot of sense! Can't remember whether there was such an endpoint, but maybe it's possible to reuse https://github.com/karlicoss/stexport/blob/e93ec392ccec88531a0ac05cd04b0750d4dc077a/src/stexport/export.py#L4 for it? Maybe there is some meta information in the result of this call that would tell if you need to do any further calls for this site at all.

Is there some index on each site describing which types of content the user has submitted, so that only selective download requests are made and the ban limit is never reached?

Perhaps this? https://github.com/karlicoss/stexport/blob/e93ec392ccec88531a0ac05cd04b0750d4dc077a/src/stexport/export.py#L3-L64

ankostis commented 3 years ago

I see that RFE-4 does not make sense; it's just 5 URLs per site.

RFE-3 is the important one -- I got my list of sites to scrape from here: https://stackexchange.com/users/263317/ankostis?tab=accounts

karlicoss commented 3 years ago

oh btw about that

the exporter crashed without storing any of the downloaded data

Yeah indeed, it's also kind of a problem -- a consequence of the way the exporters work: they output data to stdout (for simplicity), and then it's dumped atomically. Even if the data were written out along the way, since it's a single JSON structure (a dictionary in this case) it would be malformed unless complete, so it would require some manual intervention to make it well-formed.

Maybe it won't even be necessary in this case if we make fewer requests and get away with it -- but a more general way to handle this might be to let the export files be JSONL. So, for example, it could dump one JSON object per site on each line. That would still be backed by a single text file (so we keep the simplicity), but it's flexible enough and easy to assemble back in dal.py (it just needs to read the input file exhaustively instead of doing a single json.load as before). Also, with a single --site, the output would be exactly the same as before, which is kind of nice I guess.
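Roughly, the export side and the dal.py side could look something like this (assumed shapes and function names, just to sketch the idea):

import json

def export_jsonl(exporter, sites, fo):
    # one JSON object per line: a failure on site N still leaves
    # sites 1..N-1 on disk as complete, valid lines
    for site in sites:
        data = exporter.export_site(site=site)
        fo.write(json.dumps({'site': site, 'data': data}, ensure_ascii=False) + '\n')

def read_jsonl(path):
    # dal.py side: read the file exhaustively and assemble the per-site dicts
    result = {}
    with open(path) as fi:
        for line in fi:
            obj = json.loads(line)
            result[obj['site']] = obj['data']
    return result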

Cobertos commented 3 years ago

Weird that you got hit with a full-day ban? The limits they disclose are:

And regarding RFE-3, looks like it could leverage the me/associated-users endpoint.
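For reference, fetching the associated accounts directly could look roughly like this (a sketch against the /me/associated endpoint of the Stack Exchange API; field names as documented for the network_user type, error handling omitted):

import requests

API = 'https://api.stackexchange.com/2.2'

def associated_sites(key: str, access_token: str) -> list:
    # returns the site URLs the authenticated user has an account on
    sites = []
    page = 1
    while True:
        r = requests.get(f'{API}/me/associated', params={
            'key': key,
            'access_token': access_token,
            'pagesize': 100,
            'page': page,
        })
        r.raise_for_status()
        j = r.json()
        sites.extend(item['site_url'] for item in j.get('items', []))
        if not j.get('has_more'):
            break
        page += 1
    return sites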

EDIT: Made PR for RFE-3 in #3

Cobertos commented 3 years ago

Wow, yeah, I've made maybe 300 requests total while testing and I just got the 24hr ban... Wasn't able to get a full export unfortunately

karlicoss commented 3 years ago

Some things I noticed from my own experiments:

Quota is returned in the raw API response: https://github.com/karlicoss/stexport/blob/e9e11296a6b2b89b4786db1b94b24e43611003c7/src/stexport/export.py#L134-L138 (we only extract items from it, so we could at least inspect it beforehand to avoid the ban). However, there was an issue: https://github.com/AWegnerGitHub/stackapi/issues/41 -- it wasn't released on PyPI yet (https://github.com/AWegnerGitHub/stackapi/issues/44), but the author has kindly released it now! So after pip3 install --user stackapi --upgrade it reports the stats correctly:

 'quota_max': 300,
 'quota_remaining': 28,

So maybe at the very least it would be possible to warn the user if we're about to make too many requests (i.e. if len(ENDPOINTS) * len(sites) > 300).
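Something along these lines, perhaps (a sketch reusing the ENDPOINTS list and the sites to export; 300 is the default quota mentioned above):

import logging

DEFAULT_QUOTA = 300  # quota_max without proper key/access_token

def warn_if_over_quota(sites, endpoints, logger=logging.getLogger(__name__)):
    planned = len(endpoints) * len(sites)
    if planned > DEFAULT_QUOTA:
        logger.warning(
            'about to make roughly %d requests, more than the default quota of %d; '
            'consider passing fewer --site arguments or proper API credentials',
            planned, DEFAULT_QUOTA,
        )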

karlicoss commented 3 years ago

And I think I figured it out :) The 'site apis' didn't get the api parameters, so they ended up with the default limit of 300 requests instead of 10K. Once we merge @Cobertos' PR (which still makes sense nevertheless!) I can also merge this, and hopefully it will resolve the issue? https://github.com/karlicoss/stexport/blob/4a3687bdf6d58767b6b1c81624a7ca837f5d2b7d/src/stexport/export.py#L150
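Conceptually the fix is just passing the credentials when constructing the per-site client, assuming StackAPI accepts key and access_token keyword arguments (a sketch, not the actual diff):

from stackapi import StackAPI

def site_api(site: str, key: str, access_token: str) -> StackAPI:
    # with the app key and access token the per-site quota goes from the
    # default 300 up to the authenticated 10K limit
    return StackAPI(site, key=key, access_token=access_token)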

P.S. I also noticed some weird Expecting value: line 1 column 1 (char 0) failures at times when exporting everything (it seems to happen only on specific communities somehow; I guess that's why it hasn't happened before). Also worked around: https://github.com/karlicoss/stexport/commit/c2ae2f7eddc8b654936303d975e9bcf596cbe39c , will merge after #3

Cobertos commented 3 years ago

After merging origin/fix locally, I was able to run the full export. fetch_backoff kicked in about 6 times (each time it retried something like 7 times with waits of up to 120s), but I got through all of the sites below without being banned :3

exporting ['academia', 'ai', 'alcohol', 'android', 'arduino', 'askubuntu', 'bicycles', 'blender', 'codegolf', 'codereview', 'computergraphics', 'diy', 'electronics', 'english', 'gamedev', 'gaming', 'gis', 'interpersonal', 'japanese', 'math', 'mechanics', 'money', 'movies', 'music', 'outdoors', 'parenting', 'physics', 'politics', 'raspberrypi', 'security', 'serverfault', 'skeptics', 'softwareengineering', 'softwarerecs', 'stackapps', 'stackoverflow', 'superuser', 'travel', 'unix', 'video', 'webapps', 'webmasters', 'workplace', 'worldbuilding', 'writing']

karlicoss commented 3 years ago

Ok, I guess this is fixed in master! I also updated the readme about getting an access_token (it's just a matter of copying a URL, so I figured it's not worth a --login functionality). Thanks everyone, now we can finally export all of it properly!