Closed ankostis closed 3 years ago
I suggest adding a warning to the --help message about the use of --all-sites and the resulting IP ban.
The message should instead recommend using --site for each Stack Exchange site where the user has actually contributed posts, replies & comments (unfortunately, the user's profile does not list sites where you have cast votes).
An even better facility would be for the exporter to download the list of SE sites that user has registered, and download only those sites (hoping it does not trigger the ban).
Is there some index on each site describing which types of content the user has submitted, so the exporter can make only the necessary download requests and avoid reaching the ban limit?
Ah yes indeed, it's pretty annoying. And yeah, that's why I added the --site option -- I ended up only running it for a few sites (instead of the whole network)
But I had no idea it banned for 24h! Maybe it's recent, but anyway would be good to add to readme yeah.
I think there are a few options, although I need to think through all the pros and cons before implementing any (because it might be a fair amount of work):
An even better facility would be for the exporter to download the list of SE sites that user has registered, and download only those sites (hoping it does not trigger the ban).
Yeah, I think that makes a lot of sense! Can't remember whether there was such an endpoint, but maybe it's possible to reuse https://github.com/karlicoss/stexport/blob/e93ec392ccec88531a0ac05cd04b0750d4dc077a/src/stexport/export.py#L4 for it? Maybe there is some meta information in the result of this call that would tell if you need to do any further calls for this site at all.
Is there some index on each site describing which types of content the user has submitted, so the exporter can make only the necessary download requests and avoid reaching the ban limit?
Perhaps this? https://github.com/karlicoss/stexport/blob/e93ec392ccec88531a0ac05cd04b0750d4dc077a/src/stexport/export.py#L3-L64
I see that RFE-4 does not make sense, it's just 5 urls-per-site.
RFE-3 is the important stuff - I got my list of sites to scrape from this: https://stackexchange.com/users/263317/ankostis?tab=accounts
oh btw about that
the exporter crashed, without storing any of the downloaded data
Yeah indeed, it's also kind of a problem -- a consequence of the way exporters work -- they output data to stdout (for simplicity) and then it's dumped atomically. Even if it was written in the process, since it's a single JSON structure (dictionary in this case), it would be malformed unless it's complete, so would require some manual intervention to make it well-formed.
Maybe it won't be necessary in this case if we make fewer requests and get away with it -- but a more general way to do this might be to let the export files be JSONL. So for example, it could dump one JSON object per site on each line. It would allow the export to be backed by a single text file (so we keep simplicity), but would also be flexible enough and easy to assemble back in dal.py (just need to read the input file exhaustively instead of one json.load as before). Also with a single --site, it would be exactly the same output as before, which is kind of nice I guess.
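A minimal sketch of the JSONL idea, assuming one JSON object per site per line. The names dump_sites_jsonl and load_sites_jsonl are hypothetical and are not taken from stexport or dal.py; the point is only that every completed line stays well-formed even if a later site fails.

```python
import io
import json
from typing import Dict, Iterable, TextIO, Tuple


def dump_sites_jsonl(site_data: Iterable[Tuple[str, dict]], out: TextIO) -> None:
    """Write one JSON object per line, one line per site (hypothetical helper)."""
    for site, data in site_data:
        # json.dumps without indent never emits a raw newline, so each site
        # occupies exactly one line.
        out.write(json.dumps({site: data}) + "\n")
        out.flush()  # sites already written survive a crash on a later site


def load_sites_jsonl(inp: TextIO) -> Dict[str, dict]:
    """Read the file exhaustively and merge the per-site objects back into one dict."""
    merged: Dict[str, dict] = {}
    for line in inp:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        merged.update(json.loads(line))
    return merged


if __name__ == "__main__":
    buf = io.StringIO()
    dump_sites_jsonl([("stackoverflow", {"answers": []}), ("unix", {"answers": []})], buf)
    buf.seek(0)
    print(load_sites_jsonl(buf))
```

With a single site the file contains exactly one line holding one JSON object, so a plain json.load on it would still work.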
Weird that you got hit with a full day? The limits they disclose are:
And regarding RFE-3, looks like it could leverage the me/associated-users endpoint.
EDIT: Made PR for RFE-3 in #3
Wow, yeah, I've made maybe 300 requests total while testing and I just got the 24hr ban... Wasn't able to get a full export unfortunately
Some things I noticed from my own experiments:
Quota is returned in the raw api response: https://github.com/karlicoss/stexport/blob/e9e11296a6b2b89b4786db1b94b24e43611003c7/src/stexport/export.py#L134-L138 (we only extract items from it, so we could at least inspect it beforehand to avoid the ban). However there is an issue: https://github.com/AWegnerGitHub/stackapi/issues/41, the fix for which wasn't released on pypi yet (https://github.com/AWegnerGitHub/stackapi/issues/44), but the author kindly released it now! So after pip3 install --user stackapi --upgrade it now reports stats correctly:
'quota_max': 300,
'quota_remaining': 28,
So maybe at the very least it would be possible to warn the user if we're about to make too many requests (i.e. len(ENDPOINTS) * len(sites) > 300).
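A rough sketch of that pre-flight check, assuming one request per endpoint per site (pagination would need more, so the estimate is a lower bound). The function names and the default of 300 are illustrative, taken from the quota_max value reported above rather than from stexport itself.

```python
from typing import Optional, Sequence

DEFAULT_QUOTA = 300  # assumption: the quota_max value reported in this thread


def estimated_requests(endpoints: Sequence[str], sites: Sequence[str]) -> int:
    """Lower bound on requests: one call per endpoint per site, ignoring pagination."""
    return len(endpoints) * len(sites)


def quota_warning(
    endpoints: Sequence[str],
    sites: Sequence[str],
    quota_remaining: int = DEFAULT_QUOTA,
) -> Optional[str]:
    """Return a warning message if the export is likely to exceed the quota, else None."""
    needed = estimated_requests(endpoints, sites)
    if needed > quota_remaining:
        return (
            f"about to make at least {needed} requests, "
            f"but only {quota_remaining} remain in the quota"
        )
    return None


if __name__ == "__main__":
    # hypothetical endpoint and site names, for illustration only
    endpoints = ["answers", "comments", "questions", "badges", "favorites"]
    sites = [f"site{i}" for i in range(61)]
    print(quota_warning(endpoints, sites))
```

Passing the quota_remaining value from the raw API response instead of the default would make the check reflect what is actually left.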
And I think I figured it out :) The 'site apis' didn't get the api parameters, so they ended up with the default limit of 300 requests instead of 10K. Once we merge @Cobertos's PR (which still makes sense nevertheless!) I can also merge it and hopefully it will resolve this? https://github.com/karlicoss/stexport/blob/4a3687bdf6d58767b6b1c81624a7ca837f5d2b7d/src/stexport/export.py#L150
P.S. also noticed some weird Expecting value: line 1 column 1 (char 0) failure at times when exporting everything (seems to happen only on specific communities somehow, guess that's why it hasn't happened before).
Also worked around: https://github.com/karlicoss/stexport/commit/c2ae2f7eddc8b654936303d975e9bcf596cbe39c , will merge after #3
After merging origin/fix locally, I was able to run the full export. fetch_backoff backed off about 6 times (each time it backed off like 7 times of up to 120s), but I got through all of the below sites without being banned :3
exporting ['academia', 'ai', 'alcohol', 'android', 'arduino', 'askubuntu', 'bicycles', 'blender', 'codegolf', 'codereview', 'computergraphics', 'diy', 'electronics', 'english', 'gamedev', 'gaming', 'gis', 'interpersonal', 'japanese', 'math', 'mechanics', 'money', 'movies', 'music', 'outdoors', 'parenting', 'physics', 'politics', 'raspberrypi', 'security', 'serverfault', 'skeptics', 'softwareengineering', 'softwarerecs', 'stackapps', 'stackoverflow', 'superuser', 'travel', 'unix', 'video', 'webapps', 'webmasters', 'workplace', 'worldbuilding', 'writing']
Ok, I guess this is fixed in master! I also updated the readme about getting access_token (it's just a matter of copying a URL so figured it's not worth a --login functionality)
Thanks everyone, now we can finally properly export all of it!
When authenticated with a stackexchange app and downloading all sites, Stack Exchange blocked the export with a throttle_violation(502) error, and the exporter crashed without storing any of the downloaded data. The expected behavior would be to save the collected data in a try/finally block.
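The expected behavior could look roughly like the sketch below. export_sites and fetch_site are hypothetical names, not stexport's actual functions; the sketch only shows that a finally clause writes whatever was collected before the error propagates.

```python
import json
import sys
from typing import Callable, Dict, Iterable, TextIO


def export_sites(
    sites: Iterable[str],
    fetch_site: Callable[[str], dict],  # hypothetical per-site fetcher
    out: TextIO = sys.stdout,
) -> None:
    """Fetch each site in turn and always dump what was collected so far."""
    collected: Dict[str, dict] = {}
    try:
        for site in sites:
            # may raise, e.g. on a throttle_violation from the API
            collected[site] = fetch_site(site)
    finally:
        # Runs on success and on failure; the exception (if any) is re-raised
        # after the partial result has been written as one complete JSON object.
        json.dump(collected, out)
        out.write("\n")
```

The output stays a single well-formed JSON dictionary in both cases; it is just incomplete when a site fails partway through.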
Reported against e93ec39 (Dec 3 2021)
RFE-1
Add --version so that reports like this one can state the version of the tool each issue refers to.