Open catharsis71 opened 3 years ago
Confirmed. Same here, except the domain at the moment is archive.is.
Looking at the response that comes back it appears that it is getting caught by a CAPTCHA. Unsure what the proper way to deal with this is.
I was getting 429s too. One thing that can fix this is trying different user-agent strings. I switched mine to Firefox on Linux and it worked:
The command-line program comes bundled with a -ua flag (or --user-agent) to change it.
archiveis -ua "Mozilla/5.0 (X11; Linux x86_64; rv:10.0) Gecko/20100101 Firefox/10.0" https://google.com
https://archive.md/wip/VBqdJ
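If you are calling the library from Python instead of the shell, the same idea can be sketched as trying several user-agent strings in turn until one gets through. This is only a rough sketch: the helper name `capture_with_user_agents` is made up, the second user-agent string is just an illustrative guess, and it assumes the capture function you pass in accepts a `user_agent` keyword mirroring the CLI flag (check that against the version you have installed).

```python
# First entry is the string reported to work above; the second is an
# illustrative assumption, not something confirmed in this thread.
USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64; rv:10.0) Gecko/20100101 Firefox/10.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/115.0",
]


def capture_with_user_agents(capture, url, user_agents=USER_AGENTS):
    """Call capture(url, user_agent=...) with each user agent until one succeeds.

    `capture` is any callable with that shape (hypothetical helper; the
    `user_agent` keyword is an assumption about the library's API).
    """
    last_error = None
    for ua in user_agents:
        try:
            return capture(url, user_agent=ua)
        except Exception as exc:  # e.g. an HTTP 429 surfaced as an exception
            last_error = exc
    raise RuntimeError(f"all user agents failed for {url}") from last_error


# Possible usage, assuming the package exposes a capture function like this:
#   import archiveis
#   print(capture_with_user_agents(archiveis.capture, "https://google.com"))
```

Passing the capture function in as an argument keeps the helper independent of the library, so it can be tried against a stub first without hitting the real service.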
> Looking at the response that comes back it appears that it is getting caught by a CAPTCHA. Unsure what the proper way to deal with this is.
I think once you hit that captcha you're pretty much just stuck. I've tried using browser automation libraries like Playwright to do the obvious things, like clicking the captcha checkbox, to no avail.
Like the comment above said, choosing a real user-agent probably helps.
Besides that, I think their rate limiting is largely IP-based, so if you are able to distribute your requests across several IPs, that may help.
I have also noticed that they throttle me way more aggressively when I'm using Cloudflare 1.1.1.1 or Warp. This likely applies to other VPNs as well, but I haven't tested that personally.
It would be nice to wrap this project in a queuing system that treats submissions as jobs and retries them automatically, to make it more robust and act like a "service". I haven't seen anyone doing exactly that yet.
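To make the queue idea a bit more concrete, here is a rough sketch of what such a wrapper might look like: submissions become jobs in a queue, and a failed job is put back at the end after an exponentially growing pause. Everything here is hypothetical (the `run_queue` name, the retry count, the 60-second base delay); `capture` stands for whatever function actually submits a URL and raises on failure.

```python
import time
from collections import deque


def run_queue(urls, capture, max_attempts=3, base_delay=60.0, sleep=time.sleep):
    """Submit each URL via capture(url), retrying failures with backoff.

    Returns a dict mapping each URL to the capture result, or to None if
    the job still failed after max_attempts tries. All defaults are
    illustrative assumptions, not values tuned against archive.is.
    """
    pending = deque((url, 0) for url in urls)  # (url, attempts so far)
    results = {}
    while pending:
        url, attempts = pending.popleft()
        attempts += 1
        try:
            results[url] = capture(url)
        except Exception:
            if attempts >= max_attempts:
                results[url] = None  # give up on this job
            else:
                # Back off before touching the service again: 1x, 2x, 4x, ...
                sleep(base_delay * 2 ** (attempts - 1))
                pending.append((url, attempts))  # retry after the other jobs
    return results
```

The pause blocks the whole queue rather than just the failed job. That is deliberate in this sketch, since the throttling discussed above appears to be per IP, so hammering on with other URLs would probably just earn more 429s.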
I have also noticed that, while infrequent, there are definitely some pages which seem to crash the archive.is archiver and never succeed in being archived. Not sure if there's a way to report that to them.
This has never worked for me; I always get a 429 error.
Running from within Python:
Running command-line: