palewire / archiveis

A simple Python wrapper for the archive.is capturing service
https://archive.is
MIT License

429 Client Error: Too Many Requests for url: https://archive.md/ #32

Open catharsis71 opened 3 years ago

catharsis71 commented 3 years ago

This has never worked for me; I always get a 429 error.

Running from within Python:

>>> import archiveis
>>> archive_url = archiveis.capture("http://www.example.com/")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/username/.local/share/virtualenvs/username-nbgasrwQ/lib/python3.8/site-packages/archiveis/api.py", line 39, in capture
    response.raise_for_status()
  File "/home/username/.local/share/virtualenvs/username-nbgasrwQ/lib/python3.8/site-packages/requests/models.py", line 953, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 429 Client Error: Too Many Requests for url: https://archive.md/

Running from the command line:

$ archiveis https://google.com/
Traceback (most recent call last):
  File "./.local/share/virtualenvs/google-JUflU5ax/bin/archiveis", line 8, in <module>
    sys.exit(cli())
  File "/home/username/.local/share/virtualenvs/google-JUflU5ax/lib/python3.8/site-packages/click/core.py", line 1137, in __call__
    return self.main(*args, **kwargs)
  File "/home/username/.local/share/virtualenvs/google-JUflU5ax/lib/python3.8/site-packages/click/core.py", line 1062, in main
    rv = self.invoke(ctx)
  File "/home/username/.local/share/virtualenvs/google-JUflU5ax/lib/python3.8/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/username/.local/share/virtualenvs/google-JUflU5ax/lib/python3.8/site-packages/click/core.py", line 763, in invoke
    return __callback(*args, **kwargs)
  File "/home/username/.local/share/virtualenvs/google-JUflU5ax/lib/python3.8/site-packages/archiveis/api.py", line 106, in cli
    archive_url = capture(url, **kwargs)
  File "/home/username/.local/share/virtualenvs/google-JUflU5ax/lib/python3.8/site-packages/archiveis/api.py", line 39, in capture
    response.raise_for_status()
  File "/home/username/.local/share/virtualenvs/google-JUflU5ax/lib/python3.8/site-packages/requests/models.py", line 953, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 429 Client Error: Too Many Requests for url: https://archive.md/
adrelanos commented 1 year ago

Confirmed. Same here.

Except that the domain at the moment is archive.is.

jmlingeman commented 1 year ago

Looking at the response that comes back, it appears the request is getting caught by a CAPTCHA. I'm unsure what the proper way to deal with this is.

bpmcdevitt commented 1 year ago

I was getting 429s too. A workaround is to try different user-agent strings. I switched mine to Firefox on Linux and it worked:

The command-line program includes a -ua (or --user-agent) flag to change it:

archiveis -ua "Mozilla/5.0 (X11; Linux x86_64; rv:10.0) Gecko/20100101 Firefox/10.0" https://google.com
https://archive.md/wip/VBqdJ
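
For the Python API, the same override should be possible. A minimal sketch, assuming capture() accepts a user_agent keyword argument mirroring the CLI's -ua/--user-agent flag:

import archiveis

# Assumption: capture() forwards this user_agent value in the
# request headers, as the CLI flag suggests.
archive_url = archiveis.capture(
    "http://www.example.com/",
    user_agent="Mozilla/5.0 (X11; Linux x86_64; rv:10.0) Gecko/20100101 Firefox/10.0",
)
print(archive_url)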
tedmiston commented 1 year ago

> Looking at the response that comes back, it appears the request is getting caught by a CAPTCHA. I'm unsure what the proper way to deal with this is.

I think once you hit that CAPTCHA you're pretty much stuck. I've tried obvious things in browser-automation libraries like Playwright, such as clicking the CAPTCHA checkbox, to no avail.

As the comment above said, using a realistic user-agent string probably helps.

Besides that, I think their rate limiting is largely IP-based, so distributing your requests across multiple IPs may help.
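
If you do have a pool of proxies, here is a hypothetical sketch of rotating through them, assuming capture() passes a requests-style proxies mapping through to the underlying HTTP call (the proxy URLs below are placeholders):

import itertools
import archiveis

# Placeholder proxy endpoints; substitute real ones.
proxy_pool = itertools.cycle([
    {"http": "http://proxy1:8080", "https": "http://proxy1:8080"},
    {"http": "http://proxy2:8080", "https": "http://proxy2:8080"},
])

for url in ["http://www.example.com/", "https://google.com/"]:
    # Assumption: capture() accepts a proxies dict and hands it to requests.
    archive_url = archiveis.capture(url, proxies=next(proxy_pool))
    print(archive_url)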

I have also noticed that they throttle me far more aggressively when I use Cloudflare 1.1.1.1 or Warp. This likely applies to other VPNs as well, but I haven't tested them personally.

It would be nice to wrap this project in a queuing system that treats submissions as jobs and retries them automatically, making it more robust and letting it act like a "service". I haven't seen anyone doing exactly that yet; a rough sketch of the retry piece is below.
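
A minimal sketch of the retry layer such a service would need, wrapping capture() with exponential backoff on 429 responses (the function name and delays here are illustrative, not part of the library):

import time
import archiveis
import requests

def capture_with_retries(url, max_attempts=5, base_delay=60):
    """Retry capture() with exponential backoff on 429 responses."""
    for attempt in range(max_attempts):
        try:
            return archiveis.capture(url)
        except requests.exceptions.HTTPError as exc:
            if exc.response is not None and exc.response.status_code == 429:
                # Back off: 60s, 120s, 240s, ... before the next attempt.
                time.sleep(base_delay * 2 ** attempt)
            else:
                raise
    raise RuntimeError(f"Gave up on {url} after {max_attempts} attempts")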

I have also noticed that, while it's infrequent, some pages seem to crash the archive.is archiver and never get archived successfully. I'm not sure if there's a way to report that to them.