palewire / savepagenow

A simple Python wrapper and command-line interface for archive.org’s "Save Page Now" capturing service
https://palewi.re/docs/savepagenow/
MIT License

Added functionality for using savepagenow with authentication #45

duckduckgrayduck closed 1 year ago

duckduckgrayduck commented 1 year ago

This PR adds the ability to make authenticated Wayback saves. The user needs to set two local environment variables: 'secret', which holds their S3 secret key from the Internet Archive, and 'access_key', which holds their access key from the Internet Archive, as described in the Wayback API spec here: https://docs.google.com/document/d/1Nsv52MvSjbLb2PCpHlat0gkzw0EvtSgpKHu4mk0MnrA/edit?pli=1

Both are optional; if they are not set, it falls back to the default unauthenticated saves.
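A minimal sketch of the optional-credentials fallback described above, not the code in this PR. The helper name `build_auth_headers` is hypothetical, the environment variable names are the ones given in this comment, and the `LOW access_key:secret` header format is assumed from the linked Wayback API spec.

```python
import os


def build_auth_headers(environ=None):
    """Return an Authorization header if both keys are set, else no headers.

    Hypothetical helper for illustration; savepagenow's real code may differ.
    """
    env = os.environ if environ is None else environ
    access_key = env.get("access_key")
    secret = env.get("secret")
    if access_key and secret:
        # Assumed header format: S3-style keys sent as "LOW <access>:<secret>".
        return {"Authorization": f"LOW {access_key}:{secret}"}
    # Either key missing: fall back to an unauthenticated request.
    return {}
```

Passing a plain dict instead of `os.environ` keeps the helper easy to exercise without touching the real environment.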

palewire commented 1 year ago

Love the patch. Here's my picky list of picky stuff. If we get this stuff in, I'm ready to merge.

duckduckgrayduck commented 1 year ago

I've added the unit test (which will only pass once savepagenow is repackaged, because it is imported as a library into the tests and doesn't have access to the new method yet), changed the user agent back to savepagenow, added documentation (I had to move to a new version of sphinx napoleon to do so, which resulted in a lot of files showing as 'changed'), added custom error messages, and ran black and pylint. It should be ready for review.

palewire commented 1 year ago

I made a few tweaks and merged this in. Mainly I'd like to have more specific env variable names. The rest of my changes are all gloss.

Can you point me to where you sourced the 4 vs 12 request limit facts?

duckduckgrayduck commented 1 year ago

https://docs.google.com/document/d/1Nsv52MvSjbLb2PCpHlat0gkzw0EvtSgpKHu4mk0MnrA/edit Page 8 Max captures per minute for authenticated users = 12 and for anonymous users = 4.

palewire commented 1 year ago

Thanks. We should be out as version 1.3.0. Give it a try. Thanks again.

duckduckgrayduck commented 1 year ago

Works great! Thanks again.

overcast07 commented 10 months ago

> https://docs.google.com/document/d/1Nsv52MvSjbLb2PCpHlat0gkzw0EvtSgpKHu4mk0MnrA/edit Page 8 Max captures per minute for authenticated users = 12 and for anonymous users = 4.

The document has been out of date for a while. Around May, they changed the limit to 6 per minute for authenticated users and 3 per minute for anonymous users, but it seems the document was never updated to reflect that.

duckduckgrayduck commented 10 months ago

@overcast07 I'm not sure where you got those numbers. I've been in direct communication with the Internet Archive folks.

overcast07 commented 10 months ago

> @overcast07 I'm not sure where you got those numbers. I've been in direct communication with the Internet Archive folks.

I created and frequently use a Bash script that can submit a list of URLs to Save Page Now, both with and without authentication. I haven't been in contact with the Internet Archive about it (I just didn't have much of a reason to) and they have never tried to contact me.

In my testing, it has been impossible to submit more than 6 URLs per minute for several months. The script submits URLs as frequently as every 3 seconds, and has done this for about 2 years, so it was quite noticeable when there suddenly started being a long gap between successful URL submissions after every 6th URL. Previously, the actual limit was probably 12 URLs, but it wasn't calculated in the same way until earlier this year (you could submit more than 12 URLs per minute by submitting them rapidly before the first one started processing), and shortly after they fixed this the limit was reduced to 6.
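For anyone who wants to pace submissions on the client side, here is a rough sliding-window sketch. The class name is hypothetical, the limit of 6 per 60 seconds is only the value observed above, and the server's actual accounting may differ.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most max_calls submissions in any rolling period (seconds)."""

    def __init__(self, max_calls, period=60.0, clock=time.monotonic):
        self.max_calls = max_calls
        self.period = period
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self.calls = deque()  # timestamps of recorded submissions, oldest first

    def wait_time(self):
        """Return how many seconds to wait before the next submission is allowed."""
        now = self.clock()
        # Forget submissions that have aged out of the window.
        while self.calls and now - self.calls[0] >= self.period:
            self.calls.popleft()
        if len(self.calls) < self.max_calls:
            return 0.0
        # Window is full: wait until the oldest submission expires.
        return self.period - (now - self.calls[0])

    def record(self):
        """Note that a submission was just made."""
        self.calls.append(self.clock())
```

In use, a script would call `time.sleep(limiter.wait_time())` before each submission and `limiter.record()` right after it.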

The website provides an endpoint (https://web.archive.org/save/status/user) which tells you whether you have any slots left to use. Since May 2023, the Bash script has used that data, when authenticated, to check whether the site would return the "You have already reached the limit of active Save Page Now sessions" message for the next URL submitted, so that it avoids repeatedly receiving that error message.
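A small Python sketch of the same check, under a loud assumption: that the endpoint returns JSON with an integer `available` field. That field name is not confirmed anywhere in this thread, and the function name `has_free_slot` is hypothetical. Fetching the body (with the authentication header) is left to whatever HTTP client you use.

```python
import json


def has_free_slot(status_body):
    """Return True if the status payload reports at least one free session.

    Assumes a JSON object with an integer "available" field (unverified here).
    """
    data = json.loads(status_body)
    # Treat a missing field as "no free slot" so the caller errs on the side of waiting.
    return int(data.get("available", 0)) > 0
```

A caller would skip or delay the next submission whenever this returns False.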

duckduckgrayduck commented 9 months ago

Hey @overcast07, I've contacted the Internet Archive team and just want you to know that you are correct. I've made a new PR to update the documentation for savepagenow: https://github.com/palewire/savepagenow/pull/48