Mr0grog closed this pull request 3 years ago.
Test failures are because access keys are required for the new API. They should probably be set as repo secrets (or we could use VCR or something similar to mock out the requests and responses).
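If we go the VCR route, a minimal sketch with vcrpy might look something like the following. (The cassette path, the `capture_v2()` keyword names, and the asserted URL prefix are all placeholders, not anything this repo has settled on.)

```python
# Sketch only: record real HTTP traffic once, then replay it in CI so the
# tests never need live S3-like keys. Assumes the vcrpy package and the
# capture_v2() function proposed in this PR; argument names are placeholders.
import vcr

import savepagenow


@vcr.use_cassette("tests/cassettes/capture_v2.yaml", filter_headers=["Authorization"])
def test_capture_v2():
    archive_url = savepagenow.capture_v2(
        "https://example.com/",
        access_key="fake-access-key",
        secret_key="fake-secret-key",
    )
    assert archive_url.startswith("https://web.archive.org/web/")
```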
> Official documentation can be found at: https://docs.google.com/document/d/1Nsv52MvSjbLb2PCpHlat0gkzw0EvtSgpKHu4mk0MnrA (Yes, this is the official link.)
(I realize that this isn't directly related to this pull request, but having written an entire Bash script for working with Save Page Now v2, I'm kind of dumbfounded that this page is how I found out about the documentation ten minutes ago, since it doesn't appear to be linked to from anywhere on archive.org.)
I guess the Bash script might be tangentially useful for this pull request, though. Because I wrote it under the assumption that there was no API documentation at all, some parts of it could probably be done in a better way using API features I wasn't aware of.
> I'm kind of dumbfounded that this page is how I found out about the documentation ten minutes ago, since it doesn't appear to be linked to from anywhere on archive.org.
Oof, I hear you there. (Also, amazing work on that bash script with no docs!) I wish the Internet Archive had more cohesive and complete documentation for all this kind of stuff. Lots isn't even documented at all. FWIW, I've had the most luck getting info through IA's Slack (internetarchive.slack.com).
Ping! Just thought I'd check in here, since it's been sitting for more than half a year — would love to have any feedback on this, even if it's just "this over-complicates the tool and isn't worthwhile."
@Mr0grog If you don't get a response, I'm happy to help maintain a fork of savepagenow with these updates (I co-maintain https://github.com/bibanon/tubeup, so am familiar with interacting with IA infra). I use savepagenow myself, and would like to see these changes implemented as the current implementation is a bit dated and there's room for improvement.
Sorry I've been MIA. Seeing how complex and how different v2 is, I think it would be best for you to spin off your own thing. I doubt I have the time to tackle this. Thanks for considering it tho.
This is a first draft of support for the new(ish) v2 Save Page Now API, but it’s not really ready to merge yet. It could probably use some deeper thought on function naming, more detailed exception types, and CLI support. I’d also appreciate some general feedback on the direction here.
Background: Save Page Now shipped v2 a little while ago, and it now has a proper API to use rather than just acting like a browser and requesting `https://web.archive.org/save/<url>`. It allows you to configure a number of useful features like cookies, login information, etc., but requires authentication. In this commit, I’ve added support for it as a separate `capture_v2()` function so as not to break existing users of this package who won’t have authentication configured.

Official documentation can be found at: https://docs.google.com/document/d/1Nsv52MvSjbLb2PCpHlat0gkzw0EvtSgpKHu4mk0MnrA (Yes, this is the official link.)

Authorization is done through the Internet Archive’s “S3-like” keys (find yours at https://archive.org/account/s3.php). You can set them as function arguments or as the environment variables `IAS3_ACCESS_KEY` and `IAS3_SECRET_KEY`.

The API is based around a queue: you make one call to enqueue a capture job, then poll the status endpoint until the job has completed. I’ve broken the implementation up into one function for each of those calls plus a wrapper that handles the whole process. A user might want access to the lower-level functions in order to start parallel captures.
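For anyone skimming the diff, the raw flow those functions wrap looks roughly like this. (This is just an illustrative sketch based on my reading of the SPN2 doc, not code from the diff; double-check the endpoints and response fields against the doc before relying on them.)

```python
# Rough sketch of the raw SPN2 queue/poll flow; endpoints and response fields
# are taken from the SPN2 doc, so treat them as assumptions rather than gospel.
import os
import time

import requests

headers = {
    "Accept": "application/json",
    "Authorization": f"LOW {os.environ['IAS3_ACCESS_KEY']}:{os.environ['IAS3_SECRET_KEY']}",
}

# 1. Enqueue a capture job.
response = requests.post(
    "https://web.archive.org/save",
    headers=headers,
    data={"url": "https://example.com/"},
)
job_id = response.json()["job_id"]

# 2. Poll the status endpoint until the job leaves the "pending" state.
while True:
    status = requests.get(
        f"https://web.archive.org/save/status/{job_id}", headers=headers
    ).json()
    if status["status"] != "pending":
        break
    time.sleep(2)

if status["status"] == "success":
    print(f"https://web.archive.org/web/{status['timestamp']}/{status['original_url']}")
```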
Some notes and questions:
- I’ve used a few Python 3.6+ features, since it looks like that’s the minimum requirement based on the classifiers in `setup.py`, and the `Pipfile` says 3.8+. Let me know if this actually needs to support older Python versions.
- I added a new `BlockedUrl` exception, since the API returns information about URL blocking that isn’t necessarily caused by `robots.txt` files.
- There is a lot of detailed error information in the API, so it seems like we could add some more detailed exception types. At the very least, it might be useful to differentiate errors contacting the requested URL vs. errors on Wayback’s part.
- I added some more contextual information to the exceptions in an ad-hoc way, but it would probably be better to modify the exception classes to directly support that info (one possible shape is sketched after this list).
- I’m not sure the function names I’ve chosen here are great. First off, v2 is not really a differentiator — `capture()` is also using v2, but through the browser interface instead of the API. `queue_capture_*` and `get_status_*` also don’t feel especially cohesive to me.
- I’ve used the same parameter names and default values as the API. Not sure if that feels ideal. I think it might be a good idea to require everything but `target_url` to be a keyword argument (see the signature sketch after this list).
- Credential arguments come after `user_agent` and `accept_cache` to match the existing API. It feels a little awkward to place them there, though, since they are critical to set. Maybe I should have put them at the end.
- I haven’t yet added CLI support. Should it just automatically use the API if S3-like keys are set, or should someone have to explicitly opt in via a CLI option? (A possible opt-in flag is sketched after this list.)
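To make the exception idea above a bit more concrete, here is one possible shape. Everything here except the `BlockedUrl` name is hypothetical; the diff currently adds `BlockedUrl` without any of these attributes.

```python
# Hypothetical sketch only: one way the exceptions could carry the API's error
# details as structured attributes instead of ad-hoc message text.
class WaybackError(Exception):
    """Base class for errors reported by the Save Page Now v2 API."""

    def __init__(self, message, status_ext=None, job_id=None):
        super().__init__(message)
        # `status_ext` is the machine-readable error code the status endpoint
        # returns (e.g. "error:blocked-url"); `job_id` ties it back to the job.
        self.status_ext = status_ext
        self.job_id = job_id


class TargetError(WaybackError):
    """The *requested* URL could not be captured (e.g. it returned a 5xx)."""


class WaybackServiceError(WaybackError):
    """Something went wrong on Wayback's side rather than the target's."""


class BlockedUrl(WaybackError):
    """The URL is blocked for reasons beyond robots.txt."""
```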
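And the signature question in concrete form. This is a hypothetical variant, not what the diff does: everything after `target_url` is keyword-only, and the credentials move to the end, falling back to the environment variables when not passed explicitly.

```python
import os

# Hypothetical alternative signature, not the current diff. The credential
# parameter names and the default user agent string are placeholders.
DEFAULT_USER_AGENT = "savepagenow (https://github.com/pastpages/savepagenow)"


def capture_v2(
    target_url,
    *,
    user_agent=DEFAULT_USER_AGENT,
    accept_cache=False,
    access_key=None,
    secret_key=None,
):
    access_key = access_key or os.environ.get("IAS3_ACCESS_KEY")
    secret_key = secret_key or os.environ.get("IAS3_SECRET_KEY")
    ...
```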
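Finally, if we go the explicit opt-in route for the CLI, it could be as small as a flag like this (assuming the existing click-based command; the flag name is just a placeholder):

```python
# Hypothetical opt-in flag for the CLI; assumes the existing command uses click
# and that capture_v2() is the API-backed function this PR adds.
import click

from savepagenow import capture, capture_v2


@click.command()
@click.argument("url")
@click.option("--api/--no-api", default=False, help="Use the authenticated v2 API.")
def cli(url, api):
    archive_url = capture_v2(url) if api else capture(url)
    click.echo(archive_url)


if __name__ == "__main__":
    cli()
```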
Any other feedback is very welcome!
Fixes #22.