Mr0grog closed this pull request 3 years ago.
Test failures are because access keys are required for the new API. They should probably be set as repo secrets (or we could use VCR or something similar to mock out the requests and responses).
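If we go the VCR route, a minimal sketch with vcrpy might look something like the following. (The cassette path, the `capture_v2()` keyword names, and the asserted URL prefix are all placeholders, not anything this repo has settled on.)

```python
# Sketch only: record real HTTP traffic once, then replay it in CI so the
# tests never need live S3-like keys. Assumes the vcrpy package and the
# capture_v2() function proposed in this PR; argument names are placeholders.
import vcr

import savepagenow


@vcr.use_cassette("tests/cassettes/capture_v2.yaml", filter_headers=["Authorization"])
def test_capture_v2():
    archive_url = savepagenow.capture_v2(
        "https://example.com/",
        access_key="fake-access-key",
        secret_key="fake-secret-key",
    )
    assert archive_url.startswith("https://web.archive.org/web/")
```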
> Official documentation can be found at: https://docs.google.com/document/d/1Nsv52MvSjbLb2PCpHlat0gkzw0EvtSgpKHu4mk0MnrA (Yes, this is the official link.)
(I realize that this isn't directly related to this pull request, but having written an entire Bash script for working with Save Page Now v2, I'm kind of dumbfounded that this page is how I found out about the documentation ten minutes ago, since it doesn't appear to be linked to from anywhere on archive.org.)
I guess the Bash script might be tangentially useful for this pull request, though. Because I wrote it under the assumption that there was no API documentation at all, some parts of it could probably be done in a better way using API features I wasn't aware of.
> I'm kind of dumbfounded that this page is how I found out about the documentation ten minutes ago, since it doesn't appear to be linked to from anywhere on archive.org.
Oof, I hear you there. (Also, amazing work on that bash script with no docs!) I wish the Internet Archive had more cohesive and complete documentation for all this kind of stuff. Lots isn't even documented at all. FWIW, I've had the most luck getting info through IA's Slack (internetarchive.slack.com).
Ping! Just thought I'd check in here, since it's been sitting for more than half a year — would love to have any feedback on this, even if it's just "this over-complicates the tool and isn't worthwhile."
@Mr0grog If you don't get a response, I'm happy to help maintain a fork of savepagenow with these updates (I co-maintain https://github.com/bibanon/tubeup, so am familiar with interacting with IA infra). I use savepagenow myself, and would like to see these changes implemented as the current implementation is a bit dated and there's room for improvement.
Sorry I've been MIA. Seeing how complex and how different v2 is, I think it would be best for you to spin off your own thing. I doubt I have the time to tackle this. Thanks for considering it tho.
This is a first draft of support for the new(ish) v2 Save Page Now API, but it’s not really ready to merge yet. It could probably use some deeper thought on function naming, more detailed exception types, and CLI support. I’d also appreciate some general feedback on the direction here.
Background: Save Page Now shipped v2 a little while ago, and it now has a proper API to use rather than just acting like a browser and requesting `https://web.archive.org/save/<url>`. It allows you to configure a number of useful features like cookies, login information, etc., but requires authentication. In this commit, I’ve added support for it as a separate `capture_v2()` function so as not to break existing users of this package who won’t have authentication configured.

Official documentation can be found at: https://docs.google.com/document/d/1Nsv52MvSjbLb2PCpHlat0gkzw0EvtSgpKHu4mk0MnrA (Yes, this is the official link.)

Authorization is done through the Internet Archive’s “S3-like” keys (find yours at https://archive.org/account/s3.php). You can set them as function arguments or as the environment variables `IAS3_ACCESS_KEY` and `IAS3_SECRET_KEY`.

The API is based around a queue: you make one call to enqueue a capture job, then poll the status endpoint until the job has completed. I’ve broken the implementation up into one function for each of those calls plus a wrapper that handles the whole process. A user might want access to the lower-level functions in order to start parallel captures.
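For anyone skimming the diff, the raw flow those functions wrap looks roughly like this. (This is just an illustrative sketch based on my reading of the SPN2 doc, not code from the diff; double-check the endpoints and response fields against the doc before relying on them.)

```python
# Rough sketch of the raw SPN2 queue/poll flow; endpoints and response fields
# are taken from the SPN2 doc, so treat them as assumptions rather than gospel.
import os
import time

import requests

headers = {
    "Accept": "application/json",
    "Authorization": f"LOW {os.environ['IAS3_ACCESS_KEY']}:{os.environ['IAS3_SECRET_KEY']}",
}

# 1. Enqueue a capture job.
response = requests.post(
    "https://web.archive.org/save",
    headers=headers,
    data={"url": "https://example.com/"},
)
job_id = response.json()["job_id"]

# 2. Poll the status endpoint until the job leaves the "pending" state.
while True:
    status = requests.get(
        f"https://web.archive.org/save/status/{job_id}", headers=headers
    ).json()
    if status["status"] != "pending":
        break
    time.sleep(2)

if status["status"] == "success":
    print(f"https://web.archive.org/web/{status['timestamp']}/{status['original_url']}")
```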
Some notes and questions:
- I’ve used a few Python 3.6+ features, since it looks like that’s the minimum requirement based on the classifiers in `setup.py`, and the `Pipfile` says 3.8+. Let me know if this actually needs to support older Python versions.
- I added a new `BlockedUrl` exception, since the API returns information about URL blocking that isn’t necessarily caused by `robots.txt` files.
- There is a lot of detailed error information in the API, so it seems like we could add some more detailed exception types. At the very least, it might be useful to differentiate errors contacting the requested URL vs. errors on Wayback’s part.
- I added some more contextual information to the exceptions in an ad-hoc way, but it would probably be better to modify the exception classes to directly support that info (one possible shape is sketched after this list).
- I’m not sure the function names I’ve chosen here are great. First off, v2 is not really a differentiator — `capture()` is also using v2, but through the browser interface instead of the API. `queue_capture_*` and `get_status_*` also don’t feel especially cohesive to me.
- I’ve used the same parameter names and default values as the API. Not sure if that feels ideal. I think it might be a good idea to require everything but `target_url` to be a keyword argument (see the signature sketch after this list).
- Credential arguments come after `user_agent` and `accept_cache` to match the existing API. It feels a little awkward to place them there, though, since they are critical to set. Maybe I should have put them at the end.
- I haven’t yet added CLI support. Should it just automatically use the API if S3-like keys are set, or should someone have to explicitly opt in via a CLI option? (A possible opt-in flag is sketched after this list.)
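To make the exception idea above a bit more concrete, here is one possible shape. Everything here except the `BlockedUrl` name is hypothetical; the diff currently adds `BlockedUrl` without any of these attributes.

```python
# Hypothetical sketch only: one way the exceptions could carry the API's error
# details as structured attributes instead of ad-hoc message text.
class WaybackError(Exception):
    """Base class for errors reported by the Save Page Now v2 API."""

    def __init__(self, message, status_ext=None, job_id=None):
        super().__init__(message)
        # `status_ext` is the machine-readable error code the status endpoint
        # returns (e.g. "error:blocked-url"); `job_id` ties it back to the job.
        self.status_ext = status_ext
        self.job_id = job_id


class TargetError(WaybackError):
    """The *requested* URL could not be captured (e.g. it returned a 5xx)."""


class WaybackServiceError(WaybackError):
    """Something went wrong on Wayback's side rather than the target's."""


class BlockedUrl(WaybackError):
    """The URL is blocked for reasons beyond robots.txt."""
```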
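And the signature question in concrete form. This is a hypothetical variant, not what the diff does: everything after `target_url` is keyword-only, and the credentials move to the end, falling back to the environment variables when not passed explicitly.

```python
import os

# Hypothetical alternative signature, not the current diff. The credential
# parameter names and the default user agent string are placeholders.
DEFAULT_USER_AGENT = "savepagenow (https://github.com/pastpages/savepagenow)"


def capture_v2(
    target_url,
    *,
    user_agent=DEFAULT_USER_AGENT,
    accept_cache=False,
    access_key=None,
    secret_key=None,
):
    access_key = access_key or os.environ.get("IAS3_ACCESS_KEY")
    secret_key = secret_key or os.environ.get("IAS3_SECRET_KEY")
    ...
```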
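Finally, if we go the explicit opt-in route for the CLI, it could be as small as a flag like this (assuming the existing click-based command; the flag name is just a placeholder):

```python
# Hypothetical opt-in flag for the CLI; assumes the existing command uses click
# and that capture_v2() is the API-backed function this PR adds.
import click

from savepagenow import capture, capture_v2


@click.command()
@click.argument("url")
@click.option("--api/--no-api", default=False, help="Use the authenticated v2 API.")
def cli(url, api):
    archive_url = capture_v2(url) if api else capture(url)
    click.echo(archive_url)


if __name__ == "__main__":
    cli()
```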
Any other feedback is very welcome!
Fixes #22.