palewire / savepagenow

A simple Python wrapper and command-line interface for archive.org’s "Save Page Now" capturing service
https://palewi.re/docs/savepagenow/
MIT License

Add support for Save Page Now v2 API #31

Closed Mr0grog closed 3 years ago

Mr0grog commented 3 years ago

This is a first draft of support for the new(ish) v2 Save Page Now API, but it’s not really ready to merge. It could probably use some deeper thought on function naming, more detailed exception types, and CLI support. I’d also appreciate some general feedback on the direction here.

Background: Save Page Now shipped v2 a little while ago, and it now has a proper API to use rather than just acting like a browser and requesting https://web.archive.org/save/<url>. It allows you to configure a number of useful features like cookies, login information, etc., but requires authentication. In this commit, I’ve added support for it as a separate capture_v2() function so as not to break existing users of this package who won’t have authentication configured. Official documentation can be found at: https://docs.google.com/document/d/1Nsv52MvSjbLb2PCpHlat0gkzw0EvtSgpKHu4mk0MnrA (Yes, this is the official link.)

Authorization is done through the Internet Archive’s “S3-like” keys (find yours at https://archive.org/account/s3.php). You can set them as function arguments or as the environment variables IAS3_ACCESS_KEY and IAS3_SECRET_KEY.
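As a rough illustration of the key lookup described above, here is a minimal sketch that prefers explicit arguments and falls back to the IAS3_ACCESS_KEY and IAS3_SECRET_KEY environment variables. The names resolve_credentials, build_auth_header, and MissingCredentialsError are hypothetical and not taken from this PR, and the "LOW access:secret" header format is an assumption based on the linked API documentation, so check it against the docs before relying on it.

```python
import os


class MissingCredentialsError(Exception):
    """Raised when no access key or secret key can be found (hypothetical name)."""


def resolve_credentials(access_key=None, secret_key=None, environ=None):
    """Return (access_key, secret_key), preferring explicit arguments.

    Falls back to the IAS3_ACCESS_KEY and IAS3_SECRET_KEY environment
    variables named in the PR description.
    """
    environ = os.environ if environ is None else environ
    access_key = access_key or environ.get("IAS3_ACCESS_KEY")
    secret_key = secret_key or environ.get("IAS3_SECRET_KEY")
    if not access_key or not secret_key:
        raise MissingCredentialsError(
            "Set IAS3_ACCESS_KEY and IAS3_SECRET_KEY or pass both keys explicitly."
        )
    return access_key, secret_key


def build_auth_header(access_key, secret_key):
    """Build request headers for an authenticated call.

    Assumption: the API accepts an "Authorization: LOW <access>:<secret>"
    header; verify this against the official documentation.
    """
    return {
        "Authorization": f"LOW {access_key}:{secret_key}",
        "Accept": "application/json",
    }
```

Passing the environment in as a parameter keeps the lookup easy to test without touching real keys.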

The API is based around a queue: you make one call to enqueue a capture job, then poll the status endpoint until the job has completed. I’ve broken the implementation up into one function for each of those calls plus a wrapper that handles the whole process. A user might want access to the lower-level functions in order to start parallel captures.
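The enqueue-then-poll flow can be sketched roughly as follows. This is not the code from this PR: capture_and_wait, CaptureFailed, and CaptureTimeout are hypothetical names, the two API calls are passed in as plain callables so the loop stays independent of any HTTP library, and the "success" and "error" status values are assumptions about what the status endpoint returns.

```python
import time


class CaptureFailed(Exception):
    """The job finished with an error status (hypothetical name)."""


class CaptureTimeout(Exception):
    """The job did not finish before the timeout (hypothetical name)."""


def capture_and_wait(url, enqueue, check_status, poll_interval=2.0,
                     timeout=120.0, sleep=time.sleep, clock=time.monotonic):
    """Enqueue a capture job, then poll until it succeeds, fails, or times out.

    enqueue(url) must return a job ID; check_status(job_id) must return a
    dict with a "status" key. Both are stand-ins for the two API calls.
    """
    job_id = enqueue(url)
    deadline = clock() + timeout
    while True:
        status = check_status(job_id)
        state = status.get("status")
        if state == "success":  # assumed status value
            return status
        if state == "error":  # assumed status value
            raise CaptureFailed(status.get("message", "capture failed"))
        if clock() >= deadline:
            raise CaptureTimeout(f"job {job_id} still pending after {timeout}s")
        sleep(poll_interval)
```

Keeping the two calls separate from the wrapper is what would let a caller enqueue several jobs first and poll them together, as mentioned above for parallel captures.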

Some notes and questions:

Any other feedback is very welcome!

Fixes #22.

Mr0grog commented 3 years ago

Test failures are because access keys are required for the new API. They should probably be set as repo secrets (or we could use VCR or something similar to mock out the requests and responses).

overcast07 commented 3 years ago

> Official documentation can be found at: https://docs.google.com/document/d/1Nsv52MvSjbLb2PCpHlat0gkzw0EvtSgpKHu4mk0MnrA (Yes, this is the official link.)

(I realize that this isn't directly related to this pull request, but having written an entire Bash script for working with Save Page Now v2, I'm kind of dumbfounded that this page is how I found out about the documentation ten minutes ago, since it doesn't appear to be linked to from anywhere on archive.org.)

I guess the Bash script might be tangentially useful for this pull request. Because I wrote it under the assumption that there was no API documentation at all, though, some parts of it could probably be done better using API features that I wasn't aware of.

Mr0grog commented 3 years ago

> I'm kind of dumbfounded that this page is how I found out about the documentation ten minutes ago, since it doesn't appear to be linked to from anywhere on archive.org.

Oof, I hear you there. (Also, amazing work on that bash script with no docs!) I wish the Internet Archive had more cohesive and complete documentation for all this kind of stuff. Lots isn't even documented at all. FWIW, I've had the most luck getting info through IA's Slack (internetarchive.slack.com).

Mr0grog commented 3 years ago

Ping! Just thought I'd check in here, since it's been sitting for more than half a year — would love to have any feedback on this, even if it's just "this over-complicates the tool and isn't worthwhile."

brandongalbraith commented 3 years ago

@Mr0grog If you don't get a response, I'm happy to help maintain a fork of savepagenow with these updates (I co-maintain https://github.com/bibanon/tubeup, so am familiar with interacting with IA infra). I use savepagenow myself, and would like to see these changes implemented as the current implementation is a bit dated and there's room for improvement.

palewire commented 3 years ago

Sorry I've been MIA. Seeing how complex and how different v2 is, I think it would be best for you to spin off your own thing. I doubt I have the time to tackle this. Thanks for considering it tho.