Support for the v2 POST API

palewire / savepagenow

A simple Python wrapper and command-line interface for archive.org’s "Save Page Now" capturing service

https://palewi.re/docs/savepagenow/

MIT License


Support for the v2 POST API #22

Closed Mr0grog closed 3 years ago

Mr0grog commented 4 years ago

Do you have any interest in supporting the v2 POST API? It requires authentication (use your “S3-like API credentials” from https://archive.org/account/s3.php), but has a lot of super useful features.

I’ve been poking at it a fair amount lately, and would be happy to try and help add support for it here if you’re interested.

Mr0grog commented 4 years ago

(Update: looks like they are concerned about extra load on that API under the current situation, so I dropped the docs link.)

vegetableman commented 4 years ago

@Mr0grog Would appreciate if you could tell me the difference between v1 and v2 POST api besides the outlinks? Can't find the api docs.

Mr0grog commented 4 years ago

> Can't find the api docs.

@vegetableman see above comment — docs are not yet public, so I had to remove the link. I got ahead of myself here and thought they were ready for broader use after this blog article: https://blog.archive.org/2019/10/23/the-wayback-machines-save-page-now-is-new-and-improved/

> Would appreciate if you could tell me the difference between v1 and v2 POST api

The POST-based API has some pretty fancy features. You can:

Poll for completion or just make the request to save and forget it
Save a screenshot
Save error pages (normally it only saves 2xx responses)
Set a time limit
Set cookies to use when requesting the page to save
Set basic auth credentials to use when requesting the page to save
Get a huge amount of useful metadata about the saved result
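As a rough illustration of the list above, here is a minimal sketch of what building an authenticated capture request might look like. Nothing in this thread confirms the endpoint, the `LOW access:secret` authorization scheme, or the form field names `capture_screenshot` and `capture_all`; treat all of them as assumptions to check against the official docs once they are public. The helper `build_capture_request` is hypothetical and only builds the request, it does not send it.

```python
import urllib.parse
import urllib.request

# Assumed endpoint; the thread does not state it.
SPN2_ENDPOINT = "https://web.archive.org/save"


def build_capture_request(url, access_key, secret_key,
                          screenshot=False, capture_errors=False):
    """Build (but do not send) a POST request asking SPN to capture `url`."""
    form = {"url": url}
    if screenshot:
        form["capture_screenshot"] = "1"  # assumed field name: save a screenshot
    if capture_errors:
        form["capture_all"] = "1"  # assumed field name: also save non-2xx pages
    body = urllib.parse.urlencode(form).encode("ascii")
    return urllib.request.Request(
        SPN2_ENDPOINT,
        data=body,
        method="POST",
        headers={
            "Accept": "application/json",
            # Assumed scheme for the "S3-like API credentials".
            "Authorization": f"LOW {access_key}:{secret_key}",
        },
    )


# Placeholder credentials; real ones come from https://archive.org/account/s3.php
request = build_capture_request(
    "https://example.com/", "MY_ACCESS_KEY", "MY_SECRET_KEY", screenshot=True
)
```

Passing `request` to `urllib.request.urlopen` would actually submit the capture, so the sketch stops short of that.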

vegetableman commented 4 years ago

@Mr0grog Thanks Rob for sharing the details 👍. I was able to dig up some details on the new api through devtools on the spn page, specifically save/status/<jobId>, which fetches the save completion status for a given jobId.

Although I do get the new snapshot details for a url, the issue is, even after the completion of the job, the new snapshot details are still not readily available for the url through this api: https://archive.org/wayback/available?url=<url>

It takes roughly 10 to 30 minutes or more for the details to become available.
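For readers following along, polling the save/status/<jobId> endpoint mentioned above could be sketched roughly as below. The host, the JSON response shape, and the `"pending"` status value are assumptions, not something confirmed in this thread. `fetch_status` and `wait_for_job` are hypothetical helper names, and the `fetch` and `sleep` parameters exist only so the loop can be exercised without network access.

```python
import json
import time
import urllib.request

# Assumed host and path prefix for the status endpoint.
STATUS_URL = "https://web.archive.org/save/status/{job_id}"


def fetch_status(job_id):
    """Fetch the status document for a capture job (assumes a JSON response)."""
    with urllib.request.urlopen(STATUS_URL.format(job_id=job_id)) as response:
        return json.load(response)


def wait_for_job(job_id, fetch=fetch_status, interval=5.0,
                 max_attempts=60, sleep=time.sleep):
    """Poll until the job is no longer pending, then return its status dict."""
    for attempt in range(max_attempts):
        status = fetch(job_id)
        # Assumption: an unfinished job reports {"status": "pending"}.
        if status.get("status") != "pending":
            return status
        if attempt < max_attempts - 1:
            sleep(interval)
    raise TimeoutError(f"job {job_id} still pending after {max_attempts} checks")
```

A caller that only wants to "save and forget" would skip `wait_for_job` entirely and keep the job ID for later.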

Mr0grog commented 4 years ago

Hmmmm, my experience has been that the availability time is about the same as an old-style GET request to SPN. (But I’ve been using the CDX API, not the availability API.) Are you sure it’s not just that SPN has been under very heavy load since the current coronavirus situation started?

vegetableman commented 4 years ago

Alright. So, the CDX api /search/cdx?url= is what I should have been using. Unlike the available api, I am getting the new snapshot details immediately through it. Thanks Rob 👍🙂.
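A minimal sketch of the CDX lookup described above, assuming the full path is `https://web.archive.org/cdx/search/cdx`, that `output=json` returns a list of rows whose first row is the column header, and that a negative `limit` returns the most recent captures. The helper names `build_cdx_url` and `parse_cdx_rows` are hypothetical.

```python
import urllib.parse

# Assumed full endpoint; the comment above only gives "/search/cdx?url=".
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_url(url, limit=-1):
    """Build a CDX query URL; limit=-1 is assumed to mean "latest capture only"."""
    query = urllib.parse.urlencode({"url": url, "output": "json", "limit": limit})
    return f"{CDX_ENDPOINT}?{query}"


def parse_cdx_rows(rows):
    """Turn the JSON rows (header row first) into a list of dicts."""
    if not rows:
        return []
    header, *records = rows
    return [dict(zip(header, record)) for record in records]
```

Fetching `build_cdx_url("example.com")`, decoding the body with `json.loads`, and passing the result to `parse_cdx_rows` would give one dict per capture, keyed by whatever column names the header row contains.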

vegetableman commented 4 years ago

I don't think the virus situation has anything to do with this. Also, I don't think SPN is at fault. My guess is that the data source for the availability api is updated through a queue of some sort, whereas the data source for CDX is updated immediately.

palewire commented 4 years ago

I'd be open to including such a thing, but I'm clearly pretty far behind you on the learning curve. If you had time to prepare a pull request I'd be open to it.

palewire commented 3 years ago

Like I said in July, I'd take a pull request if someone had one. In the meantime, I'm going to close this ticket as stale.

Mr0grog commented 3 years ago

@palewire Sorry for the radio silence! This was the kick in the pants I needed to check back in with folks at the Internet Archive and they are 👍 on it now, so I will try and find some time to start implementing.

Mr0grog commented 3 years ago

OK, I posted a draft of this as #31. It probably gets the job 80% done, but needs some feedback before spending a whole lot more time on it.