palewire / archiveis

A simple Python wrapper for the archive.is capturing service
https://archive.is
MIT License

Behaviour of the `/submit/` endpoint #22

Open antonalekseev opened 4 years ago

antonalekseev commented 4 years ago

The behaviour of the https://archive.md/submit/ endpoint has changed recently. It now returns a WIP page URL in the Refresh header (https://archive.md/wip/Z6uhm), which shows the capture progress and expects the client to retry until the page is captured and the proper memento URL (https://archive.md/Z6uhm) is returned via the Location header. As a result, archiveis.capture() always returns the URL of the WIP page.

This can be fixed either by retrying until the proper URL is available (and somehow handling errors if it never becomes available) or by simply stripping /wip/ from the URL and hoping for the best.

>>> archive_url = archiveis.capture("https://example.com")
DEBUG:archiveis.api:Requesting https://archive.md/
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): archive.md:443
DEBUG:urllib3.connectionpool:https://archive.md:443 "GET / HTTP/1.1" 200 4997
DEBUG:archiveis.api:Unique identifier: QxbCURgTX9qqOlJsvO7Qnp6OpwoRYUx3YErVZz1eLx4aUht3+iuOB+6Ili4WD2Y2
DEBUG:archiveis.api:Requesting https://archive.md/submit/
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): archive.md:443
DEBUG:urllib3.connectionpool:https://archive.md:443 "POST /submit/ HTTP/1.1" 200 244
DEBUG:archiveis.api:Memento from Refresh header: https://archive.md/wip/Z6uhm
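The stripping approach could look like the sketch below. It assumes the WIP URL always has the shape https://archive.md/wip/<id>; the helper name strip_wip is hypothetical, not part of the archiveis API.

```python
def strip_wip(archive_url):
    """Turn a WIP progress URL into the (hoped-for) final memento URL.

    e.g. https://archive.md/wip/Z6uhm -> https://archive.md/Z6uhm

    This removes only the first "/wip/" path segment and does no
    error detection: if the capture ultimately fails, the returned
    link will 404, exactly as with the old pre-WIP behaviour.
    """
    return archive_url.replace("/wip/", "/", 1)


print(strip_wip("https://archive.md/wip/Z6uhm"))  # https://archive.md/Z6uhm
```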
palewire commented 4 years ago

Do you think stripping the /wip/ will work reliably?

antonalekseev commented 4 years ago

I reckon it will not be any less reliable than the current archiveis code was with the old-style (pre-WIP-page) handling on the server side. The Refresh: header was available as soon as the Loading... page was, and it was returned by archiveis.capture() immediately and unconditionally. So unsuccessful archivals (Error: time out., Error: Network error., and an infinite Loading...) were not handled anyway, and the resulting link ultimately yielded a 404. Stripping /wip/ should work the same way.

On the one hand, bluntly ignoring errors is not ideal; on the other hand, waiting up to 3-5 minutes on each call is also not an option for many use cases. Maybe it makes sense to introduce something like an archiveis.capture(..., strict=False) parameter, where the default keeps the shortcut (and existing) behaviour, and an optional strict=True mode polls the WIP page, parses it for all kinds of errors, and raises exceptions?
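One possible shape for the strict mode is sketched below. This is not the archiveis API: the function name, the error markers, the timeout values, and the assumption that the server redirects the WIP URL to the memento once the capture finishes are all guesses that would need checking against the live service.

```python
import time

import requests

# Error strings reportedly shown on failed WIP pages (an assumption).
WIP_ERRORS = ("Error: time out.", "Error: Network error.")


class CaptureError(Exception):
    """Raised when a capture fails or does not finish in time."""


def wait_for_memento(wip_url, timeout=300, interval=5):
    """Poll a /wip/ progress page until the final memento URL is ready.

    Assumes the server answers with a redirect to the memento URL once
    the capture succeeds, and renders a known error marker otherwise.
    Raises CaptureError on a marker or when `timeout` seconds elapse.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        response = requests.get(wip_url, allow_redirects=False)
        if response.is_redirect:
            # Capture finished; the Location header holds the memento URL.
            return response.headers["Location"]
        for marker in WIP_ERRORS:
            if marker in response.text:
                raise CaptureError(marker)
        time.sleep(interval)
    raise CaptureError("capture did not finish within %s seconds" % timeout)
```

With something like this, strict=True could call wait_for_memento() on the Refresh URL, while the default path keeps returning immediately.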

palewire commented 3 years ago

Do you have any ideas about how we could implement this in Python?