palewire / savemy.news

Save My News: A personal, permanent clipping service
http://savemy.news
MIT License

Incorporate archivenow to push to more archives #1

Closed ibnesayeed closed 6 years ago

ibnesayeed commented 6 years ago

archivenow is a Python library that pushes archival requests to various upstream archives in a modular way. It already supports four archives and is extensible to many more. Archiving web pages in just a single archive is not safe, because web archives die (many have in the past) and big archives are easy targets for governments to block or take down. For example, the Internet Archive is blocked in Russia and China; additionally, it was blocked for a while in India in the past.

archivenow repo: https://github.com/oduwsdl/archivenow

A blog post explaining archivenow: http://ws-dl.blogspot.com/2017/02/2017-02-22-archive-now-archivenow.html
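To sketch what the integration might look like: archivenow exposes a `push` function that submits a URI to an archive identified by a short ID. The helper below is hypothetical (not part of savemy.news or archivenow), and the four archive IDs are my reading of archivenow's documented defaults at the time; treat them as assumptions.

```python
# Hypothetical helper sketching how savemy.news could fan a clip out to
# every archive archivenow supports. The archive IDs are assumed from
# archivenow's docs: Internet Archive ("ia"), archive.is ("is"),
# perma.cc ("cc"), and WebCite ("wc").
ARCHIVE_IDS = ["ia", "is", "cc", "wc"]

def push_to_all(url, archive_ids=ARCHIVE_IDS):
    """Push `url` to each archive; return {archive_id: push result}."""
    # Imported lazily so the rest of the app works without the package;
    # install with `pip install archivenow`.
    from archivenow import archivenow
    return {aid: archivenow.push(url, aid) for aid in archive_ids}
```

Each push is independent, so a failure at one archive would not block the others from being attempted.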

palewire commented 6 years ago

I'm familiar with these other archives (and have written wrappers for a couple of them myself). Great idea. I've even already built in some room for my database to grow to support this kind of thing. Any ideas on what the user interface would look like after it was integrated? Replace the third column of the HTML table with icons or shorter links to the mementos?

ibnesayeed commented 6 years ago

> Any ideas on what the user interface would look like after it was integrated? Replace the third column of the HTML table with icons or shorter links to the mementos?

There are a few ways this can be handled. Icons for the archives are one very good option, but I would not suggest link shorteners, as they add one more point of failure and increase the possibility of link rot. Another good approach is to use an aggregator that points to all the archived copies. If you use the TimeTravel service for linking captures, you can do it without any service requests, entirely on the client side, by putting the 14-digit datetime in the URI (when resolved, the service will point to the closest copies).

Suppose I archive http://example.com now (at 20171107190513). I can construct a link to the closest available copy in any archive as http://timetravel.mementoweb.org/memento/20171107190513/http://example.com and a list of other nearby archival copies as http://timetravel.mementoweb.org/list/20171107190513/http://example.com. These links are resolved lazily, so even if some archives are gone, they will point only to the ones still available.
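The URIs above are easy to generate client-side or server-side with no network request. A minimal sketch, using only the 14-digit `YYYYMMDDhhmmss` Memento datetime format and the two TimeTravel URL patterns shown above (`timetravel_links` is an illustrative name, not an existing API):

```python
from datetime import datetime, timezone

TIMETRAVEL = "http://timetravel.mementoweb.org"

def timetravel_links(url, when=None):
    """Build TimeTravel links for `url` around datetime `when`.

    Returns (closest_copy_uri, nearby_copies_uri). No network request
    is made; the TimeTravel service resolves these lazily when followed.
    """
    when = when or datetime.now(timezone.utc)
    dt14 = when.strftime("%Y%m%d%H%M%S")  # 14-digit Memento datetime
    return (
        f"{TIMETRAVEL}/memento/{dt14}/{url}",
        f"{TIMETRAVEL}/list/{dt14}/{url}",
    )
```

Because the datetime is only a hint, a capture timestamp that is a few seconds off still resolves to the right memento.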

This latter approach is being used in the #ICanHazMemento Twitter bot.

/CC @phonedude @hvdsomp

ibnesayeed commented 6 years ago

@phonedude mentioned that Robust Links could be another way to enable lazy dereferencing.

palewire commented 6 years ago

This morning I added mirroring at webcitation.org going forward. I'm working on a task that loops back through the previously saved clips and mirrors them as well.

ibnesayeed commented 6 years ago

Are you planning to make the recurring archival period and frequency decisions yourself, or let the user choose, with some sensible defaults? Something like:

Revisit [every week] for [three months]

Whatever approach you choose, I would also suggest keeping a trail of the last N attempts, so that if a URI is continuously returning failure codes you can back off on attempts to archive it.

palewire commented 6 years ago

I haven't thought this all the way through, but I'm thinking the archiving will be a one-time operation done at the time the link is submitted.

But as new archives are added, like today, I was thinking it would be good to go back and mirror the ones already in my small database.

ibnesayeed commented 6 years ago

Ah, OK. I thought you were thinking about recurring captures, but your plan is to perform a one-time capture of everything you have archived so far each time you add yet another archive. This might give you some consistency in the UI, where every archive icon will be present against every URI-R, but then you will hit two other issues: 1) the time in the second column will not be representative of all the copies, and 2) the URL might be down or completely gone by the time you add a new archive.

palewire commented 6 years ago

I totally agree with your assessment. If you have any idea for how to solve the situation, I'd appreciate your thoughts.

ibnesayeed commented 6 years ago

One way to solve these issues is to use something like the TimeTravel services (as described in an earlier comment) rather than listing a separate link per archive. Alternatively, the second column's title can be changed to reflect more accurately what it represents (something like the time of the first capture attempt). Additionally, each icon can have a tooltip reflecting the exact Memento-Datetime it points to (note that these times might be off by a few seconds across services, even if the requests were made to all of them in parallel).

palewire commented 6 years ago

archive.is has been implemented for new clips. I'm slowly rolling back through the old ones to catch them up.