What are the differences between archive.is, archive.org and webcitation.org?

orschiro commented 7 years ago

Dear @rahiel,

When is it best to use which?

rahiel commented 7 years ago

Dear @orschiro,

In the readme I've listed some data, like when the archives were launched and whether they respect robots.txt. Their launch date shows how long they've been able to operate, and gives an indication of whether you can trust them to still be around years later. Webcitation.org and archive.org are the oldest, and I expect them to stay alive. Archive.is is privately funded by its generous owner, while archive.org is a registered non-profit.

Of the three, only archive.is disregards robots.txt, which means it can archive any page online. Archive.org won't archive pages that the website owner has asked robots to ignore. Archive.org also re-reads the robots.txt after the archiving took place, and tries to respect the current one. This means that links that were once archived could be removed from archive.org at a later date. This might change in the future. Webcitation.org only respects robots.txt at the time of archiving, so its archives won't disappear later because of a robots.txt change. (Archives can also disappear when the archiving service receives a DMCA takedown notice, or when website owners ask archive.org directly not to archive their website.)
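To make the robots.txt point concrete, here is a minimal sketch of how a crawler that honors robots.txt might decide whether it may fetch a page. It uses Python's standard `urllib.robotparser`. The rules, the `example-archiver` user agent and the `is_allowed` helper are made up for illustration and do not reflect how any of the three archives actually implement this.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this illustration.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("https://example.com/public/page",
                "https://example.com/private/page"):
        print(url, is_allowed(EXAMPLE_ROBOTS_TXT, "example-archiver", url))
```

An archive that disregards robots.txt simply skips this check. One that re-checks later would run it again against the site's current robots.txt, which is why an existing archive can become unavailable.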

Another aspect is how links are captured: the archives use different techniques for this, and the actual archives can look or behave differently. For example, I remember an archive looking fine on archive.is but having a broken layout on archive.org. This depends on the link itself, so you can only see the difference by checking.

You also don't need to choose: in the options you can now select multiple archiving services, and when you click on the Archiveror button, it will archive the page on all of them. Having the content in more places improves survivability.

For scientific research and other (physically) published content I advise using multiple archives, as citations are worthless if they're unavailable.

At the end of the day you'll have to take everything into consideration and decide which is best for your use case.

orschiro commented 7 years ago

@rahiel thank you so much for your extensive response.

I think this is brilliant.

Especially this part:

"You also don't need to choose: in the options you can now select multiple archiving services, and when you click on the Archiveror button, it will archive the page on all of them. Having the content in more places improves survivability."

elvey commented 3 years ago

Thanks. I just noticed that none of them will archive some pages, such as https://www.researchgate.net/publication/351015221_Home_treatment_of_mildmoderate_COVID-19_To_prevent_severe_disease_IMAGINE/comments. Most troublingly, when a manual archive request is made, archive.is and archive.org both fail, but the failure isn't apparent. They both follow a redirect to the page without the "/comments" and archive that instead; archive.org explicitly and incorrectly reports success, and archive.today does so implicitly. (FYI, this page is the first search result for "archive.today vs archive.org"; sorry for any noise.)

Lieutenant-L-T-Smash commented 2 years ago

@elvey This isn't really the archive's fault. What's happening here is that the researchgate server is actually responding to the page request with a 301 redirect (i.e. "The page you're looking for isn't here, it's been moved to this other place, so look there"). It appears to be doing this when there isn't a referrer (i.e. when the comments page address is typed in directly, rather than being accessed through a link). Why this happens is a mystery to me. The server admins at researchgate decided this is how it should work.

You can reproduce this behavior by starting up a private browser window, and pasting that address directly into the address field. You'll be taken to the article's main page, not the comments page.

The archive service isn't at fault for believing the server when it answers "Not here, look over there". It could potentially try to fake a referrer but that's a little unseemly and could screw up statistics for other sites (who check the referrer to see where their traffic is coming from), which would make archive services unwelcome.
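For anyone who wants to see the mechanism without hitting researchgate, here is a small self-contained sketch. It starts a throwaway local server that answers a request for a comments page with a 301 whenever the `Referer` header is missing, then requests the page with and without that header. The paths, the `RefererGate` handler and the helper names are all hypothetical; this only imitates the behavior described above and is not researchgate's actual logic.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class RefererGate(BaseHTTPRequestHandler):
    """Toy server: /article/comments redirects to /article unless a Referer is sent."""

    def do_GET(self):
        if self.path == "/article/comments" and not self.headers.get("Referer"):
            self.send_response(301)
            self.send_header("Location", "/article")
            self.send_header("Content-Length", "0")
            self.end_headers()
            return
        body = self.path.encode()
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


def fetch_status(port, path, referer=None):
    """Request path once (http.client does not follow redirects)."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    headers = {"Referer": referer} if referer else {}
    conn.request("GET", path, headers=headers)
    resp = conn.getresponse()
    resp.read()
    result = (resp.status, resp.getheader("Location"))
    conn.close()
    return result


def demo():
    """Return the (status, Location) pairs without and with a Referer header."""
    server = HTTPServer(("127.0.0.1", 0), RefererGate)  # port 0: pick a free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    port = server.server_address[1]
    try:
        without = fetch_status(port, "/article/comments")
        with_ref = fetch_status(port, "/article/comments",
                                referer=f"http://127.0.0.1:{port}/article")
        return without, with_ref
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(demo())
```

An archiver that sends no referrer sees only the first case, so from its point of view the comments page has simply moved to the article page.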

elvey commented 2 years ago

Thanks. I thought you had helped me find a workaround, but it didn't work. I asked https://web.archive.org to save the parent page WITH "Saving outlinks and their embedded resources" selected. I thought that would do the trick and save the comments page, but no luck. I also tried editing a web page of mine that linked to the .../comments page and asking it to save that with outlinks. It looked like it was working (I THINK the page showed that it was fetching it; I'm not certain and can't redo the test easily), but the final page showed only /web/20220106015743/https://www.researchgate.net/publication/351015221_Home_treatment_of_mildmoderate_COVID-19_To_prevent_severe_disease_IMAGINE, and https://web.archive.org/web/20220301000000*/https://www.researchgate.net/publication/351015221_Home_treatment_of_mildmoderate_COVID-19_To_prevent_severe_disease_IMAGINE/comments shows green, indicating it redirected. So either the Wayback Machine isn't sending a referer string, or RG is out-dumbing it.
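As an aside on reading those Wayback Machine paths: a snapshot path has the form `/web/<14-digit timestamp>/<original URL>`, with the timestamp written as YYYYMMDDhhmmss. A minimal sketch for pulling the two parts apart follows; `parse_snapshot` is a hypothetical helper name, and calendar-style URLs with a `*` after the timestamp are deliberately rejected.

```python
import re
from datetime import datetime

# /web/<YYYYMMDDhhmmss>/<original URL>; calendar URLs ("...*/") do not match.
WAYBACK_PATH = re.compile(r"/web/(\d{14})/(.+)$")


def parse_snapshot(url: str):
    """Split a Wayback snapshot URL or path into (capture time, original URL)."""
    match = WAYBACK_PATH.search(url)
    if match is None:
        raise ValueError(f"not a Wayback snapshot URL: {url!r}")
    captured = datetime.strptime(match.group(1), "%Y%m%d%H%M%S")
    return captured, match.group(2)


if __name__ == "__main__":
    path = ("/web/20220106015743/https://www.researchgate.net/publication/"
            "351015221_Home_treatment_of_mildmoderate_COVID-19_"
            "To_prevent_severe_disease_IMAGINE")
    print(parse_snapshot(path))
```

Comparing the original URL in the result with the one that was requested is a quick way to spot that a capture landed on a redirect target instead of the intended page.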

Hmm... but what if I make a page that redirects to the /comments page....
