
archive.org (Internet Archive) helper #219

Kreijstal opened this issue 4 years ago

Kreijstal commented 4 years ago

Project description

The Wayback Machine is a great resource, but sometimes it doesn't have a complete archive of a website, and it doesn't crawl all those little websites where some gems might be hidden. We can help it: what if we crawl a website ourselves and check each link against the Wayback Machine? If a link is already up to date, ignore it; if it is out of date, or the Wayback Machine hasn't archived it at all, just tell the Wayback Machine to archive it. This would help preserve websites that we care about.
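
A minimal sketch of that check-then-save flow, assuming the public Wayback Machine availability API (`https://archive.org/wayback/available`) and the plain Save Page Now URL (`https://web.archive.org/save/<url>`); the freshness threshold and the missing error handling are placeholders, not part of the idea:

```python
import json
import urllib.parse
import urllib.request
from datetime import datetime, timezone

AVAILABILITY_API = "https://archive.org/wayback/available?url="
SAVE_ENDPOINT = "https://web.archive.org/save/"

def latest_snapshot(url):
    """Ask the availability API for the closest snapshot of `url` (None if never archived)."""
    with urllib.request.urlopen(AVAILABILITY_API + urllib.parse.quote(url, safe="")) as resp:
        data = json.load(resp)
    return data.get("archived_snapshots", {}).get("closest")

def save_if_missing(url, max_age_days=365):
    """Request a Save Page Now capture of `url` unless a recent snapshot already exists."""
    snap = latest_snapshot(url)
    if snap and snap.get("available"):
        # Snapshot timestamps look like "20210131120000" (YYYYMMDDhhmmss, UTC).
        taken = datetime.strptime(snap["timestamp"], "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)
        age_days = (datetime.now(timezone.utc) - taken).days
        if age_days <= max_age_days:
            print(f"skip {url} (archived {age_days} days ago)")
            return
    print(f"saving {url}")
    urllib.request.urlopen(SAVE_ENDPOINT + url)  # a plain GET triggers a capture

if __name__ == "__main__":
    save_if_missing("http://example.com/")
```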

Okay, so it turns out that you can, for example, wget a website recursively, which is fine, but you cannot get just the URLs; you would have to download the entire website. Maybe we can start with an application that only crawls URLs.
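
As a rough starting point, here is a sketch of such a URL-only crawler using only the Python standard library; the page limit and the same-site filter are arbitrary choices for illustration (pages are still fetched to extract links, they are just never written to disk):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

class LinkParser(HTMLParser):
    """Collects href targets from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl_urls(start_url, max_pages=100):
    """Breadth-first crawl that only records same-site URLs; page bodies are discarded."""
    site = urlparse(start_url).netloc
    seen, queue = set(), [start_url]
    while queue and len(seen) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            with urlopen(url, timeout=10) as resp:
                if "html" not in resp.headers.get("Content-Type", ""):
                    continue
                html = resp.read().decode(resp.headers.get_content_charset() or "utf-8", "replace")
        except OSError:
            continue
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link).split("#")[0]
            if urlparse(absolute).netloc == site:
                queue.append(absolute)
    return seen

if __name__ == "__main__":
    for u in sorted(crawl_urls("http://example.com/")):
        print(u)
```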

Relevant Technology

This can be achieved with any scripting language, together with tools like wget and curl.

Complexity and required time

Complexity

Required time (ETA)

Categories

KOLANICH commented 4 years ago
1. @Kreijstal, you must delete all traces of the unfilled template (like `[Write what technology is relevant. What language, what platform, any particular library/framework/existing project it is based on?]`) from your message. I have written a GitHub Action that can automatically close issues with problems like this one (it also labels issues automatically; more planned features include moving issues into another repo, deleting them, and even banning authors who exceed a threshold of unfixed invalid issues). It will likely be added to this repo soon, though it is not trivial to make it do the needed clean-up itself automatically. Also, IMHO, modifying someone else's messages is an extreme measure that shouldn't be used in situations like this one.
2. The website is called archive.org.
3. They have recently introduced a feature that crawls websites automatically, like wget does. You just drop a URI and the Wayback Machine crawls it itself and saves what is missing.
4. What is not mentioned in their blog is that they have also introduced throttling. A very dumb and aggressive throttling (a throttle-aware pacing sketch follows below).
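
Given that throttling, any helper would have to pace its save requests. A hedged sketch of what that pacing could look like, assuming the throttling shows up as HTTP 429 responses (the status code, delays, and retry counts are guesses, not documented behaviour):

```python
import time
import urllib.error
import urllib.request

SAVE_ENDPOINT = "https://web.archive.org/save/"

def save_slowly(urls, delay=30, max_retries=3):
    """Submit URLs to Save Page Now one at a time, sleeping between requests
    and backing off when the service answers 429 (Too Many Requests)."""
    for url in urls:
        for attempt in range(max_retries):
            try:
                urllib.request.urlopen(SAVE_ENDPOINT + url, timeout=60)
                print("saved", url)
                break
            except urllib.error.HTTPError as err:
                if err.code == 429:
                    wait = delay * (2 ** attempt)  # exponential back-off
                    print(f"throttled on {url}, waiting {wait}s")
                    time.sleep(wait)
                else:
                    print(f"failed {url}: {err.code}")
                    break
        time.sleep(delay)  # be patient between submissions
```
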
Kreijstal commented 4 years ago
> 1. @Kreijstal, you must delete all traces of the unfilled template (like `[Write what technology is relevant. What language, what platform, any particular library/framework/existing project it is based on?]`) from your message. I have written a GitHub Action that can automatically close issues with problems like this one (it also labels issues automatically; more planned features include moving issues into another repo, deleting them, and even banning authors who exceed a threshold of unfixed invalid issues). It will likely be added to this repo soon, though it is not trivial to make it do the needed clean-up itself automatically. Also, IMHO, modifying someone else's messages is an extreme measure that shouldn't be used in situations like this one.

The template should probably have those strings as comments, like this: `<!-- comment -->`. I never knew I had to delete them, so I thought it was fine, but thank you.

> 3. [They have recently introduced a feature that crawls websites automatically, like wget does.](https://blog.archive.org/2019/10/23/the-wayback-machines-save-page-now-is-new-and-improved/) You just drop a URI and the Wayback Machine crawls it itself and saves what is missing.

Also, yes, but that feature only goes 1 level deep, which is fine, but not exhaustive enough for some websites. I don't mean to spam archive.org with save requests, but I can be patient, and let it slowly crawl until it fills every corner of an old website.

KOLANICH commented 4 years ago

> Also, yes, but that feature only goes 1 level deep, which is fine, but not exhaustive enough for some websites. I don't mean to spam archive.org with save requests, but I can be patient, and let it slowly crawl until it fills every corner of an old website.

Then we definitely need such an app. The W.A. measures are indeed draconian. Earlier it was possible to use wget to crawl the websites, then collect only the needed URIs using grep, then rewrite them to W.A. URIs using sed, then use wget again to archive them to the W.A., and do everything within a few minutes (assuming each item to archive is 15 to 30 MiB and there are not too many such items).
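
For illustration, a rough Python equivalent of that crawl-log pipeline; the input file of crawled URLs and the filter pattern are placeholders (this just mirrors the grep, sed, and second wget steps described above):

```python
import re
import sys
import urllib.request

SAVE_PREFIX = "https://web.archive.org/save/"

def archive_from_log(log_path, pattern=r"\.(html?|pdf)$"):
    """Mimic the wget | grep | sed | wget pipeline: filter crawled URLs,
    rewrite them to Save Page Now URLs, then fetch each one to trigger archiving."""
    with open(log_path) as log:
        urls = [line.strip() for line in log if line.strip()]
    wanted = [u for u in urls if re.search(pattern, u)]  # the grep step
    save_urls = [SAVE_PREFIX + u for u in wanted]        # the sed step
    for save_url in save_urls:                           # the second wget step
        try:
            urllib.request.urlopen(save_url, timeout=60)
            print("archived", save_url)
        except OSError as err:
            print("failed", save_url, err)

if __name__ == "__main__":
    archive_from_log(sys.argv[1] if len(sys.argv) > 1 else "urls.txt")
```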

apurvmishra99 commented 4 years ago

Hi, I had a crack at doing this and although I'm still looking for ways to improve it, I believe it already achieves the functionality @Kreijstal was looking for.

You can check it out here: [Archiver](https://github.com/apurvmishra99/archiver).

Thank you for the idea!

FredrikAugust commented 4 years ago

@Kreijstal would you say this fulfills the requirements for the issue?

Kreijstal commented 4 years ago

@FredrikAugust it has the idea: https://github.com/apurvmishra99/archiver/issues/3

FredrikAugust commented 4 years ago

So can this be closed, @Kreijstal?