Kreijstal opened this issue 4 years ago
1. @Kreijstal, you must delete all traces of the unfilled template (like `[Write what technology is relevant. What language, what platform, any particular library/framework/existing project it is based on?]`) from your message. I have written a GitHub Action that can automatically close issues with problems like this one (it also labels issues automatically; planned features include moving issues into another repo, deleting them, and even banning authors who have exceeded a threshold of unfixed invalid issues). It is likely to be added to this repo soon, though it is not trivial to make it do the needed cleanup itself. Also, IMHO, modifying someone else's messages is an extreme measure that shouldn't be used in situations like this one.
The template should probably have those strings as HTML comments, like this: `<!-- comment -->`
I never knew I had to delete them, so I thought it was fine, but thank you.
3. [They have recently introduced a feature that crawls websites automatically, like wget does.](https://blog.archive.org/2019/10/23/the-wayback-machines-save-page-now-is-new-and-improved/) You just drop a URI and the Wayback Machine crawls it itself and saves what is missing.
Also, yes, but that feature only goes one level deep, which is fine but not exhaustive enough for some websites. I don't mean to spam archive.org with save requests, but I can be patient and let it slowly crawl until it fills every corner of an old website.
Then we definitely need such an app. The measures of the Wayback Machine are draconian indeed. Earlier it was possible to use wget to crawl a website, then collect only the needed URIs using grep, then rewrite them to Wayback Machine save URIs using sed, then use wget again to submit them for archiving, and do everything within a few minutes (assuming each item to archive is 15 to 30 MiB and there are not too many such items).
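For illustration, here is a minimal Python sketch of that old crawl-filter-resubmit pipeline. It assumes the Save Page Now endpoint (`https://web.archive.org/save/<url>`); the seed URL and the filter regex are placeholders, not part of the original workflow.

```python
# Rough equivalent of the old wget | grep | sed | wget pipeline:
# fetch a page, keep only the URLs we care about, and ask the
# Wayback Machine to save each one.
import re
import time
import requests

SEED = "https://example.org/downloads/"               # hypothetical starting page
WANTED = re.compile(r'href="([^"]+\.(?:zip|pdf))"')   # "grep" step: pick the needed items

html = requests.get(SEED, timeout=30).text
urls = {requests.compat.urljoin(SEED, u) for u in WANTED.findall(html)}

for url in sorted(urls):
    # "sed" step: rewrite each URL into a Save Page Now request,
    # then "wget" it to trigger archiving.
    save_url = "https://web.archive.org/save/" + url
    resp = requests.get(save_url, timeout=120)
    print(resp.status_code, url)
    time.sleep(5)  # be gentle; rapid-fire save requests may get rate-limited
```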
Hi, I had a crack at doing this and although I'm still looking for ways to improve it, I believe it already achieves the functionality @Kreijstal was looking for.
You can check it out here: Archiver.
Thank you for the idea!
@Kreijstal would you say this fulfills the requirements for the issue?
@FredrikAugust it has the idea https://github.com/apurvmishra99/archiver/issues/3
So can this be closed @Kreijstal ?
Project description
The Wayback Machine is a great resource, but sometimes it doesn't have a complete archive of a website, and it doesn't crawl all those little websites where some gems might be hidden. We can help it: crawl a website ourselves, check which links are already up to date in the Wayback Machine, ignore the ones that are, and for the ones that aren't archived or are out of date, tell the Wayback Machine to archive them (a sketch of this check-and-save step follows). This would help preserve websites that we care about.
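A minimal sketch of that check-and-save step, assuming the public Wayback availability API (`https://archive.org/wayback/available`) and the Save Page Now endpoint (`https://web.archive.org/save/<url>`); the 365-day freshness window and the URL list are arbitrary examples:

```python
# For each crawled URL, ask the Wayback Machine whether a recent snapshot
# exists; if not, submit it to Save Page Now.
from datetime import datetime, timezone
import requests

def needs_archiving(url, max_age_days=365):
    resp = requests.get("https://archive.org/wayback/available",
                        params={"url": url}, timeout=30)
    closest = resp.json().get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return True  # never archived
    snapshot_time = datetime.strptime(closest["timestamp"], "%Y%m%d%H%M%S")
    snapshot_time = snapshot_time.replace(tzinfo=timezone.utc)
    age = datetime.now(timezone.utc) - snapshot_time
    return age.days > max_age_days  # archived, but stale

def save(url):
    # Save Page Now: a plain GET on /save/<url> asks the crawler to fetch it.
    return requests.get("https://web.archive.org/save/" + url, timeout=120)

for url in ["https://example.org/page1.html"]:  # placeholder URL list
    if needs_archiving(url):
        print("submitting", url, save(url).status_code)
    else:
        print("already archived", url)
```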
Okay, so it turns out that you can, for example, wget a website recursively, which is fine, but you cannot get just the URLs; you would have to download the entire website. Maybe we can start with an application that just crawls and collects URLs (a sketch of such a crawler is below).
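One possible starting point, sketched under the assumption that fetching only HTML pages is acceptable: a tiny same-site crawler that extracts links and collects the URL list without mirroring file content. The seed URL is a placeholder.

```python
# Minimal same-site URL collector: fetch HTML pages only, extract <a href>
# links, and record every URL seen, without downloading non-HTML bodies.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
import requests

class LinkParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, max_pages=200):
    domain = urlparse(seed).netloc
    seen, queue, found = set(), deque([seed]), set()
    while queue and len(seen) < max_pages:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        resp = requests.get(url, timeout=30, stream=True)
        found.add(url)
        if "text/html" not in resp.headers.get("Content-Type", ""):
            resp.close()  # record the URL but skip downloading its body
            continue
        parser = LinkParser()
        parser.feed(resp.text)
        for link in parser.links:
            absolute = urljoin(url, link).split("#")[0]
            found.add(absolute)
            if urlparse(absolute).netloc == domain:
                queue.append(absolute)  # only follow links on the same site
    return found

if __name__ == "__main__":
    for u in sorted(crawl("https://example.org/")):  # placeholder seed
        print(u)
```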
Relevant Technology
This can be achieved with any scripting language, or with wget and curl.
Complexity and required time
Complexity
Required time (ETA)
Categories