Closed by pabs3 3 weeks ago
Ah! I was looking for other planets, and urls-sources is a great list. Thank you!
I'd love to help blogs get archived. However, I'd like to be respectful of blogs that don't wish to be archived. Is there any opt-out mechanism used by @ArchiveTeam? I believe IA uses the noarchive meta tag, but I didn't see any docs for ArchiveTeam.
I'm overthinking this for RSS/OPML files. ArchiveTeam can use various methods to discover feeds. Opt-outs are between the site and the archiver. I'll preserve noarchive on any HTML pages I generate, but otherwise I don't think I need to get involved.
I've asked the relevant ArchiveTeam folks about this and there are some mechanisms that can work:
noarchive
ones.
Personally I think item 4 is the only way for sites to really ensure they don't get published on archive.org, and item 1 is the best option for this situation.
Awesome, thanks for that feedback. I've implemented item 1. @pabs3 can you take a look to see if the new archive-focused RSS feed will work for you?
For completeness I've also added:
It looks like all the feeds I've found that use noarchive also use noindex. I don't save any feeds that use noindex, so these lists currently have the same links.
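For readers wondering what checking for noarchive and noindex could look like, here is a minimal sketch in Python. It assumes the opt-out is expressed as a robots meta tag on an HTML page; how this project actually detects the directives is not shown in the thread. The names RobotsMetaParser, robots_directives, and classify are hypothetical, and the sketch deliberately ignores related signals such as the "none" directive or HTTP headers.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None (e.g. <meta name>), so normalise them.
        attr = {key: (value or "") for key, value in attrs}
        if attr.get("name", "").strip().lower() != "robots":
            return
        # The content attribute is a comma-separated list of directives.
        for token in attr.get("content", "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def robots_directives(html_text):
    """Return the set of lowercase robots directives found in the page."""
    parser = RobotsMetaParser()
    parser.feed(html_text)
    return parser.directives


def classify(html_text):
    """Hypothetical helper: may the page be indexed, and may it be archived?"""
    directives = robots_directives(html_text)
    return {
        "index": "noindex" not in directives,
        "archive": "noarchive" not in directives,
    }
```

With a check like this, a page carrying both directives would be left out of both lists, which matches the observation that the two lists currently contain the same links.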
Looks good, thanks a lot! I've added rss-blogroll-network to my TODO list for my next batch of submissions to @ArchiveTeam urls-sources.
The @ArchiveTeam urls-sources project aims to regularly save web resources and links found in them to archive.org. I was considering adding the Blogroll Network Map to the list of blog aggregators currently being archived, but there isn't yet a page suitable for use by that project. Would it be possible for you to export an RSS/Atom feed containing all links to the recent blog posts discovered by your crawler? The urls-sources project would then regularly download that and visit new URLs found in it. The Blogroll Network Map probably should have its own blogroll OPML file too 😉 (though that wouldn't be used here)?
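To make the "download the feed and visit new URLs" idea concrete, here is a rough Python sketch of pulling post links out of an RSS 2.0 or Atom document and keeping only the ones not seen before. This is an illustration under assumptions, not how urls-sources is actually implemented: the function names extract_links and new_links are hypothetical, fetching the feed over HTTP is left out, and the seen set would need to be persisted between runs.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "{http://www.w3.org/2005/Atom}"


def extract_links(feed_xml):
    """Return post URLs from an RSS 2.0 or Atom feed, in document order."""
    root = ET.fromstring(feed_xml)
    links = []
    # RSS 2.0: <rss><channel><item><link>URL</link></item></channel></rss>
    for item in root.iter("item"):
        link = item.findtext("link")
        if link and link.strip():
            links.append(link.strip())
    # Atom: <feed><entry><link href="URL" rel="alternate"/></entry></feed>
    # A missing rel attribute is treated as "alternate".
    for entry in root.iter(ATOM_NS + "entry"):
        for link in entry.findall(ATOM_NS + "link"):
            href = link.get("href")
            if href and link.get("rel", "alternate") == "alternate":
                links.append(href)
    return links


def new_links(feed_xml, seen):
    """Return URLs not yet in `seen`, and record them in `seen`."""
    fresh = []
    for url in extract_links(feed_xml):
        if url not in seen:
            seen.add(url)
            fresh.append(url)
    return fresh
```

Calling new_links repeatedly with the same seen set returns each URL only once, so a periodic job could pass the result straight to whatever submits URLs for archiving.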