ralexander-phi / rss-blogroll-network

https://alexsci.com/rss-blogroll-network/
Apache License 2.0

Add page and feed of all recent blog posts? #8

Closed · pabs3 closed this issue 3 weeks ago

pabs3 commented 1 month ago

The @ArchiveTeam urls-sources project aims to regularly save web resources, and the links found in them, to archive.org. I was considering adding the Blogroll Network Map to the list of blog aggregators being archived, but there isn't yet a page suitable for use by that project. Would it be possible for you to export an RSS/Atom feed containing links to all the recent blog posts discovered by your crawler? The urls-sources project would then regularly download that feed and visit any new URLs found in it. The Blogroll Network Map should probably also have its own blogroll OPML file 😉 (though that wouldn't be used here).
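To illustrate what the archiving side would do with such a feed, here is a minimal sketch; the feed URL is a placeholder and feedparser is just one way to read RSS/Atom:

```python
# Sketch: pull post URLs out of a hypothetical "recent posts" feed.
# feedparser handles both RSS and Atom, so either format would work here.
import feedparser

FEED_URL = "https://alexsci.com/rss-blogroll-network/recent-posts.xml"  # placeholder

def recent_post_urls(feed_url: str) -> list[str]:
    """Return the post links found in the feed."""
    parsed = feedparser.parse(feed_url)
    return [entry.link for entry in parsed.entries if "link" in entry]

if __name__ == "__main__":
    for url in recent_post_urls(FEED_URL):
        print(url)
```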

ralexander-phi commented 1 month ago

Ah! I was looking for other planets; urls-sources is a great list. Thank you!

I'd love to help blogs get archived. However, I'd like to be respectful of blogs that don't wish to be archived. Is there any opt-out mechanism used by @ArchiveTeam? I believe the Internet Archive uses the noarchive meta tag, but I didn't see any documentation about this for ArchiveTeam.
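For reference, the opt-out signal I have in mind is the robots meta tag. A rough, stdlib-only sketch of checking a page for it might look like this (a real crawler would also consult robots.txt and the X-Robots-Tag response header):

```python
# Sketch: detect a robots meta tag containing "noarchive" on an HTML page.
from html.parser import HTMLParser
from urllib.request import urlopen

class RobotsMetaParser(HTMLParser):
    """Collect the directives from any <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives: set[str] = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        if (attrs.get("name") or "").lower() == "robots":
            content = attrs.get("content") or ""
            self.directives |= {d.strip().lower() for d in content.split(",")}

def opts_out_of_archiving(url: str) -> bool:
    html = urlopen(url).read().decode("utf-8", errors="replace")
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noarchive" in parser.directives
```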

ralexander-phi commented 4 weeks ago

I'm overthinking this for RSS/OPML files. ArchiveTeam can use various methods to discover feeds, and opt-outs are between the site and the archiver. I'll preserve noarchive on any HTML pages I generate, but otherwise I don't think I need to get involved.
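For what it's worth, preserving noarchive on generated pages could be as simple as the following sketch; the Blog record is illustrative, not the project's actual data model:

```python
# Sketch: carry a source blog's noarchive preference into a generated HTML page.
from dataclasses import dataclass

@dataclass
class Blog:
    title: str
    url: str
    noarchive: bool  # True if the source site's robots meta tag included noarchive

def render_blog_page(blog: Blog) -> str:
    robots_meta = '<meta name="robots" content="noarchive">' if blog.noarchive else ""
    return (
        "<!DOCTYPE html>\n"
        f"<html><head><title>{blog.title}</title>{robots_meta}</head>\n"
        f'<body><a href="{blog.url}">{blog.title}</a></body></html>\n'
    )
```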

pabs3 commented 4 weeks ago

I've asked the relevant ArchiveTeam folks about this and there are some mechanisms that can work:

  1. You could provide separate RSS and OPML feeds that contain only the archivable content.
  2. ArchiveTeam could skip your feeds and instead import all of the individual feeds you import, manually checking them and removing the noarchive ones.
  3. The ArchiveTeam URLs project does have a site filter list, but that is used solely for technically problematic stuff, not non-archivable stuff. It also doesn't affect other ArchiveTeam projects like ArchiveBot where individual sites and their outlinks are archived.
  4. Individual sites can ask archive.org not to republish captures of their site. Captures can happen via various ArchiveTeam activities, archive.org's Save Page Now, the archive.org crawler, Common Crawl, and other archiving groups that upload to archive.org.

Personally I think item 4 is the only way for sites to really ensure they don't get published on archive.org, and item 1 is the best option for this situation.

ralexander-phi commented 4 weeks ago

Awesome, thanks for that feedback. I've implemented item 1. @pabs3, can you take a look and see whether the new archive-focused RSS feed will work for you?

For completeness I've also added:

It looks like all the feeds I've found that use noarchive also use noindex. I don't save any feeds that use noindex, so these lists currently have the same links.
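As a rough sketch of the item 1 approach, the archive-focused feed can simply skip posts from any source whose robots directives include noindex or noarchive; the record shape below is illustrative, not the site's actual code:

```python
# Sketch: build an archive-focused RSS feed that excludes noindex/noarchive sources.
import xml.etree.ElementTree as ET

def build_archive_feed(posts: list[dict]) -> str:
    """posts: [{"title": ..., "link": ..., "robots": {"noindex", ...}}, ...]"""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = "Blogroll Network Map: recent posts (archivable)"
    ET.SubElement(channel, "link").text = "https://alexsci.com/rss-blogroll-network/"
    ET.SubElement(channel, "description").text = "Recent posts from feeds that do not opt out"
    for post in posts:
        if post.get("robots", set()) & {"noindex", "noarchive"}:
            continue  # respect the source's opt-out
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = post["title"]
        ET.SubElement(item, "link").text = post["link"]
    return ET.tostring(rss, encoding="unicode")
```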

pabs3 commented 3 weeks ago

Looks good, thanks a lot! I've added rss-blogroll-network to my TODO list for my next batch of submissions to @ArchiveTeam urls-sources.