mediacloud / rss-fetcher

Intelligently fetch lists of URLs from a large collection of RSS Feeds as part of the Media Cloud Directory.
https://search.mediacloud.org/directory
Apache License 2.0

Investigate crawling sitemaps #41

Closed philbudne closed 3 months ago

philbudne commented 6 months ago

This issue is for discussion/documentation of crawling site maps for story URL discovery.

I'm creating it in the rss-fetcher repo, since I think:

  1. We will likely want the result to feed into the daily synthetic RSS file
  2. The additions to the existing rss-fetcher infrastructure are likely to be small, and would benefit from the existing rss-fetcher infrastructure

Past work on sitemap parsing exists in https://github.com/mediacloud/ultimate-sitemap-parser (so the issue could live there), but my experience from attempting to use it is that it:

  1. always fetches the ENTIRE tree in one go (which can take a VERY long time), as opposed to incrementally fetching pages and building a database of sitemap page URLs that we would end up polling at different rates
  2. is oriented towards building an abstract tree of objects representing the site, and is (at least initially) hard to understand, so I have questions about how good a starting place it is for this work (although it CERTAINLY contains bits of code that can be reused).

Or the issue could be in story-indexer (since it's the big/active repo).

philbudne commented 6 months ago

What might modifications to ultimate-sitemap-parser look like?

  1. add an optional argument: either a simple boolean to prevent ANY recursion, or an optional integer limiting recursion depth.
  2. when the recursion limit is reached, instead of fetching/parsing the referenced document, create a "stub" object representing the unfetched page (a standalone sketch of the idea follows)?
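
Not a patch to usp, just a minimal standalone sketch of the two ideas above: a depth limit plus "stub" objects for pages that were not fetched. Every name here (SitemapStub, parse_sitemap, etc.) is invented for illustration and does not exist in usp or rss-fetcher.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import List, Optional, Union

import requests


@dataclass
class SitemapStub:
    """Placeholder for a referenced sitemap page that was NOT fetched."""
    url: str


@dataclass
class SitemapPage:
    url: str
    sub_sitemaps: List[Union["SitemapPage", SitemapStub]] = field(default_factory=list)
    story_urls: List[str] = field(default_factory=list)


def _child_urls(url: str):
    """Yield (child_url, is_sub_sitemap) pairs from one sitemap page."""
    root = ET.fromstring(requests.get(url, timeout=30).content)
    is_index = root.tag.endswith("sitemapindex")   # vs. "urlset"
    for el in root.iter():
        if el.tag.endswith("loc") and el.text:
            yield el.text.strip(), is_index


def parse_sitemap(url: str, max_depth: Optional[int] = None,
                  _depth: int = 0) -> Union[SitemapPage, SitemapStub]:
    if max_depth is not None and _depth >= max_depth:
        return SitemapStub(url)        # record the URL, but don't fetch or recurse
    page = SitemapPage(url)
    for child_url, is_sub in _child_urls(url):
        if is_sub:
            page.sub_sitemaps.append(parse_sitemap(child_url, max_depth, _depth + 1))
        else:
            page.story_urls.append(child_url)
    return page
```

Called as parse_sitemap(url, max_depth=1), this returns the top page with stubs for everything below it, which is the incremental behavior item 1 is after.
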
philbudne commented 6 months ago

continuing...

Fetching/parsing is done by the sitemap() virtual method of AbstractSitemapParser, of which there are many subclasses/implementations:

```python
class IndexRobotsTxtSitemapParser(AbstractSitemapParser):
class PlainTextSitemapParser(AbstractSitemapParser):
class XMLSitemapParser(AbstractSitemapParser):
class IndexXMLSitemapParser(AbstractXMLSitemapParser):
class PagesXMLSitemapParser(AbstractXMLSitemapParser):
class PagesRSSSitemapParser(AbstractXMLSitemapParser):
class PagesAtomSitemapParser(AbstractXMLSitemapParser):
```

philbudne commented 5 months ago

(Start of) Summary of issues:

  1. Sitemap pages come in two flavors: <sitemapindex> pages, containing URLs of other sitemap pages, and <urlset> pages, containing URLs of content
  2. Sitemaps can be big, really big (one page per day going back 10 years)
  3. It would be a pain to poll all of them with any frequency (it can take hours to fully traverse a site)
  4. Many pages are likely to be static (or at least not yield any new links)
  5. This increases the need/desire to keep ALL referenced URLs, to avoid fetching dups if an index page changes
  6. Index page URLs are analogous to RSS feed URLs: the less often they change, the less often we need to poll them, and because sitemap pages never roll data off, MUCH longer poll periods are practical.
  7. Sitemaps are more likely to include EVERYTHING on the site than RSS feeds, including pages that are not "news"
  8. Google defines additional tags for news (<news:news>), and suggests publishing that metadata for only two days: https://developers.google.com/search/docs/crawling-indexing/sitemaps/news-sitemap (a minimal extraction sketch follows this list)
  9. "root" sitemap page URLs can be published in robots.txt (treat robots as a page to poll, or only read when a source is (re)scraped?)
philbudne commented 5 months ago
| Site | Ctry | RSS? | Sitemap? |
|------|------|------|----------|
| reuters.com | US | no | YES |
| bloomberg.com | US | no | YES + google ext. |
| buzzfeed.com | US | dead | YES |
| nationalpost.com | CA | no | NO |
| afp.com | FR | no | NO |
| itv.com | UK | no | ??? |
| thetimes.com | UK | no | YES |
| kbc.co.ke | KE | yes | YES |
| ajc.com (atl) | US | no | YES + google |
| sfchronicle.com | US | no | NO?? |

NOTES: itv.com: not working with requests?! sfchronicle.com: notes say yes, but all recent runs say no.

pgulley commented 5 months ago

Thank you for this! Trying to synthesize this into a design question, the big-picture thing that stands out to me is that a sitemap parser would need a few additional affordances that don't apply in the case of the rss-fetcher:

  1. Each individual domain might require some additional bespoke work to better filter what we actually pass to the story-indexer. Some, with the google extension, are easier than others.
  2. We can only pay attention to a very limited number of sitemap domains at any given time, and we might want to change or re-prioritize that list based on research needs, so we would need some user-facing piece to facilitate that

I'm also reading that scheduling and de-duping are both going to involve a different approach than in the rss-fetcher.

Is that a fair gloss?

philbudne commented 5 months ago

Yes, I think you have the gist.

My first observation was that, "to a zeroth-order approximation" (for what little that's worth), a sitemap page (of either kind) could be viewed as an RSS page to poll. Whether this means a single feeds table or not is an open question; feeds in the rss-fetcher are kept in 1-to-1 correspondence with feeds in the web-search (mcweb) database.

A "bells and whistles" implementation might be that a human using a web UI:

  1. Scrapes a site: sitemaps added as "provisional", stories collected, but not sent to story-indexer
  2. Human views stories, adds filters (to a table; a rough sketch follows below), until happy
  3. Human marks source as live

And initially, the above would be done manually, for a few sources, until we learn about the (currently) unknown unknowns.
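
Purely for illustration, a rough sketch of what the filter table in step 2 could look like as a SQLAlchemy model; the table and column names are invented here and nothing like this exists in rss-fetcher today.

```python
from sqlalchemy import Boolean, Column, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class SourceUrlFilter(Base):
    """Per-source URL filter edited by a human while a source is 'provisional'."""
    __tablename__ = "source_url_filters"           # hypothetical table name

    id = Column(Integer, primary_key=True)
    sources_id = Column(Integer, nullable=False)   # mcweb source this applies to
    pattern = Column(String, nullable=False)       # regex matched against story URLs
    include = Column(Boolean, default=False)       # keep matches, or drop them
    enabled = Column(Boolean, default=True)
```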

And finally, for sites that have LONG backlogs of old articles (decades), there's the question of whether we can eliminate "index" pages (feeds) for historical news we don't want to fetch, lightening our polling load, and/or whether it's possible to filter on article URL alone (which requires that the date appear in the article URL).
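
A tiny sketch of what filtering on the article URL alone could look like when the publication date appears in the path; the regex is only a guess and would need per-site tuning.

```python
import re
from datetime import date
from typing import Optional

# Illustrative pattern: matches paths like /2024/05/17/some-story.html
URL_DATE = re.compile(r"/(20\d{2})/(\d{1,2})/(\d{1,2})/")


def url_date(url: str) -> Optional[date]:
    m = URL_DATE.search(url)
    if not m:
        return None
    try:
        return date(int(m.group(1)), int(m.group(2)), int(m.group(3)))
    except ValueError:          # e.g. /2024/99/99/ is not a real date
        return None


def recent_enough(url: str, cutoff: date) -> bool:
    """True only when a date is present in the URL and is at or after the cutoff."""
    d = url_date(url)
    return d is not None and d >= cutoff
```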

I think the rss-fetcher polling infrastructure is adaptable, though again, whether sitemap fetching and rss fetching should be in a single process (in case a site has both, and we want to be kind across page types) or not is open...

Code-wise, sitemap parsing is simple/small, and the scheduling infrastructure needs are very similar (with different parameters, like min and max interval).

Google news tags are (hopefully) a good indicator of links that should be considered news (and might not even require human intervention to vet the links), BUT if a site follows the best practices above, those pages would be more volatile (tags kept for a limited time only) and would require faster polling than other sitemap pages.

And finally, yes, the "low end" dedup we currently perform is likely to behave badly for a site with sitemap pages for historical articles that are not COMPLETELY static (i.e., header data changes, so the page checksum changes).

A "higher end" dedup database could be a MongoDB cluster co-resident on the ES servers. The minimum information would be an index (by initial URL) of trivial (size zero?) objects.

A middle path (now that we have ES running) would be another set of ES indices (managed similarly to the "search" indices) keyed by initial URL (with no searchable/indexed fields).
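
As a sketch only (assuming the elasticsearch-py client with 8.x-style calls), the "middle path" might look roughly like this; the index name, hashing, and URL normalization are placeholders, not anything that exists today.

```python
import hashlib

from elasticsearch import ConflictError, Elasticsearch

es = Elasticsearch("http://localhost:9200")   # placeholder endpoint
DEDUP_INDEX = "mc_url_dedup"                  # hypothetical index name


def ensure_index() -> None:
    if not es.indices.exists(index=DEDUP_INDEX):
        # dynamic=False with no properties: documents carry no searchable fields;
        # the _id (a hash of the initial URL) is the only thing ever looked up.
        es.indices.create(index=DEDUP_INDEX,
                          mappings={"dynamic": False, "properties": {}})


def url_id(url: str) -> str:
    # real code would normalize the URL before hashing
    return hashlib.sha256(url.encode("utf-8")).hexdigest()


def seen_before(url: str) -> bool:
    """Record the URL; return True if it was already present."""
    try:
        # op_type="create" fails with a conflict if the id already exists,
        # giving an atomic "insert if absent" check.
        es.index(index=DEDUP_INDEX, id=url_id(url), document={}, op_type="create")
        return False
    except ConflictError:
        return True
```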

An enhanced version of either of the above would be to keep some status/timestamp information for each initial URL: instead of dropping stories in the pipeline, we could queue them to a worker that updates the status (and "fork" a copy after an article is indexed). The ability to send a pipeline worker's output to multiple input queues is already built into the system....

Some past and recent thoughts about URL retention for dedup are in https://github.com/mediacloud/rss-fetcher/issues/25

philbudne commented 5 months ago

Some questions:

philbudne commented 3 months ago

Many sites have a single Google-News-tag-enhanced (urlset) sitemap page, which is functionally equivalent to an RSS feed:

  1. contains only recent stories
  2. has title and publish date
  3. page is often referenced in robots.txt, or found at one of a few well-known paths (i.e., discoverable without performing a full site crawl; see the sketch after this list)
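
A hedged sketch of the discovery in item 3 (robots.txt plus a few well-known paths). The real implementation is mc_sitemap_tools/discover.py, mentioned under "What I've done" below, and this does not reproduce it; the candidate path list is a guess.

```python
import requests

COMMON_PATHS = ["/sitemap.xml", "/sitemap_index.xml", "/news-sitemap.xml"]  # illustrative guesses


def candidate_sitemaps(homepage: str) -> list[str]:
    """Return sitemap URLs advertised in robots.txt, plus a few common guesses."""
    homepage = homepage.rstrip("/")
    candidates = []
    try:
        robots = requests.get(homepage + "/robots.txt", timeout=30).text
        candidates += [line.split(":", 1)[1].strip()
                       for line in robots.splitlines()
                       if line.lower().startswith("sitemap:")]
    except requests.RequestException:
        pass
    candidates += [homepage + path for path in COMMON_PATHS]
    return candidates
```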

What I've done:

  1. extracted a small(ish) standalone parser from usp as https://github.com/mediacloud/sitemap-tools/blob/main/mc_sitemap_tools/parser.py
  2. the above parser is now tried in the fetcher if feedparser fails; the code is in production, and a few page URLs (incl. reuters) were hand-added (a sketch of the fallback follows this list)
  3. Wrote https://github.com/mediacloud/sitemap-tools/blob/main/mc_sitemap_tools/discover.py
  4. Wrote enhancements to the web-search scrape code to use the above.
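
A hedged sketch of the fallback in item 2: try feedparser first, and only hand the same bytes to a sitemap parser when feedparser finds no entries. parse_sitemap_urlset below is a stand-in; the real entry point lives in mc_sitemap_tools/parser.py, and the actual rss-fetcher wiring may differ.

```python
import xml.etree.ElementTree as ET

import feedparser
import requests


def parse_sitemap_urlset(raw: bytes) -> list[str]:
    """Stand-in for the real mc_sitemap_tools parser: collect <loc> URLs."""
    root = ET.fromstring(raw)
    return [el.text.strip() for el in root.iter()
            if el.tag.endswith("loc") and el.text]


def fetch_story_urls(feed_url: str) -> list[str]:
    raw = requests.get(feed_url, timeout=30).content
    parsed = feedparser.parse(raw)
    if parsed.entries:                        # feedparser made sense of it: RSS/Atom
        return [e.link for e in parsed.entries if "link" in e]
    return parse_sitemap_urlset(raw)          # otherwise fall back to sitemap parsing
```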

The most likely next steps would be:

  1. Adding full-site crawl to look for gnews urlsets to sitemap-tools
  2. Initially making it available via a jupyter notebook