mediacloud / rss-fetcher

Intelligently fetch lists of URLs from a large collection of RSS Feeds as part of the Media Cloud Directory.
https://search.mediacloud.org/directory
Apache License 2.0

Investigate crawling sitemaps #41

Closed philbudne closed 3 months ago

philbudne commented 6 months ago

This issue is for discussion/documentation of crawling site maps for story URL discovery.

I'm creating it in the rss-fetcher repo, since I think:

  1. We will likely want the result to feed into the daily synthetic RSS file
  2. The additions to the existing rss-fetcher infrastructure are likely to be small, and would benefit from the existing rss-fetcher infrastructure

Past work on sitemap parsing exists in https://github.com/mediacloud/ultimate-sitemap-parser (so the issue could live there), but my experience from attempting to use it is that it:

  1. always fetches the ENTIRE tree in one go (which can take a VERY long time), as opposed to incrementally fetching pages and building a database of sitemap page URLs that we would end up polling at different rates
  2. is oriented towards building an abstract tree of objects representing the site, and is (at least initially) hard to understand, so I have questions about how good a starting place it is for this work (although it CERTAINLY contains bits of code that can be reused).

Or the issue could be in story-indexer (since it's the big/active repo).

philbudne commented 6 months ago

What might modifications to ultimate-sitemap-parser look like?

  1. add an optional argument: either a simple boolean to prevent ANY recursion, or an optional integer limiting recursion depth.
  2. when the recursion limit is reached, instead of fetching/parsing the referenced document, create a "stub" object representing the unfetched page (a standalone sketch of the idea follows)?
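
Not a patch to usp, just a minimal standalone sketch of the two ideas above: a depth limit plus "stub" objects for pages that were not fetched. Every name here (SitemapStub, parse_sitemap, etc.) is invented for illustration and does not exist in usp or rss-fetcher.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import List, Optional, Union

import requests


@dataclass
class SitemapStub:
    """Placeholder for a referenced sitemap page that was NOT fetched."""
    url: str


@dataclass
class SitemapPage:
    url: str
    sub_sitemaps: List[Union["SitemapPage", SitemapStub]] = field(default_factory=list)
    story_urls: List[str] = field(default_factory=list)


def _child_urls(url: str):
    """Yield (child_url, is_sub_sitemap) pairs from one sitemap page."""
    root = ET.fromstring(requests.get(url, timeout=30).content)
    is_index = root.tag.endswith("sitemapindex")   # vs. "urlset"
    for el in root.iter():
        if el.tag.endswith("loc") and el.text:
            yield el.text.strip(), is_index


def parse_sitemap(url: str, max_depth: Optional[int] = None,
                  _depth: int = 0) -> Union[SitemapPage, SitemapStub]:
    if max_depth is not None and _depth >= max_depth:
        return SitemapStub(url)        # record the URL, but don't fetch or recurse
    page = SitemapPage(url)
    for child_url, is_sub in _child_urls(url):
        if is_sub:
            page.sub_sitemaps.append(parse_sitemap(child_url, max_depth, _depth + 1))
        else:
            page.story_urls.append(child_url)
    return page
```

Called as parse_sitemap(url, max_depth=1), this returns the top page with stubs for everything below it, which is the incremental behavior item 1 is after.
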
philbudne commented 6 months ago

continuing...

Fetching/parsing is done by the sitemap() virtual method of AbstractSitemapParser, of which there are many subclasses/implementations:

```python
class IndexRobotsTxtSitemapParser(AbstractSitemapParser):
class PlainTextSitemapParser(AbstractSitemapParser):
class XMLSitemapParser(AbstractSitemapParser):
class IndexXMLSitemapParser(AbstractXMLSitemapParser):
class PagesXMLSitemapParser(AbstractXMLSitemapParser):
class PagesRSSSitemapParser(AbstractXMLSitemapParser):
class PagesAtomSitemapParser(AbstractXMLSitemapParser):
```

philbudne commented 5 months ago

(Start of) Summary of issues:

  1. Sitemap pages come in two flavors: <sitemapindex> pages, containing URLs of other sitemap pages, and <urlset> pages, containing URLs of content
  2. Sitemaps can be big, really big (one page per day going back 10 years)
  3. It would be a pain to poll all of them with any frequency (it can take hours to fully traverse a site)
  4. Many pages are likely to be static (or at least not yield any new links)
  5. This increases the need/desire to keep ALL referenced URLs, to avoid fetching dups if an index page changes
  6. Index page URLs are analogous to RSS feed URLs: the less often they change, the less often we need to poll them, and because sitemap pages never roll data off, MUCH longer poll periods are practical.
  7. Sitemaps are more likely to include EVERYTHING on the site than RSS feeds, including pages that are not "news"
  8. Google defines additional tags for news (<news:news>), and suggests publishing that metadata for only two days: https://developers.google.com/search/docs/crawling-indexing/sitemaps/news-sitemap (a minimal extraction sketch follows this list)
  9. "root" sitemap page URLs can be published in robots.txt (treat robots as a page to poll, or only read when a source is (re)scraped?)
philbudne commented 5 months ago
| Site | Ctry | RSS? | Sitemap? |
|------|------|------|----------|
| reuters.com | US | no | YES |
| bloomberg.com | US | no | YES + google ext. |
| buzzfeed.com | US | dead | YES |
| nationalpost.com | CA | no | NO |
| afp.com | FR | no | NO |
| itv.com | UK | no | ??? |
| thetimes.com | UK | no | YES |
| kbc.co.ke | KE | yes | YES |
| ajc.com (atl) | US | no | YES + google |
| sfchronicle.com | US | no | NO?? |

NOTES: itv.com: not working with requests?! sfchronicle.com: notes say yes, but all recent runs say no.

pgulley commented 5 months ago

Thank you for this! Trying to synthesize this into a design question, the big-picture thing that stands out to me is that a sitemap parser would need a few additional affordances that don't apply in the case of the rss-fetcher:

  1. Each individual domain might require some additional bespoke work to better filter what we actually pass to the story-indexer. Some, with the google extension, are easier than others.
  2. We can only pay attention to a very limited number of sitemap domains at any given time, and we might want to change or re-prioritize that list based on research needs, so we would need some user-facing piece to facilitate that

I'm also reading that scheduling and de-duping are both going to involve a different approach than in the rss-fetcher.

Is that a fair gloss?

philbudne commented 5 months ago

Yes, I think you have the gist.

My first observation was that, "to a zeroth-order approximation" (for what little that's worth), a sitemap page (of either kind) could be viewed as an RSS page to poll. Whether this means a single feeds table or not is an open question; feeds in the rss-fetcher are kept in 1-to-1 correspondence with feeds in the web-search (mcweb) database.

A "bells and whistles" implementation might be that a human using a web UI:

  1. Scrapes a site: sitemaps added as "provisional", stories collected, but not sent to story-indexer
  2. Human views stories, adds filters (to a table; a rough sketch follows below), until happy
  3. Human marks source as live

And initially, the above would be done manually, for a few sources, until we learn about the (currently) unknown unknowns.
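
Purely for illustration, a rough sketch of what the filter table in step 2 could look like as a SQLAlchemy model; the table and column names are invented here and nothing like this exists in rss-fetcher today.

```python
from sqlalchemy import Boolean, Column, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class SourceUrlFilter(Base):
    """Per-source URL filter edited by a human while a source is 'provisional'."""
    __tablename__ = "source_url_filters"           # hypothetical table name

    id = Column(Integer, primary_key=True)
    sources_id = Column(Integer, nullable=False)   # mcweb source this applies to
    pattern = Column(String, nullable=False)       # regex matched against story URLs
    include = Column(Boolean, default=False)       # keep matches, or drop them
    enabled = Column(Boolean, default=True)
```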

And finally, for sites that have LONG backlogs of old articles (decades), there's the question of whether we can eliminate "index" pages (feeds) for historical news we don't want to fetch, lightening our polling load, and/or whether it's possible to filter on article URL alone (which requires that the date appear in the article URL).
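
A tiny sketch of what filtering on the article URL alone could look like when the publication date appears in the path; the regex is only a guess and would need per-site tuning.

```python
import re
from datetime import date
from typing import Optional

# Illustrative pattern: matches paths like /2024/05/17/some-story.html
URL_DATE = re.compile(r"/(20\d{2})/(\d{1,2})/(\d{1,2})/")


def url_date(url: str) -> Optional[date]:
    m = URL_DATE.search(url)
    if not m:
        return None
    try:
        return date(int(m.group(1)), int(m.group(2)), int(m.group(3)))
    except ValueError:          # e.g. /2024/99/99/ is not a real date
        return None


def recent_enough(url: str, cutoff: date) -> bool:
    """True only when a date is present in the URL and is at or after the cutoff."""
    d = url_date(url)
    return d is not None and d >= cutoff
```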

I think the rss-fetcher polling infrastructure is adaptable, though again, whether sitemap fetching and rss fetching should be in a single process (in case a site has both, and we want to be kind across page types) or not is open...

Code-wise, sitemap parsing is simple/small, and the scheduling infrastructure needs are very similar (with different parameters, like min and max interval).

Google news tags are (hopefully) a good indicator of links that should be considered news (and might not even require human intervention to vet the links), BUT if a site follows the best practices above, those pages would be more volatile (tags kept for a limited time only) and would require faster polling than other sitemap pages.

And finally, yes, the "low end" dedup we currently perform is likely to behave badly for a site with sitemap pages for historical articles that are not COMPLETELY static (i.e., header data changes, so the page checksum changes).

A "higher end" dedup database could be a MongoDB cluster co-resident on the ES servers. The minimum information would be an index (by initial URL) of trivial (size zero?) objects.

A middle path (now that we have ES running) would be another set of ES indices (managed similarly to the "search" indices) keyed by initial URL (with no searchable/indexed fields).
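
As a sketch only (assuming the elasticsearch-py client with 8.x-style calls), the "middle path" might look roughly like this; the index name, hashing, and URL normalization are placeholders, not anything that exists today.

```python
import hashlib

from elasticsearch import ConflictError, Elasticsearch

es = Elasticsearch("http://localhost:9200")   # placeholder endpoint
DEDUP_INDEX = "mc_url_dedup"                  # hypothetical index name


def ensure_index() -> None:
    if not es.indices.exists(index=DEDUP_INDEX):
        # dynamic=False with no properties: documents carry no searchable fields;
        # the _id (a hash of the initial URL) is the only thing ever looked up.
        es.indices.create(index=DEDUP_INDEX,
                          mappings={"dynamic": False, "properties": {}})


def url_id(url: str) -> str:
    # real code would normalize the URL before hashing
    return hashlib.sha256(url.encode("utf-8")).hexdigest()


def seen_before(url: str) -> bool:
    """Record the URL; return True if it was already present."""
    try:
        # op_type="create" fails with a conflict if the id already exists,
        # giving an atomic "insert if absent" check.
        es.index(index=DEDUP_INDEX, id=url_id(url), document={}, op_type="create")
        return False
    except ConflictError:
        return True
```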

An enhanced version of either of the above would be to keep some status/timestamp information for each initial URL: instead of dropping stories in the pipeline, we could queue them to a worker that updates the status (and "fork" a copy after an article is indexed). The ability to send a pipeline worker's output to multiple input queues is already built into the system....

Some past and recent thoughts about URL retention for dedup are in https://github.com/mediacloud/rss-fetcher/issues/25

philbudne commented 5 months ago

Some questions:

philbudne commented 3 months ago

Many sites have a single Google-News-tag-enhanced (urlset) sitemap page, which is functionally equivalent to an RSS feed:

  1. contains only recent stories
  2. has title and publish date
  3. page is often referenced in robots.txt, or found at one of a few well-known paths (i.e., discoverable without performing a full site crawl; see the sketch after this list)
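
A hedged sketch of the discovery in item 3 (robots.txt plus a few well-known paths). The real implementation is mc_sitemap_tools/discover.py, mentioned under "What I've done" below, and this does not reproduce it; the candidate path list is a guess.

```python
import requests

COMMON_PATHS = ["/sitemap.xml", "/sitemap_index.xml", "/news-sitemap.xml"]  # illustrative guesses


def candidate_sitemaps(homepage: str) -> list[str]:
    """Return sitemap URLs advertised in robots.txt, plus a few common guesses."""
    homepage = homepage.rstrip("/")
    candidates = []
    try:
        robots = requests.get(homepage + "/robots.txt", timeout=30).text
        candidates += [line.split(":", 1)[1].strip()
                       for line in robots.splitlines()
                       if line.lower().startswith("sitemap:")]
    except requests.RequestException:
        pass
    candidates += [homepage + path for path in COMMON_PATHS]
    return candidates
```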

What I've done:

  1. extracted a small(ish) standalone parser from usp as https://github.com/mediacloud/sitemap-tools/blob/main/mc_sitemap_tools/parser.py
  2. the above parser is now tried in the fetcher if feedparser fails; the code is in production, and a few page URLs (incl. reuters) were hand-added (a sketch of the fallback follows this list)
  3. Wrote https://github.com/mediacloud/sitemap-tools/blob/main/mc_sitemap_tools/discover.py
  4. Wrote enhancements to the web-search scrape code to use the above.
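
A hedged sketch of the fallback in item 2: try feedparser first, and only hand the same bytes to a sitemap parser when feedparser finds no entries. parse_sitemap_urlset below is a stand-in; the real entry point lives in mc_sitemap_tools/parser.py, and the actual rss-fetcher wiring may differ.

```python
import xml.etree.ElementTree as ET

import feedparser
import requests


def parse_sitemap_urlset(raw: bytes) -> list[str]:
    """Stand-in for the real mc_sitemap_tools parser: collect <loc> URLs."""
    root = ET.fromstring(raw)
    return [el.text.strip() for el in root.iter()
            if el.tag.endswith("loc") and el.text]


def fetch_story_urls(feed_url: str) -> list[str]:
    raw = requests.get(feed_url, timeout=30).content
    parsed = feedparser.parse(raw)
    if parsed.entries:                        # feedparser made sense of it: RSS/Atom
        return [e.link for e in parsed.entries if "link" in e]
    return parse_sitemap_urlset(raw)          # otherwise fall back to sitemap parsing
```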

The most likely next steps would be:

  1. Adding full-site crawl to look for gnews urlsets to sitemap-tools
  2. Initially making it available via a jupyter notebook