What might modifications to ultimate-sitemap-parser look like?
continuing...
fetching/parsing is done by the sitemap() virtual method of AbstractSitemapParser, of which there are many subclasses/implementations (a minimal usage sketch follows the list):
class IndexRobotsTxtSitemapParser(AbstractSitemapParser):
class PlainTextSitemapParser(AbstractSitemapParser):
class XMLSitemapParser(AbstractSitemapParser):
class IndexXMLSitemapParser(AbstractXMLSitemapParser):
class PagesXMLSitemapParser(AbstractXMLSitemapParser):
class PagesRSSSitemapParser(AbstractXMLSitemapParser):
class PagesAtomSitemapParser(AbstractXMLSitemapParser):
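For orientation, the usual public entry point is sitemap_tree_for_homepage(), which discovers sitemaps via robots.txt and recurses through index/urlset pages using the parser classes above. A minimal sketch, with an example URL and attribute names recalled from memory rather than checked against the current code:

```python
# Minimal ultimate-sitemap-parser usage sketch (example.com is a placeholder).
from usp.tree import sitemap_tree_for_homepage

# Fetches robots.txt, finds sitemap URLs, and walks the whole tree.
tree = sitemap_tree_for_homepage("https://example.com/")

for page in tree.all_pages():      # iterates over every discovered page entry
    print(page.url, page.last_modified)
    # page.news_story (when present) carries the <news:news> metadata
```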
(Start of) Summary of issues:
There are two kinds of sitemap XML pages: one with URLs of other sitemap pages (`<sitemapindex>`), and one with URLs of content (`<urlset>`).
Google defines a news extension that adds `<news:news>` metadata to `<urlset>` entries, and they suggest only publishing that metadata for two days: https://developers.google.com/search/docs/crawling-indexing/sitemaps/news-sitemap
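To make the two formats concrete, here is a toy `<urlset>` page with the Google news extension, parsed with only the standard library; the namespaces are the standard ones, but the sample URL and publication values are invented:

```python
# Toy <urlset> page with Google news extension tags, parsed with the stdlib.
import xml.etree.ElementTree as ET

SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:news="http://www.google.com/schemas/sitemap-news/0.9">
  <url>
    <loc>https://example.com/2024/06/01/some-story.html</loc>
    <news:news>
      <news:publication>
        <news:name>Example Daily</news:name>
        <news:language>en</news:language>
      </news:publication>
      <news:publication_date>2024-06-01T12:00:00Z</news:publication_date>
      <news:title>Some Story</news:title>
    </news:news>
  </url>
</urlset>"""

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "news": "http://www.google.com/schemas/sitemap-news/0.9",
}

root = ET.fromstring(SAMPLE)
for url in root.findall("sm:url", NS):
    loc = url.findtext("sm:loc", namespaces=NS)
    pub_date = url.findtext("news:news/news:publication_date", namespaces=NS)
    print(loc, pub_date)
# A <sitemapindex> page instead contains <sitemap><loc>...</loc></sitemap> entries
# pointing at further sitemap pages.
```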
robots.txt: sitemap URLs are typically advertised there (treat robots.txt as a page to poll, or only read it when a source is (re)scraped?).

Site | country | RSS? | Sitemap? |
---|---|---|---|
reuters.com | US | no | YES |
bloomberg.com | US | no | YES + google ext. |
buzzfeed.com | US | dead | YES |
nationalpost.com | CA | no | NO |
afp.com | FR | no | NO |
itv.com | UK | no | ??? |
thetimes.com | UK | no | YES |
kbc.co.ke | KE | yes | YES |
ajc.com (atl) | US | no | YES + google |
sfchronicle.com | US | no | NO?? |
NOTES: itv.com: not working with requests?! sfchronicle.com: notes say yes, but all recent runs say no.
Thank you for this! Trying to synthesize it into a design question, the big-picture point that stands out to me is that a sitemap parser would have a few key additional affordances that don't apply in the case of the rss-fetcher:
I'm also reading that scheduling and de-duping are both going to involve a different approach than in the rss-fetcher
Is that a fair gloss?
Yes, I think you have the gist.
My first observation was that "to a zeroth order approximation" (for what little that's worth), a sitemap page (of either kind) could be viewed as an RSS page to poll. Whether this means a single feeds table or not is an open question. feeds in the rss-fetcher are kept in 1-to-1 correspondence with feeds in the web-search (mcweb) database.
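As a thought experiment only, a single table could cover both cases with a discriminator column. Every name below is invented for discussion (not taken from the rss-fetcher schema), in SQLAlchemy-style pseudocode:

```python
# Hypothetical sketch of a unified "feeds" table covering RSS feeds and
# sitemap pages; all column names here are illustrative assumptions.
from sqlalchemy import Column, DateTime, Enum, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Feed(Base):
    __tablename__ = "feeds"

    id = Column(Integer, primary_key=True)
    sources_id = Column(Integer)           # would mirror the mcweb source id
    url = Column(String, nullable=False)
    kind = Column(Enum("rss", "sitemap_index", "sitemap_urlset",
                       name="feed_kind"), default="rss")
    next_fetch_attempt = Column(DateTime)   # same style of scheduling column
```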
A "bells and whistles" implementation might be that a human using a web UI:
And that initially, the above is done manually, for a few sources, until we learn about the (currently) unknown unknowns.
And finally, on filtering: for sites that have LONG backlogs of old articles (decades), there's the question of whether we can eliminate "index" pages (feeds) for historical news we don't want to fetch, lightening our polling load, and/or whether it's possible to filter on article URL alone (which requires that the date appear in the article URL).
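A toy sketch of the URL-only filtering idea; the /YYYY/MM/ path pattern and the cutoff are assumptions, and many sites put no date in the URL at all:

```python
# Toy filter: keep only article URLs whose path embeds a recent /YYYY/MM/ date.
import re
from datetime import date

DATE_IN_PATH = re.compile(r"/(?P<y>20\d{2})/(?P<m>0[1-9]|1[0-2])/")

def recent_enough(url: str, cutoff: date) -> bool:
    m = DATE_IN_PATH.search(url)
    if not m:
        return True     # no date in the URL: can't decide from the URL alone
    return date(int(m["y"]), int(m["m"]), 1) >= cutoff

print(recent_enough("https://example.com/2003/07/old-story.html", date(2023, 1, 1)))  # False
```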
I think the rss-fetcher polling infrastructure is adaptable, though again, whether sitemap fetching and rss fetching should be in a single process (in case a site has both, and we want to be kind across page types) or not is an open question...
Code-wise, sitemap parsing is simple/small, and the scheduling infrastructure needs are very similar (with different parameters, like min and max interval).
Google news tags are (hopefully) a good indicator of links that should be considered news (and might not even require human intervention to vet the links), BUT if a site follows the best practices above, the pages will be more volatile (tags kept for a limited time only) and will require faster polling than other sitemap pages.
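To illustrate the "different parameters" point, scheduling could key min/max poll intervals off the page kind, with news-tagged urlset pages polled fastest. The numbers below are placeholders, not a proposal:

```python
# Placeholder min/max poll intervals per page kind (in minutes); all invented.
POLL_MINUTES = {
    "rss": (30, 24 * 60),                      # existing rss-fetcher-style range
    "sitemap_index": (6 * 60, 7 * 24 * 60),    # index pages change slowly
    "sitemap_urlset": (60, 24 * 60),
    "sitemap_news": (15, 2 * 60),              # <news:news> entries may expire in ~2 days
}

def poll_range(kind: str) -> tuple[int, int]:
    """Return (min, max) polling interval for a feed kind, defaulting to rss."""
    return POLL_MINUTES.get(kind, POLL_MINUTES["rss"])
```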
And finally, yes, the "low end" dedup we currently perform is likely to behave badly for a site that has sitemap pages for historical articles that are not COMPLETELY static (i.e., that change header data, so their page checksum changes).
A "higher end" dedup database could be a MongoDB cluster co-resident on the ES servers. The minimum information would be an index (by initial URL) of trivial (size zero?) objects.
A middle path (now that we have ES running) would be another set of ES indices (managed similarly to the "search" indices) keyed by initial URL (with no searchable/indexed fields).
An enhanced version of either of the above would be to keep some status/timestamp information for each initial URL: instead of dropping stories in the pipeline, we could queue them to a worker that updates the status (and "fork" a copy after an article is indexed). The ability to send a pipeline worker's output to multiple input queues is already built into the system....
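A rough sketch of the ES "middle path": rely on the document _id (a hash of the normalized initial URL) plus create-only semantics so checking and recording a URL is atomic. The index name, endpoint, and status field are all assumptions, and this assumes the 8.x-style elasticsearch Python client:

```python
# Sketch of an ES dedup index keyed by (hashed) initial URL, nothing searchable.
from hashlib import sha256

from elasticsearch import Elasticsearch
from elasticsearch.exceptions import ConflictError

es = Elasticsearch("http://localhost:9200")   # placeholder endpoint
DEDUP_INDEX = "mc-dedup-urls"                 # hypothetical index name

# "enabled": False keeps _source but creates no searchable/indexed fields,
# matching the "no searchable fields" idea above.
if not es.indices.exists(index=DEDUP_INDEX):
    es.indices.create(index=DEDUP_INDEX, mappings={"enabled": False})

def first_sighting(initial_url: str) -> bool:
    """Record the URL; return True only the first time it is seen."""
    doc_id = sha256(initial_url.encode("utf-8")).hexdigest()
    try:
        # create() fails with a 409 conflict if the id already exists.
        es.create(index=DEDUP_INDEX, id=doc_id,
                  document={"url": initial_url, "status": "queued"})
        return True
    except ConflictError:
        return False
```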
Some past and recent thoughts about URL retention for dedup are in https://github.com/mediacloud/rss-fetcher/issues/25
Some questions:
Many sites have a single Google news tag-enhanced (urlset) sitemap page, which is functionally equivalent to an RSS feed:
What I've done:
The most likely next step would be:
This issue is for discussion/documentation of crawling sitemaps for story URL discovery.
I'm creating it in the rss-fetcher repo, since I think:
Past work on sitemap parsing exists in https://github.com/mediacloud/ultimate-sitemap-parser (so the issue could be there), but my impression from attempting to use it is that it:
Or the issue could be in story-indexer (since it's the big/active repo).