ukwa / ukwa-heritrix

The UKWA Heritrix3 custom modules and Docker builder.

How to ensure sitemaps and multi-level sitemaps get refreshed? #42

Closed: anjackson closed this issue 5 years ago

anjackson commented 5 years ago

We can currently recrawl seeds or whole hosts. This means that sitemaps need to be added manually as seeds, and even then, sub-sitemaps won't get picked up.

Ideally, a refresh would reach a couple of hops deep, or perhaps sitemaps could be made a special case where the launch date is always inherited.

anjackson commented 5 years ago

For example, GOV.UK has a sitemapindex of sitemaps. Ideally, we want to re-download the sitemapindex and any sub-sitemaps every time we 'refresh' a host. That is, allow child URIs to pick up the parent's launch timestamp if they are in the discovery chain: Seed -> Robots.txt -> Sitemapindex|Sitemap -> Sitemap.
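
In CrawlURI terms, that inheritance might look roughly like this; a minimal sketch, assuming the launch timestamp lives in the CrawlURI data map under a key like `launchTimestamp` and that sitemap URIs carry an `isSitemapUri` annotation (both names are illustrative, not the repository's actual constants):

```java
import org.archive.modules.CrawlURI;

/**
 * Sketch of launch-timestamp inheritance along the sitemap discovery chain
 * (Seed -> robots.txt -> sitemapindex -> sitemap). Key and annotation names
 * are assumptions taken from this discussion, not the repo's real constants.
 */
public class LaunchTimestampInheritance {

    static final String LAUNCH_TS_KEY = "launchTimestamp";   // assumed data-map key
    static final String SITEMAP_ANNOTATION = "isSitemapUri"; // assumed annotation name

    /**
     * Copy the parent's launch timestamp onto a newly discovered child URI,
     * but only when the child has been flagged as part of the sitemap chain,
     * so ordinary outlinks are not forced into a re-crawl.
     */
    static void inheritLaunchTimestamp(CrawlURI parent, CrawlURI child) {
        Object parentTs = parent.getData().get(LAUNCH_TS_KEY);
        if (parentTs != null && child.getAnnotations().contains(SITEMAP_ANNOTATION)) {
            child.getData().put(LAUNCH_TS_KEY, parentTs);
        }
    }
}
```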

So, one option would be to:

  1. Make RobotsTxtSitemapExtractor annotate the extracted URIs as isSitemapUri, and also set a launch timestamp of "now" to make sure the URI is fetched.
  2. Create a variant XML extractor that sets a launch timestamp of "now" on links that appear to point to further sitemaps (see the sketch after this list).
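
For option 2, the stamping step could be as simple as the sketch below; the "looks like a sitemap" heuristic and the key/annotation names are purely illustrative:

```java
import org.archive.modules.CrawlURI;
import org.archive.util.ArchiveUtils;

/**
 * Sketch of a "variant XML extractor" helper: when an extracted outlink
 * looks like another sitemap, annotate it and stamp it with a fresh launch
 * timestamp so the frontier will fetch it even if it was seen recently.
 */
public class SitemapLinkStamper {

    static final String LAUNCH_TS_KEY = "launchTimestamp";   // assumed data-map key
    static final String SITEMAP_ANNOTATION = "isSitemapUri"; // assumed annotation name

    /** Crude heuristic: does this outlink look like a (sub-)sitemap? */
    static boolean looksLikeSitemap(CrawlURI outlink) {
        String uri = outlink.getUURI().toString().toLowerCase();
        return uri.contains("sitemap") && (uri.endsWith(".xml") || uri.endsWith(".xml.gz"));
    }

    /** Mark the outlink and treat it as launched "now" so it gets re-fetched. */
    static void stampIfSitemap(CrawlURI outlink) {
        if (looksLikeSitemap(outlink)) {
            outlink.getAnnotations().add(SITEMAP_ANNOTATION);
            outlink.getData().put(LAUNCH_TS_KEY, ArchiveUtils.get14DigitDate());
        }
    }
}
```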

Note that sitemap entries carry timestamps that fit this purpose precisely! I.e. with an intelligent sitemap parser we can issue the right re-crawl requests as we go along.
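
A sketch of what that could look like, assuming the crawler-commons sitemap model (where SiteMapURL exposes the entry's last-modified date) and the same assumed launchTimestamp key:

```java
import java.util.Date;

import org.archive.modules.CrawlURI;
import org.archive.util.ArchiveUtils;

import crawlercommons.sitemaps.SiteMapURL;

/**
 * Sketch of using a sitemap entry's <lastmod> to drive re-crawling: carry
 * the date through as the launch timestamp so the recently-seen/de-dup
 * logic allows a re-fetch when the sitemap says the page has changed.
 * Library choice and key name are assumptions for illustration.
 */
public class SitemapLastmodScheduler {

    static final String LAUNCH_TS_KEY = "launchTimestamp"; // assumed data-map key

    static void applyLastmod(SiteMapURL entry, CrawlURI candidate) {
        Date lastmod = entry.getLastModified();
        if (lastmod != null) {
            // Entries without <lastmod> are left untouched.
            candidate.getData().put(LAUNCH_TS_KEY,
                    ArchiveUtils.get14DigitDate(lastmod.getTime()));
        }
    }
}
```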

Alternatively, an external process could grab the sitemaps, parse them, and enqueue links marked as updated since DATE (a rough sketch follows).
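
A rough sketch of such an external process, again assuming crawler-commons for parsing; how the matching URLs would actually be fed back into Heritrix (action directory, seeds feed, etc.) is left out, as that depends on the deployment:

```java
import java.io.InputStream;
import java.net.URL;
import java.time.Instant;

import crawlercommons.sitemaps.AbstractSiteMap;
import crawlercommons.sitemaps.SiteMap;
import crawlercommons.sitemaps.SiteMapParser;
import crawlercommons.sitemaps.SiteMapURL;

/**
 * Sketch: fetch one sitemap, parse it, and print every URL whose <lastmod>
 * is newer than a cut-off date. A sitemapindex would need the same logic
 * applied recursively to each child sitemap it lists.
 */
public class SitemapDiff {
    public static void main(String[] args) throws Exception {
        URL sitemapUrl = new URL(args[0]);      // e.g. https://www.gov.uk/sitemap.xml
        Instant since = Instant.parse(args[1]); // e.g. 2019-01-01T00:00:00Z

        byte[] content;
        try (InputStream in = sitemapUrl.openStream()) {
            content = in.readAllBytes();
        }

        AbstractSiteMap parsed = new SiteMapParser().parseSiteMap(content, sitemapUrl);
        if (parsed instanceof SiteMap) {
            for (SiteMapURL entry : ((SiteMap) parsed).getSiteMapUrls()) {
                if (entry.getLastModified() != null
                        && entry.getLastModified().toInstant().isAfter(since)) {
                    System.out.println(entry.getUrl());
                }
            }
        }
    }
}
```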

anjackson commented 5 years ago

Extended the robots.txt parser and added a sitemap parser, so both use the launchTimestamp to ensure all sitemaps get re-crawled when robots.txt gets refreshed. If timestamps are present in the sitemap, these are also used to set the launchTimestamp, so we should get updated content promptly. See https://github.com/ukwa/ukwa-heritrix/commit/1b11d52717bfa4fc31351fc932b413541fd0d5d9