Closed — anjackson closed this issue 5 years ago
For example, GOV.UK has a sitemapindex of sitemaps. Ideally, we want to re-download the sitemapindex and any sub-sitemaps every time we 'refresh' a host, i.e. allow child URIs to pick up the parent launch-timestamp if they are in the discovery chain: Seed -> Robots.txt -> Sitemap|Sitemapindex -> Sitemap.
So, one option would be to:

- Extend the `RobotsTxtSitemapExtractor` to annotate the extracted URIs as `isSitemapUri`, and also add a `now` launch timestamp to make sure each URI is fetched.
- In the sitemap extractor, add a `now` launch timestamp to links that appear to be links to further sitemaps.

Note that sitemap entries have timestamps which precisely fit this purpose! i.e. if we have an intelligent sitemap parser, it can issue the right re-crawl requests as we go along.
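The "intelligent sitemap parser" idea could look something like this. This is a standalone Python sketch, not Heritrix's actual extractor API: the function names are invented, and it only shows how `<lastmod>` values and sitemapindex entries could drive the launch timestamp.

```python
"""Sketch: parse a sitemap or sitemapindex and use each entry's
<lastmod> to decide the launch timestamp for re-crawling."""
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def extract_entries(sitemap_xml: str):
    """Yield (url, lastmod-or-None, is_sitemap) tuples from a sitemap
    or sitemapindex document."""
    root = ET.fromstring(sitemap_xml)
    # A <sitemapindex> lists further sitemaps; a <urlset> lists
    # ordinary content URLs.
    is_index = root.tag == NS + "sitemapindex"
    for entry in root:
        loc = entry.findtext(NS + "loc")
        lastmod = entry.findtext(NS + "lastmod")
        ts = datetime.fromisoformat(lastmod) if lastmod else None
        yield loc, ts, is_index


def launch_timestamp(lastmod, now=None):
    """Use the sitemap's own timestamp when present; otherwise fall
    back to 'now' so the URI is fetched on this refresh."""
    return lastmod or (now or datetime.now(timezone.utc))
```

Entries flagged with `is_sitemap == True` would then get the `now` launch timestamp so the sub-sitemaps themselves are always refreshed, while content URLs can use their own `<lastmod>`.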
OR, an external process that grabs sitemaps, parses them, and enqueues links marked as updated since DATE.
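That alternative could be a small standalone script along these lines. This is a sketch only: the `enqueue` callback is a placeholder (in practice it might write to Heritrix's action directory or a message queue), and the cutoff parameter stands in for DATE.

```python
"""Sketch of the external-process option: fetch a sitemap, parse it,
and enqueue only the entries updated since a cutoff date."""
import urllib.request
import xml.etree.ElementTree as ET
from datetime import datetime

NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def updated_since(sitemap_xml: str, cutoff: datetime):
    """Return URLs whose <lastmod> is on or after the cutoff; entries
    with no <lastmod> are included, to be safe."""
    root = ET.fromstring(sitemap_xml)
    urls = []
    for entry in root.iter(NS + "url"):
        loc = entry.findtext(NS + "loc")
        lastmod = entry.findtext(NS + "lastmod")
        if lastmod is None or datetime.fromisoformat(lastmod) >= cutoff:
            urls.append(loc)
    return urls


def run(sitemap_url: str, cutoff: datetime, enqueue):
    """Fetch the sitemap and hand each updated URL to enqueue()."""
    with urllib.request.urlopen(sitemap_url) as resp:
        xml = resp.read().decode("utf-8")
    for url in updated_since(xml, cutoff):
        enqueue(url)
```

The trade-off versus the in-crawler option is that this process has to be scheduled and deduplicated separately, and it cannot inherit the crawl's own launch timestamps.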
Extended the robots.txt parser and added a sitemap parser, so both use the launchTimestamp to ensure all sitemaps get re-crawled when robots.txt gets refreshed. If timestamps are present in the sitemap, these are also used to set the launchTimestamp, so we should get updated content promptly. See https://github.com/ukwa/ukwa-heritrix/commit/1b11d52717bfa4fc31351fc932b413541fd0d5d9
We can currently recrawl seeds or whole hosts. This means that sitemaps need to be added manually as seeds, and even then, sub-sitemaps won't get picked up.
Ideally, a refresh would propagate a couple of hops deep, or perhaps sitemaps should be a special case where the launch date is always inherited?
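The "always inherited" special case could be expressed as a predicate over the discovery path. This is purely illustrative: the hop letters used here ('P' for prerequisite, 'M' for sitemap-discovered) are assumptions for the sketch, not Heritrix's exact hop-path encoding.

```python
"""Sketch: a discovered URI inherits the parent's launch timestamp
only while the whole discovery chain consists of robots.txt/sitemap
hops, i.e. Seed -> robots.txt -> sitemapindex -> sitemap -> ..."""

# Assumed hop letters: 'P' = prerequisite (robots.txt),
# 'M' = discovered via a sitemap. Illustrative only.
SITEMAP_HOPS = {"P", "M"}


def inherits_launch_timestamp(hop_path: str) -> bool:
    """True if every hop from the seed is a robots/sitemap hop.
    An empty path means the URI is the seed itself."""
    return all(hop in SITEMAP_HOPS for hop in hop_path)
```

Under this rule an ordinary link found inside a sitemap-discovered page (e.g. a path ending in a link hop) would stop inheriting, which keeps the refresh bounded rather than cascading across the whole site.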