ukwa / monitrix

A monitoring system for Heritrix 3.

To account for revisit records, and display results appropriately. #8

Open anjackson opened 11 years ago

anjackson commented 11 years ago

When crawling for the second time, H3 writes revisit records, like this:

2013-02-22T14:43:42.922Z   200       2981 https://assets.digital.cabinet-office.gov.uk/static/apple-touch-icon-72x72-2ddbe540853e3ba0d30fbad2a95eab3c.png E https://www.gov.uk/government/publications image/png #020 20130222144342817+90 sha1:PGZA4RRJUGSL42LDBMU3AA3PDP7UM7J7 - 23.51.196.23,warcRevisit:digest

i.e. ending with 'warcRevisit:digest'

It would be preferable to hold two counts, the total URLs visited and the total URLs stored (stored total = visited total - deduplicated total), perhaps shown as a dial.
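A minimal sketch of the counting described above, not monitrix's actual code. It assumes the annotations are the last whitespace-separated field of each crawl log line and are comma-separated, as in the sample line; the names `is_revisit`, `count_urls` and `CrawlCounts` are hypothetical.

```python
from typing import Iterable, NamedTuple

# Annotation prefix taken from the sample log line ('warcRevisit:digest').
REVISIT_PREFIX = "warcRevisit:"


class CrawlCounts(NamedTuple):
    visited: int   # every non-empty log line
    revisits: int  # lines carrying a warcRevisit annotation
    stored: int    # visited - revisits


def is_revisit(log_line: str) -> bool:
    """Return True if the line's last field carries a warcRevisit annotation.

    Assumption: annotations are the final whitespace-separated field and
    are comma-separated, as in the sample line.
    """
    fields = log_line.split()
    if not fields:
        return False
    return any(a.startswith(REVISIT_PREFIX) for a in fields[-1].split(","))


def count_urls(lines: Iterable[str]) -> CrawlCounts:
    """Count visited, revisit and stored URLs over an iterable of log lines."""
    visited = 0
    revisits = 0
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        visited += 1
        if is_revisit(line):
            revisits += 1
    return CrawlCounts(visited, revisits, visited - revisits)
```

Calling `count_urls` on an open crawl log file would yield all three numbers in one pass; a dial could then show `stored` against `visited`.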

rsimon commented 11 years ago

Just to clarify: you're after the number of URLs visited, i.e. the total number of log lines minus the number of log lines that have the 'warcRevisit' annotation? (Alternatively, I could check, for every URL added, whether it is already indexed; both variants should then yield the same number, although the second one would carry a performance penalty during ingest.)

anjackson commented 11 years ago

Yes, to the former. All numbers are interesting: total URLs, total revisit URLs, and the total new URLs. Your alternative doesn't apply as deduplication is done between crawls.

rsimon commented 11 years ago

Ok - total no. of log lines minus lines that have the annotation should be easy to do in real time. Will go onto the HOME screen!
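As a rough illustration of why this is cheap to do in real time: each ingested line only needs a check of its last field and one or two counter increments, with no index lookup. The sketch below is an assumption about how such a running tally could look, not the monitrix implementation; `RevisitTally` and its methods are hypothetical names, and the annotation layout is assumed to match the sample log line in this issue.

```python
class RevisitTally:
    """Running totals updated once per ingested crawl log line."""

    def __init__(self) -> None:
        self.total_lines = 0    # all log lines seen so far
        self.revisit_lines = 0  # lines annotated with warcRevisit

    def ingest(self, log_line: str) -> None:
        """Update the totals for one log line; blank lines are ignored."""
        fields = log_line.split()
        if not fields:
            return
        self.total_lines += 1
        # Assumption: annotations sit in the last field, comma-separated.
        if any(a.startswith("warcRevisit:") for a in fields[-1].split(",")):
            self.revisit_lines += 1

    @property
    def new_urls(self) -> int:
        """Total log lines minus lines that have the annotation."""
        return self.total_lines - self.revisit_lines
```

A front end could poll `total_lines`, `revisit_lines` and `new_urls` at any point during ingest, since the values are always current after each call to `ingest`.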