Open anjackson opened 11 years ago
Just to clarify: you're after the number of URLs visited, i.e. the total number of log lines minus the number of log lines that have the 'warcRevisit' annotation? (Alternatively, I could check for every URL added whether it is already indexed, in which case both variants should yield the same number, although the second one would carry a performance penalty during ingest.)
Yes, to the former. All numbers are interesting: total URLs, total revisit URLs, and the total new URLs. Your alternative doesn't apply as deduplication is done between crawls.
Ok - total no. of log lines minus lines that have the annotation should be easy to do in real time. Will go onto the HOME screen!
When crawling for the second time, H3 writes revisit records, i.e. log lines ending with 'warcRevisit:digest'.
It would be preferable to hold two counts: the total URLs visited and the total URLs stored (stored total = visited total - deduplicated total), perhaps shown as a dial.
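For illustration, here is a minimal sketch of how the two counts could be derived from crawl log lines. It is not the project's actual ingest code. The names `CrawlCounts` and `count_crawl_log` are hypothetical, and it assumes that annotations sit in the last whitespace-separated field of each log line as a comma-separated list, which matches the "ending with 'warcRevisit:digest'" description above but has not been checked against every log variant.

```python
from typing import Iterable, NamedTuple

# Annotation prefix named in this thread; revisit lines end with 'warcRevisit:digest'.
REVISIT_MARKER = "warcRevisit"


class CrawlCounts(NamedTuple):
    """Hypothetical container for the counts discussed in this issue."""
    visited: int        # total log lines, i.e. total URLs visited
    deduplicated: int   # log lines carrying the revisit annotation

    @property
    def stored(self) -> int:
        # stored total = visited total - deduplicated total
        return self.visited - self.deduplicated


def count_crawl_log(lines: Iterable[str]) -> CrawlCounts:
    """Count visited and deduplicated URLs in a single pass over log lines."""
    visited = 0
    deduplicated = 0
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        visited += 1
        # Assumption: the last field holds comma-separated annotations.
        annotations = fields[-1].split(",")
        if any(a.startswith(REVISIT_MARKER) for a in annotations):
            deduplicated += 1
    return CrawlCounts(visited, deduplicated)
```

Because the function only needs one pass and keeps two integers, the same logic could run incrementally as lines arrive, which fits the "in real time" requirement mentioned earlier in the thread.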