ukwa / ukwa-heritrix

The UKWA Heritrix3 custom modules and Docker builder.

Verify that the continuous crawler is working #25

Closed anjackson closed 5 years ago

anjackson commented 5 years ago

The continuous crawler has been running successfully for weeks, but we need to verify that it is doing a sufficiently good job to justify the switch-over.

The proposal is to generate per-host crawl volume breakdowns across the daily and weekly crawl streams, and compare them to make sure they are roughly equivalent.
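As a rough illustration of the comparison step, here is a minimal sketch. It assumes the per-host URL counts are already available as plain dicts. The names `compare_host_volumes`, `baseline`, `candidate` and the 20% `tolerance` are all hypothetical; the ticket does not define what "roughly equivalent" means, so the threshold is just a placeholder.

```python
def compare_host_volumes(baseline, candidate, tolerance=0.2):
    """Flag hosts whose URL counts differ by more than `tolerance`.

    `baseline` and `candidate` map host -> number of URLs crawled.
    The difference is relative to the larger of the two counts, so a
    host missing from one side shows up as a difference of 1.0.
    """
    flagged = {}
    for host in sorted(set(baseline) | set(candidate)):
        b = baseline.get(host, 0)
        c = candidate.get(host, 0)
        larger = max(b, c)
        if larger == 0:
            continue  # nothing crawled on either side
        diff = abs(b - c) / larger
        if diff > tolerance:
            flagged[host] = (b, c, round(diff, 3))
    return flagged


if __name__ == "__main__":
    old = {"example.org": 100, "example.com": 50}
    new = {"example.org": 95, "example.com": 10}
    for host, (b, c, diff) in compare_host_volumes(old, new).items():
        print(f"{host}: baseline={b} candidate={c} relative diff={diff}")
```

Using a relative rather than absolute difference keeps small and large hosts on the same footing, though very small hosts will be flagged easily and may need a minimum-count cut-off.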

anjackson commented 5 years ago

The proposal is to write something that parses multiple log files and outputs, for each one:

Host/Target, Launch Date, Total URLs, status codes, etc.

It's not 100% clear how best to do this. One option: process each log file once and write a per-file summary to a local file or DB, then summarise over those local files/DB?
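A minimal sketch of the two-stage idea (summarise each log once, then merge the summaries), assuming the standard whitespace-separated Heritrix `crawl.log` layout where the second field is the status code and the fourth is the URI. The function names are hypothetical, and Launch Date is left out because it is not clear from the ticket where it would come from.

```python
from collections import Counter
from urllib.parse import urlsplit


def summarise_crawl_log(lines):
    """Summarise one log: {host: {"total": int, "status": Counter}}."""
    summary = {}
    for line in lines:
        fields = line.split()
        # Assumed layout: timestamp, status code, size, URI, ...
        if len(fields) < 4:
            continue  # skip blank or truncated lines
        status, uri = fields[1], fields[3]
        try:
            host = urlsplit(uri).hostname
        except ValueError:
            host = None  # malformed URI
        host = host or "(no host)"  # e.g. dns: URIs have no hostname
        entry = summary.setdefault(host, {"total": 0, "status": Counter()})
        entry["total"] += 1
        entry["status"][status] += 1
    return summary


def merge_summaries(summaries):
    """Combine per-log summaries into one overall summary."""
    merged = {}
    for summary in summaries:
        for host, entry in summary.items():
            target = merged.setdefault(host, {"total": 0, "status": Counter()})
            target["total"] += entry["total"]
            target["status"].update(entry["status"])
    return merged
```

Each per-log summary could be dumped to JSON (or a DB row per host) after the first pass, so that the second pass never has to re-read the raw logs.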

anjackson commented 5 years ago

After some analysis, a couple of problems arose. See main body of ticket.

anjackson commented 5 years ago

Closing this as it doesn't really fit as a ticket.