ukwa / ukwa-heritrix

The UKWA Heritrix3 custom modules and Docker builder.

WARC fields not populated in Kafka crawl log #90

Open anjackson opened 10 months ago

anjackson commented 10 months ago

In #89 it was noted that the warc_filename and warc_offset fields appear to be null when they should not be.

Note the fields are fine in the actual file-based crawl log:

2023-08-29T09:45:48.367Z   301          0 https://sneezecount.joyfeed.com/three-thousand-nine-hundred-and-two// LRRLL https://sneezecount.joyfeed.com/2014/04/ text/html #600 20230829094547420+614 sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ - ip:212.84.88.220,geo:GB {"contentSize":308,"warcFilename":"BL-NPLD-20230829033616191-116600-105~npld-dc-heritrix3-worker-1~8443.warc.gz","warcFileOffset":918370261,"scopeDecision":"ACCEPT by rule #5 ExternalGeoLocationDecideRule","warcFileRecordLength":1339}

So this is something to do with the Kafka version. We use

https://github.com/ukwa/ukwa-heritrix/blob/0c21b2756c823697839013254a66f06f80cfea3b/src/main/java/uk/bl/wap/crawler/postprocessor/KafkaKeyedCrawlLogFeed.java#L141-L148

Which calls

https://github.com/internetarchive/heritrix3/blob/8563f491a5b355c39a89f51b17c76aaa84752a8a/contrib/src/main/java/org/archive/modules/postprocessor/CrawlLogJsonBuilder.java#L15

So should be working, but perhaps this is just an order-of-operations problem?

Yes, it looks like the Kafka crawl log entry is written before the WARC record, which would explain the null fields. This is for some complicated reasons that need picking apart.

https://github.com/ukwa/ukwa-heritrix/blame/master/jobs/frequent/crawler-beans.cxml#L722
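To illustrate the suspected order-of-operations problem, here is a minimal, self-contained sketch. It does not use the real Heritrix or Kafka APIs: `Uri`, `warcWriter`, `logFeed` and `run` are hypothetical stand-ins, and the only assumption is that the log feed reads the WARC fields from the URI's extra info at the moment it runs. If the feed step comes before the writer step in the chain, those fields have not been set yet and come out as null.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

public class ProcessorOrderSketch {

    /** Hypothetical stand-in for a crawled URI; processors attach extra info to it. */
    static final class Uri {
        final String url;
        final Map<String, Object> extraInfo = new LinkedHashMap<>();
        Uri(String url) { this.url = url; }
    }

    /** Hypothetical WARC writer step: records where the response was stored. */
    static Consumer<Uri> warcWriter() {
        return uri -> {
            uri.extraInfo.put("warcFilename", "example.warc.gz");
            uri.extraInfo.put("warcFileOffset", 918370261L);
        };
    }

    /** Hypothetical log feed step: snapshots the WARC fields as they are right now. */
    static Consumer<Uri> logFeed(List<Map<String, Object>> sink) {
        return uri -> {
            Map<String, Object> entry = new LinkedHashMap<>();
            entry.put("url", uri.url);
            // Map.get returns null if the writer has not run yet.
            entry.put("warc_filename", uri.extraInfo.get("warcFilename"));
            entry.put("warc_offset", uri.extraInfo.get("warcFileOffset"));
            sink.add(entry);
        };
    }

    /** Runs both steps in the chosen order and returns the log entry that was emitted. */
    static Map<String, Object> run(boolean feedBeforeWriter) {
        List<Map<String, Object>> sink = new ArrayList<>();
        List<Consumer<Uri>> chain = new ArrayList<>();
        if (feedBeforeWriter) {
            chain.add(logFeed(sink));
            chain.add(warcWriter());
        } else {
            chain.add(warcWriter());
            chain.add(logFeed(sink));
        }
        Uri uri = new Uri("https://example.org/");
        for (Consumer<Uri> step : chain) {
            step.accept(uri);
        }
        return sink.get(0);
    }

    public static void main(String[] args) {
        System.out.println("feed before writer: " + run(true));
        System.out.println("feed after writer:  " + run(false));
    }
}
```

With the feed first, the entry carries null for both warc_filename and warc_offset; with the writer first, both are populated. If this matches what the bean ordering in crawler-beans.cxml is doing, the fix would be a matter of moving the feed after the writer, subject to whatever the original reasons for the current order were.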