Open anjackson opened 10 months ago
In #89 it was noted that the warc_filename and warc_offset fields appear to be null when they should not be.
Note the fields are fine in the actual file-based crawl log:
2023-08-29T09:45:48.367Z 301 0 https://sneezecount.joyfeed.com/three-thousand-nine-hundred-and-two// LRRLL https://sneezecount.joyfeed.com/2014/04/ text/html #600 20230829094547420+614 sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ - ip:212.84.88.220,geo:GB {"contentSize":308,"warcFilename":"BL-NPLD-20230829033616191-116600-105~npld-dc-heritrix3-worker-1~8443.warc.gz","warcFileOffset":918370261,"scopeDecision":"ACCEPT by rule #5 ExternalGeoLocationDecideRule","warcFileRecordLength":1339}
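For checking other entries the same way, here is a minimal sketch that pulls the WARC fields out of a file-based crawl log line. It assumes the usual 13-column layout, with the JSON extra-info blob as the last whitespace-separated field. The helper name `warc_fields` and the sample line are made up for illustration and are not part of ukwa-heritrix.

```python
import json


def warc_fields(crawl_log_line):
    """Return (warcFilename, warcFileOffset) from one crawl log line.

    Assumes 12 whitespace-separated columns followed by a JSON blob;
    returns None if the line is too short to contain the blob.
    """
    # Split at most 12 times so that spaces inside the JSON are preserved.
    parts = crawl_log_line.split(None, 12)
    if len(parts) < 13:
        return None
    extra = json.loads(parts[12])
    # .get() yields None when a field is absent, which is the symptom here.
    return extra.get("warcFilename"), extra.get("warcFileOffset")


# Illustrative line only: the values are placeholders, not real crawl data.
line = (
    "2023-08-29T09:45:48.367Z 301 0 https://example.org/a LRRLL "
    "https://example.org/ text/html #600 20230829094547420+614 "
    "sha1:AAAA - ip:192.0.2.1,geo:GB "
    '{"contentSize":308,"warcFilename":"example.warc.gz",'
    '"warcFileOffset":918370261,"scopeDecision":"ACCEPT by rule #5"}'
)

print(warc_fields(line))
```

A line whose JSON lacks the two keys returns `(None, None)`, which makes it easy to count affected entries in a log file.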
So this is specific to the Kafka version of the crawl log. We use
https://github.com/ukwa/ukwa-heritrix/blob/0c21b2756c823697839013254a66f06f80cfea3b/src/main/java/uk/bl/wap/crawler/postprocessor/KafkaKeyedCrawlLogFeed.java#L141-L148
Which calls
https://github.com/internetarchive/heritrix3/blob/8563f491a5b355c39a89f51b17c76aaa84752a8a/contrib/src/main/java/org/archive/modules/postprocessor/CrawlLogJsonBuilder.java#L15
So it should be working, but perhaps this is just an order-of-operations problem?
Yes, it looks like the Kafka crawl log message is written before the WARC, so the WARC filename and offset are not yet set at that point. This ordering is there for some complicated reasons that need picking apart.
https://github.com/ukwa/ukwa-heritrix/blame/master/jobs/frequent/crawler-beans.cxml#L722
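To illustrate the suspected order-of-operations problem, here is a toy model of a processor chain. It is not Heritrix code: `warc_writer`, `make_feed` and `run_chain` are hypothetical stand-ins, and the only assumption is that each processor sees the shared per-URI record in chain order.

```python
def warc_writer(record):
    # Stand-in for the WARC writer: only after this step does the record
    # know where it was stored. The values are placeholders.
    record["warcFilename"] = "example.warc.gz"
    record["warcFileOffset"] = 1234


def make_feed(sink):
    # Stand-in for the Kafka crawl log feed: it snapshots the record as it
    # is at the moment the processor runs and appends the message to sink.
    def feed(record):
        sink.append({
            "url": record["url"],
            "warc_filename": record.get("warcFilename"),
            "warc_offset": record.get("warcFileOffset"),
        })
    return feed


def run_chain(processors, url):
    # Run each processor in order against one shared record.
    record = {"url": url}
    for processor in processors:
        processor(record)
    return record


feed_first, writer_first = [], []
run_chain([make_feed(feed_first), warc_writer], "https://example.org/")
run_chain([warc_writer, make_feed(writer_first)], "https://example.org/")

print(feed_first[0])    # WARC fields are None: feed ran before the writer
print(writer_first[0])  # WARC fields are populated
```

If this is what is happening, moving the feed after the WARC writer in the chain would fix the nulls, provided whatever motivated the current ordering still holds.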