ukwa / webarchive-discovery

WARC and ARC indexing and discovery tools.
https://github.com/ukwa/webarchive-discovery/wiki

Add WARC compressed record length to the extraction #300

Open anjackson opened 2 years ago

anjackson commented 2 years ago

Can we use, e.g., a counting stream reader to work out how long each WARC record is in its compressed form?
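A minimal sketch of what such a counting reader could look like, not code taken from webarchive-discovery. The class name `CountingInputStream` and its `getCount()` method are hypothetical. The idea is to wrap the raw (still compressed) stream, note the byte count at the start and end of each record, and take the difference as the compressed length.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical helper: counts the bytes consumed from the wrapped stream. */
class CountingInputStream extends FilterInputStream {
    private long count = 0;

    CountingInputStream(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b != -1) {
            count++; // one byte consumed
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0) {
            count += n; // n is -1 at end of stream, so only add positive values
        }
        return n;
    }

    @Override
    public long skip(long n) throws IOException {
        long skipped = super.skip(n);
        count += skipped;
        return skipped;
    }

    /** mark/reset would invalidate the count, so report it as unsupported. */
    @Override
    public boolean markSupported() {
        return false;
    }

    long getCount() {
        return count;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the raw bytes of a WARC file (assumption: 10 bytes of dummy data).
        CountingInputStream in = new CountingInputStream(new ByteArrayInputStream(new byte[10]));
        long start = in.getCount();
        in.read(new byte[4]); // pretend this consumes one record
        System.out.println("record length: " + (in.getCount() - start));
    }
}
```

One caveat with this approach: a buffering decompressor such as `java.util.zip.GZIPInputStream` reads ahead from the underlying stream, so a counter placed beneath it may overshoot the true end of a record unless the read-ahead is accounted for.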

tokee commented 1 year ago

We (the Royal Danish Library) would like this for CDX API support. It is definitely possible, and I am 75% sure the functionality is already there, just buried at an unknown level in the convoluted stack of IndexStreams that is used. I'll see if I can find the time to dig into this.