ukwa / webarchive-discovery

WARC and ARC indexing and discovery tools.
https://github.com/ukwa/webarchive-discovery/wiki

Add WARC compressed record length to the extraction #300

Open anjackson opened 1 year ago

anjackson commented 1 year ago

Can we use, e.g., a counting stream reader to work out how long each WARC record is (in compressed form)?

tokee commented 10 months ago

We (the Royal Danish Library) would like this for CDX API support. It is definitely possible, and I am 75% sure the functionality is already there, just buried at an unknown level in the convoluted stack of IndexStreams that is used. I'll see if I can find the time to dig into this.
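
To make the counting idea discussed in this issue concrete, here is a minimal sketch in Python. It is not code from webarchive-discovery, which is a Java project. It assumes the WARC file follows the common convention of compressing each record as its own gzip member, so that a record's compressed length equals the byte length of its gzip member. The function name `iter_member_spans` and its `chunk_size` parameter are hypothetical. A plain byte counter on the underlying stream would overshoot, because the reader fetches data in chunks that can run past the end of a member, so the sketch subtracts whatever the decompressor reports as unused.

```python
import zlib


def iter_member_spans(stream, chunk_size=64 * 1024):
    """Yield (offset, compressed_length) for each gzip member in a binary stream.

    Assumes one WARC record per gzip member; decompressed bytes are discarded.
    """
    offset = 0   # byte offset in the compressed file where the current member starts
    buf = b""    # bytes already read from the stream but not yet consumed
    while True:
        if not buf:
            buf = stream.read(chunk_size)
            if not buf:
                return  # clean end of file
        # wbits=31 tells zlib to expect a gzip header and trailer
        decomp = zlib.decompressobj(wbits=31)
        consumed = 0
        while not decomp.eof:
            if not buf:
                buf = stream.read(chunk_size)
                if not buf:
                    raise ValueError("truncated gzip member at offset %d" % offset)
            decomp.decompress(buf)
            # Anything after the end of this member is left in unused_data
            consumed += len(buf) - len(decomp.unused_data)
            buf = decomp.unused_data
        yield offset, consumed
        offset += consumed
```

Opening a file with `open(path, "rb")` and iterating over `iter_member_spans(f)` would give the offset and compressed length pairs that a CDX entry needs. In a Java stack the equivalent would be a byte-counting wrapper around the raw input stream, corrected in the same way for bytes the inflater has read ahead but not used.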