machawk1 / warcreate

Chrome extension to "Create WARC files from any webpage"
https://warcreate.com
MIT License
206 stars, 13 forks

Generate WARC-Payload-Digest and WARC-Block-Digest for WARC records #65

Open machawk1 opened 10 years ago

machawk1 commented 10 years ago

Have not yet found a way to consistently do this via JavaScript. The same data from Heritrix WARCs returns hex-like values from UNIX shasum, but the Heritrix hashes contain characters outside the hex range (e.g., "M"). The WARC spec says to use a Base32 hash, but I don't know how to do this.

nlevitt commented 10 years ago

https://github.com/agnoster/base32-js ?

(Imho the base32 choice is highly regrettable. Save 8 bytes on each WARC record at the expense of interoperability with everybody else in the world. But I guess we're stuck now.)
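
To make the Base32 suggestion concrete, here is a minimal sketch of producing a `sha1:`-labelled Base32 digest, which is the form the Heritrix values mentioned above appear to take. It assumes Node's built-in `crypto` module for SHA-1 (inside the extension, some other SHA-1 implementation would be needed), and `base32Encode` and `warcDigest` are hypothetical helper names, not part of WARCreate or base32-js.

```javascript
const crypto = require('crypto');

// RFC 4648 Base32 alphabet: A-Z followed by 2-7.
const BASE32_ALPHABET = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ234567';

// Hypothetical helper: encode a byte sequence as unpadded Base32.
function base32Encode(bytes) {
  let value = 0; // bit buffer holding the bits not yet emitted
  let bits = 0;  // number of valid bits currently in the buffer
  let out = '';
  for (const byte of bytes) {
    value = (value << 8) | byte;
    bits += 8;
    // Emit one character for every complete group of 5 bits.
    while (bits >= 5) {
      out += BASE32_ALPHABET[(value >>> (bits - 5)) & 31];
      bits -= 5;
    }
    // Keep only the leftover bits so the buffer never overflows.
    value &= (1 << bits) - 1;
  }
  // Left-align any remaining bits into a final 5-bit group.
  if (bits > 0) {
    out += BASE32_ALPHABET[(value << (5 - bits)) & 31];
  }
  return out;
}

// Hypothetical helper: SHA-1 the data, then label and Base32-encode it.
function warcDigest(data) {
  const sha1Bytes = crypto.createHash('sha1').update(data).digest();
  return 'sha1:' + base32Encode(sha1Bytes);
}

console.log(warcDigest('')); // sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ
```

A SHA-1 digest is 20 bytes (160 bits), which divides evenly into 32 Base32 characters, so no padding is needed. That is also where the 8-byte saving over the 40-character hex form comes from.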

machawk1 commented 10 years ago

Thanks, @nlevitt. Would you happen to have a reference WARC with uncompressed HTML (e.g., explicitly viewable in the WARC) to verify correctness between this library and what Heritrix produces?

Step 0 for WARCreate is interoperability. What is the alternative/ideal hash algorithm to use, iyho?
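
One way to run the correctness check described above, sketched under assumptions: decode the Base32 digest from a WARC header back to hex and compare it with what UNIX `shasum` prints for the same bytes. The names `base32ToHex` and `digestHeaderToHex` are hypothetical helpers, and the `label:value` header layout is assumed from the Heritrix-style digests discussed in this thread.

```javascript
// RFC 4648 Base32 alphabet: A-Z followed by 2-7.
const B32_CHARS = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ234567';

// Hypothetical helper: decode a Base32 string to a lowercase hex string.
function base32ToHex(text) {
  let value = 0; // bit buffer holding the bits not yet emitted
  let bits = 0;  // number of valid bits currently in the buffer
  const bytes = [];
  const cleaned = text.replace(/=+$/, '').toUpperCase();
  for (const ch of cleaned) {
    const index = B32_CHARS.indexOf(ch);
    if (index === -1) {
      throw new Error('Not a Base32 character: ' + ch);
    }
    value = (value << 5) | index;
    bits += 5;
    // Emit a byte as soon as 8 bits are available.
    if (bits >= 8) {
      bytes.push((value >>> (bits - 8)) & 0xff);
      bits -= 8;
    }
    // Keep only the leftover bits so the buffer never overflows.
    value &= (1 << bits) - 1;
  }
  return bytes.map((b) => b.toString(16).padStart(2, '0')).join('');
}

// Hypothetical helper: strip the algorithm label (e.g. "sha1:") and decode.
function digestHeaderToHex(headerValue) {
  const separator = headerValue.indexOf(':');
  return base32ToHex(headerValue.slice(separator + 1).trim());
}

console.log(digestHeaderToHex('sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ'));
// da39a3ee5e6b4b0d3255bfef95601890afd80709 (SHA-1 of empty input)
```

If the hex string matches the `shasum` output for the record's payload, the two tools agree on the underlying SHA-1 and differ only in encoding.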