paracrawl / giawarc

Processing utilities for Internet Archive

Support compressed markup document types #6

Open kpu opened 4 years ago

kpu commented 4 years ago

Add support for documents that are just compressed markup.
https://github.com/bitextor/bitextor/blob/7e1de1de7431b83258b8a245b3c5b72d606bd231/bitextor-warc2htmlwarc.py#L71-L105
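For readers unfamiliar with the term, "compressed markup" covers formats such as docx, which is a zip archive whose body text lives in an XML member. The following is a minimal Python sketch of the idea, not the code from bitextor or giawarc. The function name `extract_docx_text` is hypothetical, and the sketch assumes the payload is a well-formed docx with the standard `word/document.xml` member.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements inside word/document.xml.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def extract_docx_text(data: bytes) -> str:
    """Return the plain text of a docx payload, one line per paragraph.

    Hypothetical helper for illustration only; error handling for
    corrupt archives or missing members is deliberately omitted.
    """
    # A docx file is a zip archive; the main body is a single XML member.
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        xml_bytes = archive.read("word/document.xml")

    root = ET.fromstring(xml_bytes)
    paragraphs = []
    # Each <w:p> is a paragraph; its visible text sits in <w:t> runs.
    for para in root.iter(W_NS + "p"):
        runs = [node.text or "" for node in para.iter(W_NS + "t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)
```

Other zip-based formats (xlsx, pptx, odt, epub) would need different member names and element names, so a real implementation would presumably dispatch on the detected document type.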

wwaites commented 4 years ago

Those kinds of files are pretty rare: fewer than one per WARC file from the Internet Archive. But this appears to work. I can pull text out of docx and xlsx files, though it has not been extensively tested.

I also commented out cld3 because it was driving me up the wall. Giant pain to get compiled.