iipc / webarchive-commons

Common web archive utility code.
Apache License 2.0

support WET files #66

Closed dportabella closed 7 years ago

dportabella commented 7 years ago

CommonCrawl publishes WET files, which are WARC files in which each HTML response has been converted to plain text (and non-HTML pages have been removed).

Is it possible to support WET files with webarchive-commons?

Or shall I implement this feature (handling WET archives) myself?

Is this a feature that you would include in the webarchive-commons library?

Do you see any shortcomings or problems with this, or have any other comments?

anjackson commented 7 years ago

I'm pretty sure a WET file is just a WARC file, but with 'conversion' records that contain text/plain payloads. So parsing WET files is already supported, at least in terms of basic parsing.

Looking at https://github.com/lintool/warcbase/issues/250 and tracking down the RecordLoader, it looks like the loader automatically filters out anything other than 'response' records. If that were made configurable, so you could change it to access 'conversion' records, I think you'd be all set.
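To make the two points above concrete, here is a minimal, self-contained sketch. It does not use webarchive-commons or warcbase: `WetFilterSketch`, `Record`, `read` and `record` are hypothetical names, and the parser is hand-rolled purely for illustration. It assumes an uncompressed stream (the WET files CommonCrawl distributes are gzipped, so a real reader would have to decompress first) and treats header names as case-sensitive, which is a simplification. The record-type filter is just a parameter, so asking for 'conversion' instead of 'response' is a one-word change.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Illustrative only: a tiny WARC/WET reader with a configurable record-type filter. */
public class WetFilterSketch {

    /** One parsed record: its named headers plus the raw payload bytes. */
    public static final class Record {
        public final Map<String, String> headers;
        public final byte[] payload;

        Record(Map<String, String> headers, byte[] payload) {
            this.headers = headers;
            this.payload = payload;
        }

        public String type() { return headers.get("WARC-Type"); }

        public String text() { return new String(payload, StandardCharsets.UTF_8); }
    }

    /** Reads one line (CRLF or LF terminated); returns null at end of stream. */
    private static String readLine(InputStream in) throws IOException {
        int b = in.read();
        if (b == -1) return null;
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        while (b != -1 && b != '\n') {
            if (b != '\r') buf.write(b);
            b = in.read();
        }
        return new String(buf.toByteArray(), StandardCharsets.UTF_8);
    }

    /** Parses every record in the stream, keeping only those whose WARC-Type is wanted. */
    public static List<Record> read(InputStream in, Set<String> wantedTypes) throws IOException {
        List<Record> kept = new ArrayList<>();
        String line;
        while ((line = readLine(in)) != null) {
            if (!line.startsWith("WARC/")) continue; // blank separator lines between records
            Map<String, String> headers = new LinkedHashMap<>();
            while ((line = readLine(in)) != null && !line.isEmpty()) {
                int colon = line.indexOf(':');
                if (colon > 0) {
                    headers.put(line.substring(0, colon).trim(), line.substring(colon + 1).trim());
                }
            }
            // The payload is read by length, so its content is never mistaken for headers.
            int length = Integer.parseInt(headers.getOrDefault("Content-Length", "0"));
            byte[] payload = new byte[length];
            in.readNBytes(payload, 0, length);
            Record r = new Record(headers, payload);
            if (r.type() != null && wantedTypes.contains(r.type())) kept.add(r);
        }
        return kept;
    }

    /** Builds one synthetic record for the demo; not real crawl data. */
    public static String record(String type, String body) {
        int length = body.getBytes(StandardCharsets.UTF_8).length;
        return "WARC/1.0\r\n"
                + "WARC-Type: " + type + "\r\n"
                + "Content-Type: text/plain\r\n"
                + "Content-Length: " + length + "\r\n"
                + "\r\n" + body + "\r\n\r\n";
    }

    public static void main(String[] args) throws IOException {
        String wet = record("warcinfo", "Software-Info: example")
                + record("conversion", "Hello, plain text.")
                + record("response", "HTTP/1.1 200 OK");
        InputStream in = new ByteArrayInputStream(wet.getBytes(StandardCharsets.UTF_8));
        // Swap "conversion" for "response" to get the usual WARC behaviour.
        for (Record r : read(in, Set.of("conversion"))) {
            System.out.println(r.type() + ": " + r.text());
        }
    }
}
```

Passing the wanted types in as a set, rather than hard-coding 'response', is the whole change being suggested; the parsing itself is identical for WARC and WET input.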

kris-sigur commented 7 years ago

+1 to Andy's comments. Any 'deeper' or more advanced processing of WET files probably belongs in a dedicated 'WET library'.

dportabella commented 7 years ago

Cool! I just told the warcbase guys about this: https://github.com/lintool/warcbase/issues/250#issuecomment-250478304