dportabella closed this issue 8 years ago
I'm pretty sure a WET file is just a WARC file, but with 'conversion' records that contain text/plain. So parsing WET files is already supported, at least in terms of basic parsing.
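To illustrate that point, here is a minimal sketch in Python (standard library only, not the webarchive-commons API). It shows a generic WARC record reader handling a 'conversion' record with no special casing. The `read_records` and `build_record` helpers and the sample data are hypothetical, and the input is assumed to be uncompressed (Common Crawl distributes WET files gzip-compressed, so a real file would need decompressing first).

```python
import io


def build_record(warc_type, content_type, body):
    """Build one made-up WARC-style record (hypothetical test data only)."""
    head = (
        "WARC/1.0\r\n"
        f"WARC-Type: {warc_type}\r\n"
        f"Content-Type: {content_type}\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    ).encode("utf-8")
    # Each record is followed by two CRLFs.
    return head + body + b"\r\n\r\n"


def read_records(stream):
    """Yield (headers, payload) for each record in an uncompressed WARC/WET stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"expected a WARC version line, got {line!r}")
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # a blank line ends the header block
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        # The payload is exactly Content-Length bytes, whatever the record type.
        payload = stream.read(int(headers["Content-Length"]))
        yield headers, payload


# A warcinfo record followed by a WET-style 'conversion' record.
SAMPLE = (
    build_record("warcinfo", "application/warc-fields", b"software: example\r\n")
    + build_record("conversion", "text/plain", b"Hello from a WET record.")
)

for headers, payload in read_records(io.BytesIO(SAMPLE)):
    print(headers["WARC-Type"], headers["Content-Type"], len(payload))
```

Nothing in the reader depends on the record type; only the `WARC-Type` and `Content-Type` header values differ between a WARC 'response' record and a WET 'conversion' record.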
Looking at https://github.com/lintool/warcbase/issues/250 and tracking down the RecordLoader, it looks like it automatically filters out anything other than 'response' records. If that were made configurable so that you could access 'conversion' records instead, I think you'd be all set.
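As a rough sketch of what "configurable" could mean here (plain Python, not warcbase's actual RecordLoader code; the `filter_by_type` name and the sample records are hypothetical): the record type becomes a parameter that defaults to 'response', so existing behaviour is unchanged and WET users can ask for 'conversion'.

```python
def filter_by_type(records, wanted_types=("response",)):
    """Yield only the (headers, payload) pairs whose WARC-Type is wanted.

    The default keeps the 'response'-only behaviour described above.
    """
    wanted = {t.lower() for t in wanted_types}
    for headers, payload in records:
        if headers.get("WARC-Type", "").lower() in wanted:
            yield headers, payload


# Made-up, already-parsed records for illustration only.
RECORDS = [
    ({"WARC-Type": "warcinfo"}, b""),
    ({"WARC-Type": "response"}, b"<html>...</html>"),
    ({"WARC-Type": "conversion"}, b"plain text"),
]

# Default: only 'response' records, as for WARC files today.
responses = list(filter_by_type(RECORDS))

# For WET files: ask for 'conversion' records instead.
conversions = list(filter_by_type(RECORDS, wanted_types=("conversion",)))
```

Defaulting to 'response' keeps existing callers working, which is probably the least disruptive way to expose such an option.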
+1 to Andy's comments. Any 'deeper' or more advanced processing of WET files probably belongs in a dedicated 'WET library'.
cool! I just told the warcbase guys about this. https://github.com/lintool/warcbase/issues/250#issuecomment-250478304
CommonCrawl provides WET files, which are WARC files in which the HTML responses have been converted to plain text (and non-HTML pages have been removed).
Is it possible to support WET files with webarchive-commons?
Or shall I implement this feature (handling WET archives) myself?
Is this a feature that you would include in the webarchive-commons library?
Do you see any shortcomings or problems with this, or have any other comments?