[ ] Name fields so the common ones are consistent with the CDXJ specification, i.e. make this an extension of CDXJ.
[ ] Document usage, noting that dynamic content can't be easily extracted because it's dynamic.
[ ] Store `content_first_bytes` without spaces? Store `content_ffb` as well or instead of `content_first_bytes`?
[ ] Consider allowing payload inclusion if small, e.g. smaller HTML files or initial binary chunk.
[ ] Consider extending the API so consumers can use the reference (name/offset) to get the payload InputStream.
[ ] Include `warcinfo` records in the JSONL output (currently skipped by the `windex.extract`).
[ ] Should `boiler_pipe` extraction be used?
[ ] Should extracted links be normalised?
[ ] Should image and/or PDF analysis be enabled?
[ ] Should the original payload be included if small enough? Or just for text?
[ ] Should there be an option to only output the term frequency or collocation statistics of the text, so we can do this for everything? Perhaps that's better as a post-processing step?
[ ] Both the Tika configuration `extract_all_metadata` and the experimental WARCStats code show there are lots of other metadata fields that might be of interest. These could be stored in some kind of hash, but note that Parquet/Avro schema reflection does not support hashes directly. The `MementoRecord` class illustrates that the `Memento` bean could be implemented on top of an extensible hash-map, which might make dynamic Parquet schema generation possible.
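The hash-map-backed record idea in the last item could be sketched as follows. This is a minimal, stdlib-only illustration of deriving a record schema from a map at runtime; the class name, field names (`url`, `content_length`, `is_revisit`), and the `schemaFor` helper are all hypothetical, and a real implementation would use Avro's `SchemaBuilder` rather than hand-built JSON.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Sketch: a Memento-like record backed by an extensible map, from which
// an Avro-style record schema can be derived at runtime.
public class DynamicSchemaSketch {

    // Map a Java value to a (simplified) Avro primitive type name.
    static String avroType(Object value) {
        if (value instanceof Integer || value instanceof Long) return "long";
        if (value instanceof Boolean) return "boolean";
        return "string";
    }

    // Build a minimal Avro-style record-schema JSON from the map's entries.
    static String schemaFor(String name, Map<String, Object> record) {
        StringJoiner fields = new StringJoiner(",");
        for (Map.Entry<String, Object> e : record.entrySet()) {
            fields.add("{\"name\":\"" + e.getKey()
                    + "\",\"type\":\"" + avroType(e.getValue()) + "\"}");
        }
        return "{\"type\":\"record\",\"name\":\"" + name
                + "\",\"fields\":[" + fields + "]}";
    }

    public static void main(String[] args) {
        Map<String, Object> memento = new LinkedHashMap<>();
        memento.put("url", "http://example.org/");
        memento.put("content_length", 1234L);
        memento.put("is_revisit", false);
        System.out.println(schemaFor("Memento", memento));
    }
}
```

Because the record is just a map, new metadata fields extracted by Tika would not require schema changes in the bean, only in the generated schema.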
Following #299
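The term-frequency option above could be prototyped as a post-processing step over the extracted text. A minimal sketch, with a deliberately naive tokeniser (lowercase, split on non-letters); the class name is hypothetical and a real pipeline would reuse the indexer's own analyser:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: term-frequency statistics computed from extracted text,
// as a post-processing step rather than part of extraction itself.
public class TermFrequencySketch {

    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        // Naive tokenisation: lowercase, split on any non-letter run.
        for (String token : text.toLowerCase().split("[^\\p{L}]+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(termFrequencies("The quick brown fox jumps over the lazy dog"));
        // → {the=2, quick=1, brown=1, fox=1, jumps=1, over=1, lazy=1, dog=1}
    }
}
```

Emitting only these statistics (rather than the full text) would also sidestep some of the payload-size questions above.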
Example WARC Stats code output: