ukwa / webarchive-discovery

WARC and ARC indexing and discovery tools.
https://github.com/ukwa/webarchive-discovery/wiki

Generalize rules for skipping content #286

Open tokee opened 2 years ago

tokee commented 2 years ago

warc-indexer has several properties that decide whether a WARC record is indexed or not: record_type_include, response_include, protocol_include, exclusions and url_exclude.

With some rewriting, this could be fully generalized to work on the content of any field in the generated SolrDocument, with optimizations for the situations where "no index" can be determined before analysis. That would make it possible to use white/black-lists for MIME types, domains etc. The rules could be folded into the fields in the config or be a separate section.
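
To make the idea concrete, here is a minimal sketch of what field-based rules could look like. It is not existing warc-indexer code: the class FieldFilter, its Rule type, and the field names content_type and domain are hypothetical, and a plain Map stands in for the SolrDocument. A rule whose field is not populated yet defers its decision, so the same check can run early on cheap fields (before analysis) and again on the finished document.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

/** Hypothetical sketch: include/exclude rules evaluated against arbitrary document fields. */
final class FieldFilter {

    /** One rule: a field name plus optional include (white-list) and exclude (black-list) patterns. */
    public static final class Rule {
        final String field;
        final List<Pattern> include = new ArrayList<>();
        final List<Pattern> exclude = new ArrayList<>();

        public Rule(String field) { this.field = field; }

        public Rule include(String regex) { include.add(Pattern.compile(regex)); return this; }
        public Rule exclude(String regex) { exclude.add(Pattern.compile(regex)); return this; }

        /** Returns true if the given field values permit indexing. */
        public boolean accepts(Collection<String> values) {
            if (values == null || values.isEmpty()) {
                return true; // field not populated (yet): defer the decision
            }
            // Exclusions win: any value matching any exclude pattern rejects the record.
            for (String v : values) {
                for (Pattern p : exclude) {
                    if (p.matcher(v).matches()) return false;
                }
            }
            if (include.isEmpty()) return true;
            // With include patterns present, at least one value must match one of them.
            for (String v : values) {
                for (Pattern p : include) {
                    if (p.matcher(v).matches()) return true;
                }
            }
            return false;
        }
    }

    /** A record is indexed only if every rule accepts the fields known so far. */
    public static boolean shouldIndex(List<Rule> rules, Map<String, List<String>> doc) {
        for (Rule r : rules) {
            if (!r.accepts(doc.get(r.field))) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        List<Rule> rules = List.of(
                new Rule("content_type").exclude("image/.*"),
                new Rule("domain").include("example\\.org"));

        // Early check: only the domain is known, content_type is still missing.
        Map<String, List<String>> partial = Map.of("domain", List.of("example.org"));
        System.out.println(shouldIndex(rules, partial)); // true

        // Late check: analysis has filled in the MIME type.
        Map<String, List<String>> full = Map.of(
                "domain", List.of("example.org"),
                "content_type", List.of("image/png"));
        System.out.println(shouldIndex(rules, full)); // false
    }
}
```

Letting exclusions take precedence over inclusions and treating a missing field as "undecided" are design choices of this sketch only; whether rules live next to the field definitions or in a separate config section does not change the evaluation logic.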