warc-indexer has several properties that decide whether a WARC record is indexed at all: `record_type_include`, `response_include`, `protocol_include`, `exclusions` and `url_exclude`.
With some rewriting, this could be generalized to work on the content of any field in the generated SolrDocument, with optimizations for the cases where "no index" can be determined before analyzing. That would make it possible to use whitelists and blacklists for MIME types, domains, etc. The rules could be folded into the `fields` in the config or live in a separate section.
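
To make the idea concrete, here is a rough sketch of what generalized per-field include/exclude rules with an early check could look like. It is written in Python only to keep it short; it is not warc-indexer code, and `FieldRule`, `should_index`, the `early` flag and the field names `url` and `content_type` are all assumptions made up for illustration.

```python
import re
from dataclasses import dataclass, field


@dataclass
class FieldRule:
    """Hypothetical include/exclude rule for one document field."""
    name: str
    include: list = field(default_factory=list)  # regexes; empty means "accept all"
    exclude: list = field(default_factory=list)  # regexes; any match rejects
    early: bool = False  # True if the field is known before content analysis

    def accepts(self, values):
        # Reject when any value matches an exclude pattern.
        if any(re.search(p, v) for p in self.exclude for v in values):
            return False
        # With no include patterns, everything not excluded is accepted.
        if not self.include:
            return True
        # Otherwise at least one value must match an include pattern.
        return any(re.search(p, v) for p in self.include for v in values)


def should_index(doc, rules, early_only=False):
    """Return False as soon as one rule rejects the document.

    doc maps field names to lists of string values. A field missing from
    doc is treated as having no values, so it fails any include list.
    With early_only=True, only rules marked early are evaluated, which
    models the "decide before analyzing" optimization.
    """
    for rule in rules:
        if early_only and not rule.early:
            continue
        if not rule.accepts(doc.get(rule.name, [])):
            return False
    return True


# Assumed example rules: a URL blacklist that can run on the record header
# alone, and a MIME type whitelist that needs the analyzed document.
rules = [
    FieldRule("url", exclude=[r"/robots\.txt$"], early=True),
    FieldRule("content_type", include=[r"^text/", r"^application/pdf$"]),
]

header_only = {"url": ["http://example.org/robots.txt"]}
print(should_index(header_only, rules, early_only=True))  # False

analyzed = {"url": ["http://example.org/"], "content_type": ["text/html"]}
print(should_index(analyzed, rules))  # True
```

The two-pass call pattern (first with `early_only=True` on header-level fields, then again on the full document) is one possible way to get the optimization mentioned above; whether the rules live under `fields` or in their own config section does not change the matching logic.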