ukwa / webarchive-discovery

WARC and ARC indexing and discovery tools.
https://github.com/ukwa/webarchive-discovery/wiki

Add meta stats fields #290

Open tokee opened 2 years ago

tokee commented 2 years ago

Running a web archive is often about managing scale, and about learning from experience when building the next iteration. Related to #205, which provides statistics aimed at quantitative analyses of content, we could use some index metrics:

This would help locate "large documents" and subsequently make qualified adjustments to the field limits in the config file for the next full index.

Technically it would be simple to implement as a post-analysis hook that iterates over warc-indexer's Solr Document representation.
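A rough sketch of what such a hook could look like, to make the idea concrete. It is not taken from warc-indexer: the class name `MetaStatsHook`, the `meta_stats_` field prefix, and the choice of metrics (value count and character count per field) are all assumptions for illustration. The document is modelled as a plain map from field name to values, standing in for the Solr Document representation.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical post-analysis hook; not part of warc-indexer. */
final class MetaStatsHook {

    /** Hypothetical prefix for the generated stats fields. */
    static final String PREFIX = "meta_stats_";

    /**
     * Iterates over all fields of a document (field name -> values) and
     * returns per-field value counts and character counts, plus totals.
     * Character counts use the string form of each value (an assumption).
     */
    static Map<String, Long> computeStats(Map<String, List<Object>> doc) {
        Map<String, Long> stats = new LinkedHashMap<>();
        long totalValues = 0;
        long totalChars = 0;

        for (Map.Entry<String, List<Object>> field : doc.entrySet()) {
            long chars = 0;
            for (Object value : field.getValue()) {
                chars += String.valueOf(value).length();
            }
            long values = field.getValue().size();

            stats.put(PREFIX + field.getKey() + "_values", values);
            stats.put(PREFIX + field.getKey() + "_chars", chars);

            totalValues += values;
            totalChars += chars;
        }

        // Document-wide totals, useful for spotting "large documents".
        stats.put(PREFIX + "total_fields", (long) doc.size());
        stats.put(PREFIX + "total_values", totalValues);
        stats.put(PREFIX + "total_chars", totalChars);
        return stats;
    }
}
```

The returned map would be added to the document as extra fields just before it is sent to Solr, so that range queries or stats on the totals can surface outliers. Running the hook after any field truncation would record what was actually indexed; running it before would record what the source contained. Which of the two is more useful for tuning the limits is an open question.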