ukwa / webarchive-discovery

WARC and ARC indexing and discovery tools.
https://github.com/ukwa/webarchive-discovery/wiki

Field rewrite #257

Closed: tokee closed this 2 years ago

tokee commented 3 years ago

Clean up the existing content adjustment mechanism for SolrDocument (max content length, UTF8 sanitising, white space normalisation) and add optional regexp-based replacement rules.
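To make the description concrete, here is a minimal Java sketch of such an adjustment chain: UTF-8 sanitising, white space normalisation, regexp-based replacement rules, then a maximum length. The class `FieldAdjusterSketch`, its method names, and the order of the steps are assumptions for illustration only; they are not the code in this pull request.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not the class introduced by this pull request.
final class FieldAdjusterSketch {
    // One regexp-based replacement rule: a compiled pattern plus its replacement.
    private static final class Rule {
        final Pattern pattern;
        final String replacement;
        Rule(Pattern pattern, String replacement) {
            this.pattern = pattern;
            this.replacement = replacement;
        }
    }

    private final int maxLength; // a negative value disables truncation
    private final List<Rule> rules = new ArrayList<>();

    FieldAdjusterSketch(int maxLength) {
        this.maxLength = maxLength;
    }

    FieldAdjusterSketch addRule(String regexp, String replacement) {
        rules.add(new Rule(Pattern.compile(regexp), replacement));
        return this;
    }

    String adjust(String value) {
        // Round-trip through UTF-8: malformed input such as unpaired
        // surrogates is replaced by the encoder instead of being kept.
        String s = new String(value.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
        // Collapse every run of white space to a single space and trim the ends.
        s = s.replaceAll("\\s+", " ").trim();
        // Apply the replacement rules in the order they were added.
        for (Rule rule : rules) {
            s = rule.pattern.matcher(s).replaceAll(rule.replacement);
        }
        // Enforce the maximum length last, counted in UTF-16 chars.
        // Note: a plain substring can split a surrogate pair at the cut.
        if (maxLength >= 0 && s.length() > maxLength) {
            s = s.substring(0, maxLength);
        }
        return s;
    }
}
```

A rule such as `addRule("^https?://(www\\.)?", "")` would strip the scheme and a leading `www.` from a URL-like field before the length limit is applied. Truncating last is a design choice in this sketch, so that the rules see the full value.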

This closes #256 and makes #152 easier to implement.

The easiest way to see how this works is to open reference.conf and look at the field_setup section.

With an eye to #152, we should consider setting a default limit for both max_values and max_length, to guard against any single resource blowing up, for example a page whose author deliberately packed it with millions of links.
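As a rough sketch of what such a guard could look like, the Java snippet below caps both the number of values in a multi-valued field and the length of each value. Only the names max_values and max_length come from this thread; the class `FieldLimitsSketch`, the method `cap`, and the keep-the-first-values policy are assumptions, and no particular default is implied.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical guard for illustration; not part of this pull request.
final class FieldLimitsSketch {
    // Keeps at most maxValues entries, each cut to at most maxLength chars.
    static List<String> cap(List<String> values, int maxValues, int maxLength) {
        List<String> capped = new ArrayList<>();
        for (String value : values) {
            if (capped.size() >= maxValues) {
                break; // drop everything beyond the value limit
            }
            capped.add(value.length() > maxLength ? value.substring(0, maxLength) : value);
        }
        return capped;
    }
}
```

With both limits in place, the worst case for a single field is bounded by max_values times max_length characters, regardless of how many links the page contains.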

tokee commented 3 years ago

Whoops, this started as a url_norm-only feature and expanded into a generic mechanism for all fields, but I forgot to remove the url_norm-specific code. Will do after vacation.

tokee commented 3 years ago

The special URL length handling has now been generified and the pull request is ready for review.