ukwa / webarchive-discovery

WARC and ARC indexing and discovery tools.
https://github.com/ukwa/webarchive-discovery/wiki

url_search with CamelCasing does not work as intended #178

Open tokee opened 6 years ago

tokee commented 6 years ago

The url_search field is of type path, which uses the WordDelimiterFilterFactory. That filter splits things like foo/bar_ZooBoom.png into [foo, bar, Zoo, Boom, png], and the tokens are then lowercased to [foo, bar, zoo, boom, png].
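To make the splitting concrete, here is a rough Python imitation of what is described above. It is not the Lucene filter and it ignores token positions entirely. The function name `split_like_word_delimiter` is made up for illustration, and the rules (split on non-alphanumerics, then on lower-to-upper case changes) are a simplifying assumption that happens to reproduce the example.

```python
import re


def split_like_word_delimiter(text):
    """Hypothetical, simplified stand-in for the word-delimiter splitting.

    Assumption: split on runs of non-alphanumeric characters, then split
    each chunk wherever a lowercase letter is followed by an uppercase one.
    """
    parts = []
    for chunk in re.split(r"[^A-Za-z0-9]+", text):
        if not chunk:
            continue
        # Insert a space at each lower-to-upper transition, then split on it,
        # e.g. "ZooBoom" -> "Zoo Boom" -> ["Zoo", "Boom"].
        spaced = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", chunk)
        parts.extend(spaced.split())
    return parts


tokens = split_like_word_delimiter("foo/bar_ZooBoom.png")
lowercased = [t.lower() for t in tokens]
print(tokens)
print(lowercased)
```

This only shows which terms end up in the index. The matching problem described below depends on the positions of those terms, which this sketch does not model.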

This all works fine with a search for e.g. url_search:(foo bar png), but the CamelCase-part (ZooBoom) is not handled properly. If the search is for url_search:(zooboom) or url_search:(Zooboom), there will be a hit, but if the search is for url_search:(zoo boom) or url_search:(ZooBoom), there won't be. Very counter-intuitive.

The filter passes the tokens from a CamelCased string ZooBoom as [Zoo, Boom] without any distance between the tokens (normally the distance is 1): they are two separate tokens, but are treated for most purposes as one. Querying for zooboom or Zooboom matches because standard lowercasing produces a term that matches the collapsed indexed tokens. Querying for zoo boom does not match, because the two terms get parsed as two tokens with a distance of 1. I haven't figured out why ZooBoom does not work.

The WordDelimiterFilterFactory is deprecated in favour of WordDelimiterGraphFilterFactory, but a simple switch to the new factory & a re-index on a test-corpus did not solve the problem. More investigation needed.
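One way to investigate further might be Solr's field analysis handler, which reports the tokens and positions produced on the index side and the query side for a given field. The sketch below only builds the request URL and does not call Solr. The base URL and the core name `mycore` are placeholders, and it assumes the `/analysis/field` handler is enabled on the core.

```python
from urllib.parse import urlencode, urlparse, parse_qs


def field_analysis_url(base_url, core, field, index_value, query_value):
    """Build a URL for Solr's field analysis handler.

    base_url and core are placeholders; adjust them to the actual setup.
    """
    params = {
        "analysis.fieldname": field,         # field whose analysis chain is used
        "analysis.fieldvalue": index_value,  # text analysed as at index time
        "analysis.query": query_value,       # text analysed as at query time
        "analysis.showmatch": "true",        # flag index tokens that match query tokens
        "wt": "json",
    }
    return f"{base_url}/solr/{core}/analysis/field?{urlencode(params)}"


url = field_analysis_url(
    "http://localhost:8983",  # placeholder host and port
    "mycore",                 # hypothetical core name
    "url_search",
    "foo/bar_ZooBoom.png",
    "zoo boom",
)
print(url)
```

Running the same request with ZooBoom as the query value and comparing the reported positions against the zoo boom case may show where the two sides diverge.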