
JapaneseNumberFilter does not take whitespaces into account when concatenating numbers [LUCENE-8959] #956

Open mikemccand opened 4 years ago

mikemccand commented 4 years ago

Today the JapaneseNumberFilter tries to concatenate numbers even if they are separated by whitespace. For instance, "10 100" is rewritten into "10100" even if the tokenizer doesn't discard punctuation. In practice this is not an issue, but it can lead to giant tokens if there are a lot of numbers separated by spaces. The number of concatenations should be configurable, with a sane default limit, in order to avoid creating giant tokens that slow down the analysis if the tokenizer is not correctly configured.
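
For illustration only (this sketch is not part of the original report; the class name is hypothetical and the Lucene 8.x kuromoji APIs are assumed), here is a minimal reproduction of the setup being described: when the whitespace token between the numbers is removed before JapaneseNumberFilter runs (here via `discardPunctuation = true`), the filter sees "10" and "100" as adjacent numerals and concatenates them.

```java
import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.ja.JapaneseNumberFilter;
import org.apache.lucene.analysis.ja.JapaneseTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class NumberConcatDemo {
  public static void main(String[] args) throws Exception {
    // discardPunctuation = true: the space between the numbers is dropped
    // by the tokenizer, so the number filter never sees it.
    Tokenizer tokenizer =
        new JapaneseTokenizer(null, true, JapaneseTokenizer.Mode.SEARCH);
    tokenizer.setReader(new StringReader("10 100"));

    TokenStream stream = new JapaneseNumberFilter(tokenizer);
    CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);

    stream.reset();
    while (stream.incrementToken()) {
      // Per the report above, this prints a single token "10100".
      System.out.println(term.toString());
    }
    stream.end();
    stream.close();
  }
}
```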


Legacy Jira details

LUCENE-8959 by Jim Ferenczi (@jimczi) on Aug 29 2019

mikemccand commented 4 years ago

Sounds like a good idea. This is also a rather big rabbit hole...

Would it be useful to consider making the digit grouping separators configurable as part of a bigger scheme here?

In Japanese, if you're processing text with SI numbers, I believe a space is a valid digit grouping separator.

[Legacy Jira: Christian Moen on Aug 29 2019]

mikemccand commented 4 years ago

Update: whitespace tokens were removed in my tests because I was using the default JapanesePartOfSpeechStopFilter before the JapaneseNumberFilter. The behavior is correct when discardPunctuation is set correctly and the JapanesePartOfSpeechStopFilter does not precede the JapaneseNumberFilter in the chain. We could protect against the rabbit hole for users who forget to set discardPunctuation to false, or who remove the whitespace in a preceding filter, but the behavior is correct. Sorry for the false alarm.
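
A minimal sketch of a chain along these lines (not from the original thread; the class name is hypothetical and the Lucene 8.x kuromoji APIs are assumed): the tokenizer keeps punctuation, JapaneseNumberFilter runs while the whitespace token still separates "10" and "100", and only afterwards does the POS stop filter drop punctuation and whitespace using the default stop tags.

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.ja.JapaneseAnalyzer;
import org.apache.lucene.analysis.ja.JapaneseNumberFilter;
import org.apache.lucene.analysis.ja.JapanesePartOfSpeechStopFilter;
import org.apache.lucene.analysis.ja.JapaneseTokenizer;

public class NumberAwareJapaneseAnalyzer extends Analyzer {
  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    // Keep punctuation so the whitespace between "10" and "100" survives as a token.
    Tokenizer tokenizer =
        new JapaneseTokenizer(null, /* discardPunctuation */ false, JapaneseTokenizer.Mode.SEARCH);
    // Normalize numbers while the whitespace tokens are still present.
    TokenStream stream = new JapaneseNumberFilter(tokenizer);
    // Only now drop punctuation/whitespace tokens via the default POS stop tags.
    stream = new JapanesePartOfSpeechStopFilter(stream, JapaneseAnalyzer.getDefaultStopTags());
    return new TokenStreamComponents(tokenizer, stream);
  }
}
```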

[Legacy Jira: Jim Ferenczi (@jimczi) on Aug 29 2019]