mikemccand / stargazers-migration-test

Testing Lucene's Jira -> GitHub issues migration

Enable TokenFilters to assign offsets when splitting tokens [LUCENE-8450] #449

Open mikemccand opened 6 years ago

mikemccand commented 6 years ago

CharFilters and TokenFilters may alter token lengths, meaning that subsequent filters cannot perform simple arithmetic to calculate the original ("correct") offset of a character in the interior of the token. A similar situation exists for Tokenizers, but these can call CharFilter.correctOffset() to map offsets back to their original location in the input stream. There is no such API for TokenFilters.

This issue calls for adding an API to support use cases like highlighting the correct portion of a compound token. For example the German word "außerstand" (meaning, afaict, "unable to do something") will be decompounded and match "stand" and "ausser", but as things are today, offsets are always set using the start and end of the tokens produced by the Tokenizer, meaning that highlighters will highlight the entire compound.

I'm proposing to add this method to TokenStream:

     public CharOffsetMap getCharOffsetMap();

where CharOffsetMap has these methods:

     int correctOffset(int currentOff);
     int uncorrectOffset(int originalOff);

 

The uncorrectOffset method is a pseudo-inverse of correctOffset, mapping from original offset forward to the current "offset space".
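
To make the intended use concrete, here is a rough sketch of how a decompounding filter might use this API to set sub-token offsets. Only getCharOffsetMap, correctOffset and uncorrectOffset come from the proposal; offsetAtt, subStart and subEnd are illustrative names:

    // Inside a hypothetical decompounding TokenFilter: the sub-token occupies
    // [subStart, subEnd) of the current term text; offsetAtt is the filter's
    // OffsetAttribute.
    CharOffsetMap offsetMap = getCharOffsetMap();   // proposed TokenStream method
    // Map the token's original start offset forward into the current offset space,
    // step to the sub-token, then map back to original offsets for highlighting:
    int curStart = offsetMap.uncorrectOffset(offsetAtt.startOffset());
    offsetAtt.setOffset(
        offsetMap.correctOffset(curStart + subStart),
        offsetMap.correctOffset(curStart + subEnd));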


Legacy Jira details

LUCENE-8450 by Michael Sokolov (@msokolov) on Aug 09 2018, updated Aug 20 2018. Attachments: offsets.patch

mikemccand commented 6 years ago

As mentioned on the mailing list, I don't think we should do this for tokenfilters due to complexity: it is not just on the API side, but the maintenance side too: worrying about how to keep all the tokenfilters correct.

I am sorry that some tokenfilters that really should be tokenizers extend the wrong base class, but that problem should simply be fixed.

Separately, I don't like the correctOffset() method that we already have on Tokenizer today. Maybe it could be in the OffsetAttributeImpl or similar instead. But at least today it's just one method.

[Legacy Jira: Robert Muir (@rmuir) on Aug 09 2018]

mikemccand commented 6 years ago

I am sorry that some tokenfilters that really should be tokenizers extend the wrong base class, but that problem should simply be fixed.

A tokenfilter such as decompounding can't really be a tokenizer since it needs normalization that is provided by earlier components (at the very least lower casing, but also splitting on script changes and other character normalization). I guess one could just smoosh all analysis logic into the tokenizer, but that really defeats the purpose of the architecture, which supports a nicely modular chain of filters.

I suppose there is potential maintenance pain. First though, note that the patch I posted here does not actually implement this for all existing tokenfilters and tokenizers. It merely demonstrates the approach (and changes the implementation, but not the behavior, of char filters). We can merge this patch and everything will work just as it did before. Once we actually start using the API and downstream consumers rely on the offsets being correctable, then there would be some expectation of maintaining that. Let me work those changes into the patch for at least ICUFoldingFilter so we can see how burdensome that would be. I'll also note that the effect of failing to maintain it would just be that token-splitting tokenfilters generate broken offsets, as they do today, just differently broken.

[Legacy Jira: Michael Sokolov (@msokolov) on Aug 09 2018]

mikemccand commented 6 years ago

I feel pretty strongly that we shouldn't go this route. Dividing text up into tokens is what the Tokenizer class is for in Lucene.

Also, when I say maintenance side, just look at the offending filters that really should be tokenizers: they are all a real nightmare: CJKBigramFilter, WordDelimiterFilter, etc. These things are monstrously complex and difficult to work with; it's clearly not what a tokenfilter should be doing.

I don't want to change this situation into "differently broken" where somehow now we have to add logic to hundreds of tokenfilters when we could just fix the 2 or 3 bad ones such as wdf to be tokenizers instead.

[Legacy Jira: Robert Muir (@rmuir) on Aug 09 2018]

mikemccand commented 6 years ago

This approach does not impose any requirement to add any logic to all existing token filters. To get the benefit, it's only really necessary to change filters that change the length of tokens, and there are pretty few of these. As far as "tokenizing" token filters, they are basically operating as consumers of corrected offsets, and we can choose to leave the situation as is on a case-by-case basis for these. We can continue to use the existing full-width offsets for the generated sub-tokens, just ignoring this correction API, and fix only the ones we want to.

I agree that tokenizer should in general be the class for splitting tokens, but there are reasons why these other filters have been implemented as they are; I mentioned some regarding DictionaryCompoundWordTokenFilter. I don't really know about the design of CJKBigramFilter, but it seems it also relies on ICUTokenizer running prior?

[Legacy Jira: Michael Sokolov (@msokolov) on Aug 09 2018]

mikemccand commented 6 years ago

Separately, I don't like the correctOffset() method that we already have on Tokenizer today. Maybe it could be in the OffsetAttributeImpl or similar instead.

I like that idea – it always seemed weird that the correctOffset was only available via CharFilter whereas OffsetAttribute is really the more natural place for it.

 

[Legacy Jira: Michael McCandless (@mikemccand) on Aug 10 2018]

mikemccand commented 6 years ago

To get the benefit, it's only really necessary to change filters that change the length of tokens, and there are pretty few of these

Sorry, I think this is wrong. For example most language stemmers are doing simple suffix or prefix stripping of tokens. They definitely change the length. These are hard enough, why should they have to deal with this artificial problem?

In general your problem is caused by "decompounders" that use the wrong base class. For Chinese, decomposition is a tokenizer. For Japanese, it's the same. For Korean, it's the same. These aren't the buggy ones.

The problem is stuff like the Decompound* filters geared at languages like German, and the WordDelimiterFilter, and so on. These should be fixed. Sorry, this is honestly still a tokenization problem: breaking the text into meaningful tokens. These should not be tokenfilters; that will fix the issue.

Maybe it makes sense for something like standardtokenizer to offer a "decompound hook" or something that is very limited (e.g., not a chain, just one thing) so that european language decompounders don't need to duplicate a lot of the logic around punctuation and unicode. Perhaps the same functionality could be used for "word-delimiter" so that people can have a "unicode standard" tokenizer but it just handles some ambiguous cases differently (such as when the case changes and when there are hyphens, etc.). I think Lucene is weak here; I don't think we should cancel the issue, but at the same time I don't think we should try to give tokenfilters "tokenizer" capabilities just for artificial code-reuse purposes: the abstractions need to make sense so that we can prevent and detect bugs, do a good job testing, and maintain all the code.

[Legacy Jira: Robert Muir (@rmuir) on Aug 10 2018]

mikemccand commented 6 years ago

Separately, I don't like the correctOffset() method that we already have on Tokenizer today. Maybe it could be in the OffsetAttributeImpl or similar instead.

I think correctOffset should indeed be part of the OffsetAttribute (we need to extend the interface). But we have to make sure that it does not contain any hidden state. Attributes are only "beans" with getters and setters and no hidden state, and must be symmetric (if you set something by setter, the getter must return it unmodified). They can be used as a state on their own (like FlagsAttribute) to control later filters, but they should not have any hidden state that affects how the setters work.
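
One stateless way to do that might be to make the mapping itself an explicit bean property of the attribute. Purely a sketch, not an existing Lucene interface (CharOffsetMap is the interface proposed above):

    import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

    // Hypothetical: correctOffset lives on the attribute, but the mapping it
    // delegates to is installed through a plain setter, so the attribute stays a
    // simple bean with symmetric getters/setters and no hidden state.
    public interface CorrectedOffsetAttribute extends OffsetAttribute {

      /** Install the offset-correction mapping (e.g. built by the CharFilter chain). */
      void setOffsetMap(CharOffsetMap offsetMap);

      CharOffsetMap getOffsetMap();

      /** Map an offset in the current text back to the original input. */
      default int correctOffset(int currentOff) {
        CharOffsetMap map = getOffsetMap();
        return map == null ? currentOff : map.correctOffset(currentOff);
      }
    }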

Maybe it makes sense for something like standardtokenizer to offer a "decompound hook" or something that is very limited (e.g., not a chain, just one thing) so that european language decompounders don't need to duplicate a lot of the logic around punctuation and unicode

Actually that's the real solution for the decompounding or WordDelimiterFilter. Actually all tokenizers should support it. Maybe that can be done in the base class so that incrementToken() becomes final. Instead the parsing code could push tokens that are passed to the decompounder, and then incrementToken returns them. So incrementToken is final, calls some next method for the tokenization, and passes the result to the decompounder, which is a no-op by default.
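
Roughly something like this; just a sketch of the shape of the idea, all names are made up, and a decompounder that splits one token into several would additionally need to buffer its output:

    import java.io.IOException;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.util.AttributeSource;

    // Sketch: the base class owns incrementToken(), subclasses only produce raw
    // tokens, and an optional decompounder sees every token before it is returned.
    public abstract class DecompoundingCapableTokenizer extends Tokenizer {

      /** Hook that is called once per raw token; a no-op implementation is the default. */
      public interface Decompounder {
        void decompound(AttributeSource token);
      }

      private final Decompounder decompounder;

      protected DecompoundingCapableTokenizer(Decompounder decompounder) {
        this.decompounder = decompounder;
      }

      /** Subclasses do the actual parsing and fill the attributes for the next token. */
      protected abstract boolean nextRawToken() throws IOException;

      @Override
      public final boolean incrementToken() throws IOException {
        if (!nextRawToken()) {
          return false;
        }
        decompounder.decompound(this);  // the no-op default leaves the token unchanged
        return true;
      }
    }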

Another way would be to have a special type of TokenFilter where the input is not TokenStream, but instead Tokenizer (constructor takes "Tokenizer" instead of "TokenStream", the "input" field is also Tokenizer). In general decompounders should always be directly after the tokenizer (some of them may need to lowercase currently to process the token like dictionary based decompounders, but that's a bug, IMHO). Those special TokenFilters "know" and can rely on the Tokenizer and call correctOffset on it if they split tokens.
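
A minimal sketch of that second variant (hypothetical, not an existing Lucene class; it also assumes Tokenizer exposed its correctOffset method to such filters, today that method is protected):

    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.Tokenizer;

    // Sketch: a filter whose input is statically known to be a Tokenizer. A
    // decompounder extending this class sits directly after the tokenizer and
    // could ask it to correct the offsets of the sub-tokens it creates.
    public abstract class DecompoundingTokenFilter extends TokenFilter {

      /** Typed view of the input; decompounders can rely on tokenizer services. */
      protected final Tokenizer tokenizer;

      protected DecompoundingTokenFilter(Tokenizer input) {
        super(input);            // the rest of the chain still sees a plain TokenStream
        this.tokenizer = input;  // but subclasses get the stronger type
      }
    }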

[Legacy Jira: Uwe Schindler (@uschindler) on Aug 10 2018]

mikemccand commented 6 years ago

About Tokenizer/decompounder as a paired thing, not a chain: then you could not combine WordDelimiterFilter with another decompounder? One would still need to be a filter. I'm curious what problem there would be with stacking Tokenizers, say?

I'm also reluctant to place WDF early in the chain because it does some weird recombinations that I only want to happen after I have already had a chance to do more tightly-scoped processing like handling numbers: decimals, fractions and the like.

[Legacy Jira: Michael Sokolov (@msokolov) on Aug 10 2018]

mikemccand commented 6 years ago

I'm also reluctant to place WDF early in the chain because it does some weird recombinations that I only want to happen after I have already had a chance to do more tightly-scoped processing like handling numbers: decimals, fractions and the like.

I get that sentiment, but I would like for us to think a little bit farther than just "before" or "after" according to the current status quo. The fact of life here is, it seems you want to customize the tokenizer in certain ways (e.g. numerics) but also other ways such as hyphens or similar. I don't think it's abnormal or a strange use-case; I think we just don't support it well with the current APIs.

[Legacy Jira: Robert Muir (@rmuir) on Aug 10 2018]

mikemccand commented 6 years ago

The fact of life here is, it seems you want to customize the tokenizer in certain ways (e.g. numerics) but also other ways such as hyphens or similar

If something like WDGF existed that did not turn "1-1/2" into "112", but still did a reasonable job at handling part numbers like "X-4000" <-> X4000, that might help avoid some of the sequencing constraints. I would still want to be able to use German decompounding in the same Analyzer with this punctuation-handling thing though. With perfect highlighting :) I don't have any better solutions yet, but I like the way this discussion is opening up possibilities!

[Legacy Jira: Michael Sokolov (@msokolov) on Aug 10 2018]

mikemccand commented 6 years ago

Actually that's the real solution for the decompounding or WordDelimiterFilter. Actually all tokenizers should support it. Maybe that can be done in the base class so that incrementToken() becomes final. Instead the parsing code could push tokens that are passed to the decompounder, and then incrementToken returns them. So incrementToken is final, calls some next method for the tokenization, and passes the result to the decompounder, which is a no-op by default.

If we take this simplistic approach it means the decompounder only sees a single token (versus, say, an entire sentence or phrase)? This might only work for "easy" decompounder algorithms like WDF and the German Decompounding* implementations. Maybe it is possible to refactor ThaiTokenizer to this and it will also be fine? Currently that one gets the context of the whole sentence (but I am not sure whether it needs that / whether it impacts the current underlying algorithm).

But I think Chinese, Japanese, and Korean tokenizers use more context than just one whitespace/punctuation-delimited word (n-gram features and so on in the model). So it's good just to think things through a bit; it would be great to consolidate a lot of this if we can.

At the same time I think it's ok to make the API limited, you know, if we think that will help true use-cases today. So we could just document that if you use this decompounder interface you only see individual delimited tokens and not some bigger context. I'm hoping we can avoid situations where the algorithm has to capture/restore a bunch of state: if we end up with that, then things haven't really gotten any better.

[Legacy Jira: Robert Muir (@rmuir) on Aug 11 2018]

mikemccand commented 6 years ago

I'm hoping we can avoid situations where the algorithm has to capture/restore a bunch of state: if we end up with that, then things haven't really gotten any better.

If some decompounder needs that, I think it should keep the tracking internally. As it is called for each token in order, it can keep track of previous tokens internally (it can just clone attributes when it's called and add them to linked lists, wtf). No need to add that in the API.
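
For example, something like this inside the filter, using only existing AttributeSource methods (the enclosing TokenFilter subclass and the window size of 8 are assumed):

    // Inside a hypothetical context-tracking TokenFilter subclass
    // (State is the inherited AttributeSource.State; Deque/ArrayDeque from java.util):
    private final Deque<State> previousTokens = new ArrayDeque<>();

    @Override
    public boolean incrementToken() throws IOException {
      if (!input.incrementToken()) {
        return false;
      }
      previousTokens.addLast(captureState());  // snapshot of all attributes of this token
      if (previousTokens.size() > 8) {         // keep only a small look-back window
        previousTokens.removeFirst();
      }
      // ... decompounding logic can look back at previousTokens here ...
      return true;
    }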

[Legacy Jira: Uwe Schindler (@uschindler) on Aug 11 2018]

mikemccand commented 6 years ago

I was trying to think about how to make progress here, but I am still hung up on the "bug" Uwe pointed out:

... In general decompounders should always be directly after the tokenizer (some of them may need to lowercase currently to process the token like dictionary based decompounders, but that's a bug, IMHO).

then thinking oh maybe lowercasing can be handled by a charfilter, but then some token filters will want access to the original case, so it's a conundrum. If you have a sequence of text processors you really do want to be able to choose their order flexibly. If we make hard structural constraints like this, it seems to be creating more problems that will be difficult to solve. Do we have ideas about how to handle this case-folding in decompounders in this scheme? Would we just fold the case-folding into the decompounder? What about other character normalization like ß =>  "ss", ligatures, accent-folding and so on? Does the decompounder implement all that internally? Do we force that to happen in a CharFilter if you want to use a decompounder?

[Legacy Jira: Michael Sokolov (@msokolov) on Aug 19 2018]

mikemccand commented 6 years ago

Look at WordDelimiterFilter: it wants to use case information in order to split stuff up, for example.

The Tokenizer's/decompounder's job is to divide the text into tokens. That is it. Making things lowercase, removing accents: none of that is related to this process at all.

The subsequent filters need to be doing the actual normalization.

[Legacy Jira: Robert Muir (@rmuir) on Aug 20 2018]