
Start offset going backwards has a legitimate purpose [LUCENE-8776] #774

Open mikemccand opened 5 years ago

mikemccand commented 5 years ago

Here is the use case where startOffset can go backwards:

Say there is a line "Organic light-emitting-diode glows", and I want to run span queries and highlight them properly. 

During index time, light-emitting-diode is split into three words, which allows me to search for 'light', 'emitting' and 'diode' individually. The three words occupy adjacent positions in the index, as 'light' adjacent to 'emitting' and 'light' at a distance of two words from 'diode' need to match this word. So, the order of words after splitting are: Organic, light, emitting, diode, glows. 

But, I also want to search for 'organic' being adjacent to 'light-emitting-diode' or 'light-emitting-diode' being adjacent to 'glows'. 

The way I solved this was to also generate 'light-emitting-diode' at two positions: (a) in the same position as 'light' and (b) in the same position as 'diode', like below:

pos  term(s)
0    organic
1    light, light-emitting-diode
2    emitting
3    diode, light-emitting-diode
4    glows

The positions of the two 'light-emitting-diode' are 1 and 3, but the offsets are obviously the same. This works beautifully in Lucene 5.x in both searching and highlighting with span queries. 
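
For concreteness, here is roughly what that token stream looks like (a sketch only, using the Token and CannedTokenStream helpers from lucene-test-framework purely for illustration; the end-exclusive offsets are my reading of the example):

```java
import org.apache.lucene.analysis.CannedTokenStream;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;

public class BackwardOffsetsExample {
  // "Organic light-emitting-diode glows"
  public static TokenStream tokens() {
    Token organic  = new Token("organic", 0, 7);
    Token light    = new Token("light", 8, 13);
    Token led1     = new Token("light-emitting-diode", 8, 28);
    led1.setPositionIncrement(0);   // stacked on "light" (position 1)
    Token emitting = new Token("emitting", 14, 22);
    Token diode    = new Token("diode", 23, 28);
    Token led2     = new Token("light-emitting-diode", 8, 28);
    led2.setPositionIncrement(0);   // stacked on "diode" (position 3); startOffset jumps back to 8
    Token glows    = new Token("glows", 29, 34);
    return new CannedTokenStream(organic, light, led1, emitting, diode, led2, glows);
  }
}
```

In 5.x this indexed fine; in 7.6 it is the second light-emitting-diode token that trips the check, because its startOffset (8) is smaller than the startOffset of the previously emitted token (23).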

But when I try this in Lucene 7.6, it hits the condition "Offsets must not go backwards" at DefaultIndexingChain:818. This IllegalArgumentException is being thrown without any comments on why this check is needed. As I explained above, startOffset going backwards is perfectly valid, to deal with word splitting and span operations on these specialized use cases. On the other hand, it is not clear what value is added by this check and which highlighter code is affected by offsets going backwards. This same check is done at BaseTokenStreamTestCase:245. 

I see others talk about how this check found bugs in WordDelimiter etc. but it also prevents legitimate use cases. Can this check be removed?


Legacy Jira details

LUCENE-8776 by Ram Venkat on Apr 23 2019, updated Oct 26 2021. Attachments: LUCENE-8776-proof-of-concept.patch

mikemccand commented 5 years ago

The check is important for several reasons:

[Legacy Jira: Robert Muir (@rmuir) on Apr 24 2019]

mikemccand commented 5 years ago

Robert - Thanks for the quick, detailed response. 

If negative deltas of offsets break postings, shouldn't the check be included only if postings are used? 

If performance gets worse for large documents, isn't it better to just log a warning, rather than completely remove that feature? Net performance depends on other factors like hardware, right?

We use the default highlighter with term vectors. Functionally, offsets going backwards works well as I explained in my previous post. We do extensive performance tests and we do not have an issue there either. 

At this point, we are forced to remove this check and recompile the source. Instead, can we move this check to where postings are used?

 

[Legacy Jira: Ram Venkat on Apr 24 2019]

mikemccand commented 5 years ago

I'd rather keep this check for the reasons Robert mentioned. Only enabling the check when offsets are indexed doesn't sound like a great trade-off to me as we'd be accepting a broken token stream that would only not cause trouble because offsets are not indexed. Can you either give light, emitting and diode the same offsets as "light-emitting-diode" or give the two additional "light-emitting-diode" tokens you are introducing the same offsets as "light" for the first one and "diode" for the last one?

[Legacy Jira: Adrien Grand (@jpountz) on Apr 24 2019]

mikemccand commented 5 years ago

If performance gets worse for large documents, isn't it better to just log a warning, rather than completely remove that feature? Net performance depends on other factors like hardware, right?

No, as computer scientists we don't write bad algorithms and tell people to buy more hardware. And for a library, logging anything, especially logging a warning rather than enforcing the contract, is the wrong thing to do.

At this point, we are forced to remove this check and recompile the source. Instead, can we move this check to where postings are used?

no.

[Legacy Jira: Robert Muir (@rmuir) on Apr 24 2019]

mikemccand commented 5 years ago

Adrien - That will not work as searching for "organic adjacent to light" would highlight the entire word "light-emitting-diode" instead of just "light". And only light or diode gets highlighted when light-emitting-diode is given the same offset as light or diode (when you search for light-emitting-diode). 

Robert,

We are not writing any new "bad" algorithm. We have been using this feature for a while. Allowing offsets to go backwards has been an existing feature in Lucene for a long time. This check and exception broke that feature. 

And, no, I am not asking anyone to buy more hardware. It's just a figure of speech to say that the net performance depends on many factors, and a certain part of the code being order n-squared may or may not affect the net performance, due to many other factors. In our case, it does not. That is the only point I want to make. 

Removing a long existing feature in Lucene because (a) it affects a newer feature (postings) which is used by some people or (b) might cause a noticeable performance degradation in some cases, is not a great argument. We are dependent on this feature. We have no alternatives at this point. And, I have proof that it does not affect performance in a noticeable way, with extensive testing in our environment/data etc. Plus, I am guessing that we are not the only one in the world using this feature.  

For these reasons, we should either move this check and exception to other parts of Lucene (without affecting indexing and standard highlighter) or remove it. 

 

[Legacy Jira: Ram Venkat on Apr 24 2019]

mikemccand commented 5 years ago

Ram, it's good that this solution worked for you, but taking a step back, your solution seems like a workaround for LUCENE-7398 and LUCENE-4312. Workarounds aren't inherently bad of course, but when they depend on ambiguity of (or lack of enforcement of) contracts, backward compatibility can't be guaranteed (to paraphrase what I take Robert and Adrien to be saying).

Of course, one person's "patch" is another person's "workaround", but I'd be curious to know whether any of the "LUCENE-7398/*" branches might help for your use case. (There's a high-level description in this comment on the LUCENE-7398 issue). Particularly relevant to this discussion: the patch supports recording token positionLength in the index, and enforces index ordering by startPosition and endPosition (compatible with ordering specified for the Spans API).

[Legacy Jira: Michael Gibney (@magibney) on Apr 24 2019]

mikemccand commented 5 years ago

I think your use case can be properly handled as a token graph, without offsets going backwards, if you set proper PositionLengthAttribute for each token; indeed it's for exactly cases like this that we added PositionLengthAttribute.

Give your light-emitting-diode token PositionLengthAttribute=3 so that the consumer of the tokens knows it spans over the three separate tokens (light, emitting and diode).

To get correct behavior you must do this analysis at query time, and Lucene's query parsers will properly interpret the resulting graph and query the index correctly.  Unfortunately, you cannot properly index a token graph: Lucene discards the PositionLengthAttribute which is why if you really want to index a token graph you should insert a FlattenGraphFilter at the end of your chain.  This still discards information (loses the graph-ness) but tries to do so minimizing how queries are broken.
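
A rough sketch of what I mean (again using the Token / CannedTokenStream test helpers only for illustration; a real TokenFilter would set the same attributes on its own tokens):

```java
import org.apache.lucene.analysis.CannedTokenStream;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.FlattenGraphFilter;

public class GraphTokenExample {
  // Single-position graph: one light-emitting-diode token stacked on "light",
  // with positionLength=3, instead of emitting it twice.
  public static TokenStream graph() {
    Token led = new Token("light-emitting-diode", 8, 28);
    led.setPositionIncrement(0);  // same position as "light"
    led.setPositionLength(3);     // spans light + emitting + diode
    return new CannedTokenStream(
        new Token("organic", 0, 7), new Token("light", 8, 13), led,
        new Token("emitting", 14, 22), new Token("diode", 23, 28), new Token("glows", 29, 34));
  }

  // Index-time chain only: flatten the graph, since positionLength is not stored
  // in the index. Query-time analysis should keep the graph intact.
  public static TokenStream forIndexing() {
    return new FlattenGraphFilter(graph());
  }
}
```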

[Legacy Jira: Michael McCandless (@mikemccand) on Apr 24 2019]

mikemccand commented 5 years ago

Michael G and Michael M:

We do not use Lucene's standard query parser. We have made significant enhancements to Surround Parser (several thousand lines of code), and that is our primary parser. I can see how I will pass the PositionLengthAttribute and change the adjacency distance in the query. But I have to implement that in SurroundParser and I can do that sometime in the future and contribute it. 

So, back to my original point: This check broke existing applications with valid use cases, without providing a workaround. A simple way to bypass the check would be sufficient for our purposes. I believe that we should make that change. 

[Legacy Jira: Ram Venkat on Apr 24 2019]

mikemccand commented 5 years ago

I am sorry that this change broke your use-case, but rejecting backward offsets still sounds like a better trade-off to me.

[Legacy Jira: Adrien Grand (@jpountz) on Apr 26 2019]

mikemccand commented 5 years ago

Unfortunately, you cannot properly index a token graph: Lucene discards the PositionLengthAttribute which is why if you really want to index a token graph you should insert a FlattenGraphFilter at the end of your chain. This still discards information (loses the graph-ness) but tries to do so minimizing how queries are broken.

Would there be any interest in revisiting LUCENE-4312: adding support for indexed position length (PositionLengthAttribute)?

Since Ram's is an index-time use case, I see only two options:

  1. @mikemccand's suggestion, which would compromise phrase query accuracy (e.g., missing "light-emitting-diode glows"), and
  2. @jpountz's initial suggestion, which would compromise highlighting precision.

For graph TokenStreams, indexed position length could be used to fully address (as opposed to minimize) "how queries are broken", in addition to avoiding the tradeoff/compromise in the case described here. It would also enable index-time multi-term synonyms, etc ... 

[Legacy Jira: Michael Gibney (@magibney) on Apr 26 2019]

mikemccand commented 5 years ago

Adrien,

There was no trade-off required here. Postings and negative offsets could have peacefully coexisted, with different implementations using one or the other. A choice was made knowingly or unknowingly to break some implementations, to the benefit of a new feature. 

Michael G,

I like the PositionLengthAttribute. But, like I said earlier, that still means that all the relevant query parsers have to make use of the positionLength, before it can replace the negative offsets feature. Till that happens, this check cannot be forced on the users. 

 

[Legacy Jira: Ram Venkat on Apr 26 2019 [updated: Apr 27 2019]]

mikemccand commented 5 years ago

@venkat11  I'm sorry this change broke your use case.

I think allowing backwards offsets was an accidental but longstanding bug in prior versions of Lucene.  It is unfortunate your code came to rely on that bug, but we need to be able to fix our bugs and move forwards.

@magibney a 3rd option in your list would be for @venkat11 to fix his query parser to properly consume the graph, and generate fully accurate queries, the way Lucene's query parsers now do.  Then you can have precisely matching queries, no bugs.

[Legacy Jira: Michael McCandless (@mikemccand) on Apr 29 2019 [updated: Oct 26 2021]]

mikemccand commented 5 years ago

@mikemccand - You are mischaracterizing a long-standing Lucene feature as a "bug". Offsets going backwards worked exactly as we wanted. But it's not worth getting into that cliche.

I am not saying that this feature should not be retired, if it adds great value to do so. But users should be given the time to migrate their implementations to use alternate methods. That is just standard practice in maintaining any product or library, especially a mature library like Lucene. Hence, there should be a significant period of time where users can bypass the check that prevents indexing such documents (with negative offsets).

About us enhancing our query parser: it is not trivial. I am not sure whether the Lucene standard query parser (or whatever you are referring to) can deal with the combination of wildcards and term distance. For example, "light* adjacent_to glows" should match "light-emitting-diode glows". This can be done in our parser, but it is just not a small enough task for us to do as part of a version upgrade. This is why we need time to do this. 

[Legacy Jira: Ram Venkat on May 01 2019]

mikemccand commented 5 years ago

@venkat11 I do understand your frustration. Believe me, we don't take changes like this lightly. One person's bug is another person's feature, and as we grow and mature, strong guarantees are essential for a vast majority of users, for future developments, faster iterations and more performant code. There might not be a tradeoff from your perspective; from the maintainers' perspective there is. Now we can debate whether a major version bump is enough time to migrate or not; our policy is that we can make BWC and behavioral changes like this in a major release. In fact we don't do it in minors, to provide you the time you need and to ease upgrades to minors. We will build, and have built, features on top of this guarantee, and in order to manage expectations I am pretty sure we won't go back and allow negative offsets. I think your best option, whether you like it or not, is to work towards a fix for your issue either with the tools you have now or by improving Lucene, for instance with the suggestion from @magibney regarding indexing more information. 

Please don't get mad at me, I am just trying to manage expectations. 

[Legacy Jira: Simon Willnauer (@s1monw) on May 02 2019 [updated: Oct 26 2021]]

mikemccand commented 5 years ago

Simon Willnauer - I am frustrated with this change to Lucene, but not at any of you. I am actually quite grateful to each of you for taking the time to respond to my comments. The fact is that I am just a user complaining about a free library, without having made any contribution. I hope to contribute the improvements I have made in the future, if/when my clients approve.

To summarize this discussion:

  1. Disallowing negative offsets is a permanent, irrevocable change to Lucene. Users will not be allowed to work around this new restriction, although it affects only those who use postings (Unified Highlighter) at this time. 
  2. This change breaks the ability to split words at index time into multiple tokens and correctly search and highlight them with span queries, especially in conjunction with wildcards. There is no easy alternative, and the solution requires a significant enhancement to query parsers. 

Can we at least either add these comments to the code or refer this jira issue from the exception thrown? If you agree, I can submit a patch. 

[Legacy Jira: Ram Venkat on May 03 2019]

mikemccand commented 5 years ago

I don't think we should change the exception message at all. Please remember: the code is open source, you can modify the source code for whatever purpose.

So I think such a message comes out wrong, I don't even buy the argument that it "broke" anything: we're not putting a gun to your head and forcing you to upgrade or anything. You can stay on an older version of the code, you can fork the code and modify it, all kinds of solutions.

[Legacy Jira: Robert Muir (@rmuir) on May 03 2019]

mikemccand commented 5 years ago

@venkat11 I am curious if you've tried the latest version of the UnifiedHighlighter that can be configured to use the "WeightMatches" API. This is toggled via the org.apache.lucene.search.uhighlight.UnifiedHighlighter.HighlightFlag#WEIGHT_MATCHES flag. This isn't the default at the Lucene level but ought to be changed to be. You may notice some highlighting differences, like for phrases and spans in which phrases as a whole get one pair of open/close tags instead of the constituent words. There's a chance that this mode ameliorates your highlighting woes with offsets.
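
Something along these lines, as a sketch (this assumes the current 8.x API where getFlags(String) is protected and overridable, and that you already have an IndexSearcher, Analyzer, Query and TopDocs in hand):

```java
import java.io.IOException;
import java.util.EnumSet;
import java.util.Set;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.uhighlight.UnifiedHighlighter;
import org.apache.lucene.search.uhighlight.UnifiedHighlighter.HighlightFlag;

public class WeightMatchesHighlightExample {
  static String[] highlightWithWeightMatches(IndexSearcher searcher, Analyzer analyzer,
                                             Query query, TopDocs topDocs, String field) throws IOException {
    UnifiedHighlighter uh = new UnifiedHighlighter(searcher, analyzer) {
      @Override
      protected Set<HighlightFlag> getFlags(String f) {
        Set<HighlightFlag> flags = EnumSet.noneOf(HighlightFlag.class);
        flags.addAll(super.getFlags(f));          // keep the defaults
        flags.add(HighlightFlag.WEIGHT_MATCHES);  // switch to the Weight Matches mode
        return flags;
      }
    };
    return uh.highlight(field, query, topDocs);
  }
}
```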

[Legacy Jira: David Smiley (@dsmiley) on Jul 26 2019]

mikemccand commented 5 years ago

@dsmiley  I have not tried the UnifiedHighlighter yet. Phrases getting a highlight tag as a whole is a very useful and cool feature.  

My problem is that I cannot tell the highlighter that I want the highlight to begin at an earlier offset than the offset of the previous token. Is there a way to communicate that information to the UnifiedHighlighter?  

[Legacy Jira: Ram Venkat on Jul 29 2019]

mikemccand commented 5 years ago

I think your response/question is about tokenstream contracts with offsets and not about highlighting, and that's well established territory already discussed in this thread.

[Legacy Jira: David Smiley (@dsmiley) on Jul 29 2019]

mikemccand commented 5 years ago

@dsmiley It would be challenging to discuss highlighting, in the context of this thread, without also discussing offsets. If you are asking me whether that particular feature in UnifiedHighlighter will solve my problem, based on my current understanding, the answer would be 'no'. I can explain more if you want me to.

To solve my problem, either offsets have to go backwards or I have to rewrite my search logic to treat adjacency on either side differently.  

[Legacy Jira: Ram Venkat on Jul 29 2019]

mikemccand commented 4 years ago

I too suffer from the same issue; we have multi-token synonyms that can even overlap. I recognize the arguments against backward offsets, but I find them surprisingly backwards: they are saying that the implementation dictates function, when the function is what (for many people) is the goal. The arguments also seem to say that the most efficient implementation (non-negative integer deltas) does not allow backward offsets, therefore backward offsets are a bug. 

Please recognize that the most elegant implementation sometimes means "as complex as needed" – it is not the same as "the simplest". If negative vints consume 5 bytes instead of 4, some people need to and are willing to pay that price. Their use cases cannot simply be 'boxed' into a world where one is only looking ahead and never back (NLP is one such world).

Lucene is however inviting one particular solution:

The implementation of vint does not seem to mind if there is a negative offset (https://issues.apache.org/jira/browse/LUCENE-3738), and DefaultIndexingChain extends DocConsumer – the name 'Default' suggests that at some point in the past, Lucene developers wanted to provide other implementations. As it is right now, it is not easy to plug in a different DocConsumer – that surely seems like an important omission! (one size fits all?)

So if we just add a simple mechanism to instruct Lucene which DocConsumer to use, then all could be happy and not have to resort to dirty hacks or forks. The most efficient impl will be the default, yet it will allow us - dirty bastards - to shoot ourselves in the foot if we so desire. SOLR as well as ElasticSearch devs might not mind having the option in the future - it can come in handy. Wouldn't that be wonderful? Well, wonderful certainly not, just useful... could I do it? @rmuir @mikemccand @s1monw

[Legacy Jira: Roman Chyla on Aug 10 2020]

mikemccand commented 4 years ago

I think preventing negative offsets is more than just allowing people to pay the price of 5 bytes per negative offset. There are many more things than compression based on this behavior. Just a few that come to my mind right away:

So if we just add a simple mechanism to instruct Lucene which DocConsumer to use, then all could be happy and not have to resort to dirty hacks or forks. The most efficient impl will be the default, yet will allow us us - dirty bastards - shoot ourselves in foot if we so desire. SOLR as well as ElasticSearch devs might not mind having the option in the future - it can come in handy. Wouldn't that be wonderful? Well, wonderful certainly not, just useful... could I do it?

If you feel like you want to implement this, go ahead. I am sure there will be more issues, like CheckIndex not working anymore, etc. There might be future features you will break by doing this. But again, it's all open source; nobody forces you to upgrade or use the code we package directly. You can download the source and modify it; that is all up to you.

We as a community need to make decisions that sometimes don't work for everybody. We have a great responsibility with a project like Lucene being used in an unbelievably wide range of applications, and that sometimes means adding restrictions. We don't take this easily; it's almost always a hard decision. Having offsets go forward only will be a win for a large majority, and that's why we keep this restriction. I am truly sorry for your struggle.

Talking about extending DocConsumer, I am torn whether it should be an extension point. I know that there is an implementation out there that uses it, but if I could pick I'd remove this extension since it's, as you say, way too hard to get right.

[Legacy Jira: Simon Willnauer (@s1monw) on Aug 12 2020]

mikemccand commented 4 years ago

I just wanted to chip in a lateral idea that crossed my mind - if these offsets are required only for highlighting, then one could get around the problem stated in the description of this issue by indexing positions only and retrieving hit position offsets using the matches API. The offsets for those positions would have to be retrieved using a different attribute than the offsets attribute, but that attribute could go backwards... Yes, it would require re-parsing the content for highlighting, but it would also be a clean and elegant solution to the original problem without the need to hack Lucene internals.

There is a fresh set of components that could be used for building such a highlighter on top of the matches API in LUCENE-9439. All that would be needed is a specific implementation of a position-to-offsets strategy that retrieves those arbitrary offsets for a token's position. I think this could work. Just a thought.
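
As a very rough sketch of what I have in mind (the PositionToOffsets strategy below is hypothetical; it stands in for whatever re-derives character offsets for a hit position, e.g. by re-analyzing the stored text):

```java
import java.io.IOException;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Matches;
import org.apache.lucene.search.MatchesIterator;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreMode;
import org.apache.lucene.search.Weight;

public class PositionBasedHighlightSketch {

  /** Hypothetical strategy: map a token position back to character offsets. */
  interface PositionToOffsets {
    int[] offsetsForPosition(int docId, String field, int position) throws IOException;
  }

  static void collectHighlights(IndexSearcher searcher, Query query, String field,
                                int docId, PositionToOffsets strategy) throws IOException {
    Weight weight = searcher.createWeight(searcher.rewrite(query), ScoreMode.COMPLETE_NO_SCORES, 1f);
    for (LeafReaderContext ctx : searcher.getIndexReader().leaves()) {
      int leafDoc = docId - ctx.docBase;
      if (leafDoc < 0 || leafDoc >= ctx.reader().maxDoc()) {
        continue; // hit lives in a different segment
      }
      Matches matches = weight.matches(ctx, leafDoc);
      if (matches == null) {
        continue;
      }
      MatchesIterator it = matches.getMatches(field);
      while (it != null && it.next()) {
        // Positions come from the index; offsets come from the pluggable strategy,
        // so they are free to "go backwards" relative to position order.
        int[] range = strategy.offsetsForPosition(docId, field, it.startPosition());
        // ... emit a highlight for range[0]..range[1]
      }
    }
  }
}
```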

[Legacy Jira: Dawid Weiss (@dweiss) on Aug 12 2020]

mikemccand commented 4 years ago

@simonw  no doubt that the decisions are not always easy, I appreciate the attention you are giving the matter. The arguments presented here are meant to provide more information and if you decide that the matter at hand has no merit, there is no point arguing (and no bad blood). It is actually easier for us to fork Lucene – but it seems (and the stress is 'seems') wrong for Lucene to intentionally limit itself in what it was doing so well, so I'm trying to nudge the case.

 

Consider this: I have disabled the checks in DefaultIndexingChain and rerun the full suite of tests; these tests started failing:

 

7.7: org.apache.lucene.index.TestPostingsOffsets

8.6: org.apache.lucene.index.TestPostingsOffsets

 

You'll notice that the only new tests failing are those that enforce the check. Maybe there are no tests written for index integrity?

 

As to the issue at hand: whether the implementation/extension can be made 'pluggable'. My view is the following: if you constrain what is already in Lucene, you are forcing people to make forks. We are somewhere in between - it would be easy to provide an option to plug in a custom chain. It would cost little to give people the option (with BIG NEON WARNINGS pasted all over if necessary).

 

Ok, that's an argument by practicality – not a strong one. But how about the "nothing is broken" part? (Yes, the tests that enforce the condition are failing – but nothing else is broken.) Use cases, on the other hand, are broken: there are already two examples of projects that got this complex scenario right (our project is one such example) - I asked in the forum, and @dsmiley and ~gh_at struggled in their work with the same issue. I'm not meaning to drag them in for them to weigh in (but I wouldn't mind, obviously ;)), I'm just trying to illustrate that the limitations break real-case scenarios.

And the benefits still seem to be in the realm of "future possibilities". Sure, that is not to be dismissed lightly; they are important concerns. But if we as engineers always choose the most efficient over the most optimal, we would always be building "houses" without windows (these things lose energy, make people fall from heights, require cleaning - they are incredibly wasteful!)

The next thing I could test is to run a performance test with a tokenizer chain which allows backward offsets and one which employs the flatten tokenizer, and report the results. But I'm going to do that only if the case is really open for consideration; otherwise it would be a waste of time.

[Legacy Jira: Roman Chyla on Aug 12 2020]

mikemccand commented 4 years ago

I really like the relatively new offset ordering constraint as a default for some of the reasons Simon gave. There's this sensibility to it; it just makes sense intuitively in a way that needs no defense.  I upgraded a bunch of old code to follow this rule and I'm happier with the upgraded version of those components than the prior behavior.  But then there's some interesting/advanced cases where the rule is simply impossible to follow.  What I'd like to see is a way for expert users to toggle this off.  Perhaps a setting on the Analyzer passed to IndexWriterConfig or just some other setting on IndexWriterConfig.  Or maybe a read-only setting on the OffsetAttribute (requiring a custom impl)?  That'd be my preference.  I think a pluggable IndexingChain thing seems too invasive / too difficult to get right.  I'm willing to roll up my sleeves and make this setting happen if I know in advance I'm not going to be vetoed on principle.

[Legacy Jira: David Smiley (@dsmiley) on Aug 13 2020]

mikemccand commented 4 years ago

But then there's some interesting/advanced cases where the rule is simply impossible to follow.

I am still struggling to understand what use cases are actually made impossible by IndexWriter's enforcing of valid offsets.

I sent a reply to [~roman.chyla@gmail.com] on the dev list to try to understand those examples better.

While the cost of writing a negative vInt was one of the motivations for stricter offset checking, another (stronger?) motivation was catching buggy analysis components that were producing invalid offsets, causing problems only detected much later (at search time).  IndexWriter's offset enforcement allows detecting such buggy TokenFilters earlier.

[Legacy Jira: Michael McCandless (@mikemccand) on Aug 13 2020]

mikemccand commented 4 years ago

I don't think it can be rearranged for both positions and offsets to be non-decreasing given the positions are fixed (so that they match on spans)?

organic light emitting diode glows
|    |    |    |    |    |    |    |
0    5    10   15   20   25   30   35

pos  term(s)               offset (inclusive)
0    organic               0-6
1    light                 8-12
1    light-emitting-diode  8-27
2    emitting              14-21
3    diode                 23-27
3    light-emitting-diode  8-27
4    glows                 29-33

[Legacy Jira: Dawid Weiss (@dweiss) on Aug 14 2020]

mikemccand commented 4 years ago

LUCENE-8776-proof-of-concept.patch is a proof-of-concept of what I had in mind when I mentioned the new match highlighter could source arbitrary offsets for positions. Not everything is perfect though since the default position-to-offsets strategy doesn't really know which term at a given position was the actual match, so it takes the longest possible range. This means a query for "light" would highlight "light-emitting-diode" because they are indexed at the same position. Maybe it could be improved with term matching internally... but it's not the point you're trying to make so I'm not going to look any further.

I agree with you that with the current indexing chain I don't know how to achieve the token stream with the positions and offsets you showed.

[Legacy Jira: Dawid Weiss (@dweiss) on Aug 14 2020]

mikemccand commented 4 years ago

OK I think I understand the original example in the opening post here.

The light-emitting-diode token is repeated a 2nd time (at position 3) so that the span/phrase query "light-emitting-diode glows" works correctly?  But then what about the span/phrase query "organic light-emitting-diode glows", which ought to match but I think even for your workaround (double-indexing light-emitting-diode) will not then work?

I see this solution as working around the fact that positionLength is not indexed in Lucene.  Yet, Lucene already offers an accurate way to solve all of this, at query time, by properly consuming the token graph output after tokenizing a query (including positionLength of the tokens) and creating a correct query such that all of the above examples would work correctly, without producing two or more light-emitting-diode tokens.

Are you sure you cannot use the query-time solution, so you get correct positional queries?  You are asking us to remove the IndexWriter offset checks so your workaround by double-tokenizing (which is not always correct?) can work again?

[Legacy Jira: Michael McCandless (@mikemccand) on Aug 14 2020]

mikemccand commented 4 years ago

Sorry for crossposting (to the forum and here); I will try to study @dweiss's example, but here is some useful writeup – please jump to the last example, where the PositionLength attribute would fail us.

assertU(adoc("id", "603", "bibcode", "xxxxxxxxxx603",
         "title", "THE HUBBLE constant: a summary of the hubble space telescope program"));

 

term=hubble posInc=2 posLen=1 type=word offsetStart=4 offsetEnd=10
 term=acr::hubble posInc=0 posLen=1 type=ACRONYM offsetStart=4 offsetEnd=10
 term=constant posInc=1 posLen=1 type=word offsetStart=11 offsetEnd=20
 term=summary posInc=1 posLen=1 type=word offsetStart=23 offsetEnd=30
 term=hubble posInc=1 posLen=1 type=word offsetStart=38 offsetEnd=44
 term=syn::hubble space telescope posInc=0 posLen=3 type=SYNONYM offsetStart=38 offsetEnd=60
 term=syn::hst posInc=0 posLen=3 type=SYNONYM offsetStart=38 offsetEnd=60
 term=acr::hst posInc=0 posLen=3 type=ACRONYM offsetStart=38 offsetEnd=60
 * term=space posInc=1 posLen=1 type=word offsetStart=45 offsetEnd=50
 term=telescope posInc=1 posLen=1 type=word offsetStart=51 offsetEnd=60
 term=program posInc=1 posLen=1 type=word offsetStart=61 offsetEnd=68

* - fails because of offsetEnd < lastToken.offsetEnd; If reordered (the multi-token synonym emitted as a last token) it would fail as well, because of the check for lastToken.beginOffset < currentToken.beginOffset. Basically, any reordering would result in a failure (unless offsets are trimmed).

The following example has an additional twist because of space-time; the tokenizer first splits the word and generates two new tokens – those alternative tokens are then used to find synonyms (space == universe).

assertU(adoc("id", "605", "bibcode", "xxxxxxxxxx604",
         "title", "MIT and anti de sitter space-time"));
term=xxxxxxxxxx604 posInc=1 posLen=1 type=word offsetStart=0 offsetEnd=13
 term=mit posInc=1 posLen=1 type=word offsetStart=0 offsetEnd=3
 term=acr::mit posInc=0 posLen=1 type=ACRONYM offsetStart=0 offsetEnd=3
 term=syn::massachusetts institute of technology posInc=0 posLen=1 type=SYNONYM offsetStart=0 offsetEnd=3
 term=syn::mit posInc=0 posLen=1 type=SYNONYM offsetStart=0 offsetEnd=3
 term=acr::mit posInc=0 posLen=1 type=ACRONYM offsetStart=0 offsetEnd=3
 term=anti posInc=1 posLen=1 type=word offsetStart=8 offsetEnd=12
 term=syn::ads posInc=0 posLen=4 type=SYNONYM offsetStart=8 offsetEnd=28
 term=acr::ads posInc=0 posLen=4 type=ACRONYM offsetStart=8 offsetEnd=28
 term=syn::anti de sitter space posInc=0 posLen=4 type=SYNONYM offsetStart=8 offsetEnd=28
 term=syn::antidesitter spacetime posInc=0 posLen=4 type=SYNONYM offsetStart=8 offsetEnd=28
 term=syn::antidesitter space posInc=0 posLen=4 type=SYNONYM offsetStart=8 offsetEnd=28
 * term=de posInc=1 posLen=1 type=word offsetStart=13 offsetEnd=15
 term=sitter posInc=1 posLen=1 type=word offsetStart=16 offsetEnd=22
 term=space posInc=1 posLen=1 type=word offsetStart=23 offsetEnd=28
 term=syn::universe posInc=0 posLen=1 type=SYNONYM offsetStart=23 offsetEnd=28
 term=time posInc=1 posLen=1 type=word offsetStart=29 offsetEnd=33
 term=spacetime posInc=0 posLen=1 type=word offsetStart=23 offsetEnd=33

So far, all of these cases could be handled with the new position length attribute. But let us look at a case where that would fail too.

assertU(adoc("id", "606", "bibcode", "xxxxxxxxxx604",
         "title", "Massachusetts Institute of Technology and antidesitter space-time"));
term=massachusetts posInc=1 posLen=1 type=word offsetStart=0 offsetEnd=12
 term=syn::massachusetts institute of technology posInc=0 posLen=4 type=SYNONYM offsetStart=0 offsetEnd=36
 term=syn::mit posInc=0 posLen=4 type=SYNONYM offsetStart=0 offsetEnd=36
 term=acr::mit posInc=0 posLen=4 type=ACRONYM offsetStart=0 offsetEnd=36
 term=institute posInc=1 posLen=1 type=word offsetStart=13 offsetEnd=22
 term=technology posInc=1 posLen=1 type=word offsetStart=26 offsetEnd=36
 term=antidesitter posInc=1 posLen=1 type=word offsetStart=41 offsetEnd=53
 term=syn::ads posInc=0 posLen=2 type=SYNONYM offsetStart=41 offsetEnd=59
 term=acr::ads posInc=0 posLen=2 type=ACRONYM offsetStart=41 offsetEnd=59
 term=syn::anti de sitter space posInc=0 posLen=2 type=SYNONYM offsetStart=41 offsetEnd=59
 term=syn::antidesitter spacetime posInc=0 posLen=2 type=SYNONYM offsetStart=41 offsetEnd=59
 term=syn::antidesitter space posInc=0 posLen=2 type=SYNONYM offsetStart=41 offsetEnd=59
 term=space posInc=1 posLen=1 type=word offsetStart=54 offsetEnd=59
 term=syn::universe posInc=0 posLen=1 type=SYNONYM offsetStart=54 offsetEnd=59
 term=time posInc=1 posLen=1 type=word offsetStart=60 offsetEnd=64
 term=spacetime posInc=0 posLen=1 type=word offsetStart=54 offsetEnd=64

Notice the posLen=4 of MIT; it would cover the tokens massachusetts institute technology antidesitter while the offsets are still correct.

[Legacy Jira: Roman Chyla on Aug 14 2020]

mikemccand commented 4 years ago

I see this solution as working around the fact that positionLength is not indexed in Lucene

I think that indeed describes the motivation behind @venkat11's example. For some such cases, restricting graph tokenStreams to query-time would help; but I don't think there's any escaping the fact that there are cases where graph tokenStreams would more appropriately be generated at index time (e.g., using extra context/information that can only be available at index time). There's always LUCENE-4312 :-) ...

I'm also still struggling to understand [~roman.chyla@gmail.com]'s examples, which re-focused attention on this issue; like @mikemccand, I don't see why any of these examples would fail based on the offset constraints in DefaultIndexingChain. AFAICT, there is no "offsetEnd < lastToken.offsetEnd" restriction; each startOffset in the examples above is >= the preceding startOffset, and each endOffset is >= its corresponding startOffset. I feel like I'm missing something. [~roman.chyla@gmail.com], can you share your actual analysis chain config?

No matter what startOffset/endOffsets your tokens have, even if they get really crazy, it should be possible (at least in principle) to order tokens such that the constraint of "no backward startOffsets" holds, no? I think, as @dsmiley hints at, where this could (perhaps?) become impossible is if such ordering would conflict with ordering based on positions.

To my mind (echoing @s1monw's "have clear bounds of what an offset can be and in what direction it can grow"): the main benefit of formalizing/enforcing the order of tokens/offsets in a tokenStream is so that consumers (including indirect consumers, e.g. via postings at query-time) know what to expect. Even an index that's not strictly-speaking "corrupt" can be made more useful/efficient if types of order that can be enforced are enforced.

Two main questions occur to me:

  1. Are there use cases that truly cannot be supported (even in principle, never mind with the current state of analysis components) with strict ordering constraints based on token offsets, positions, and maybe positionLengths?
  2. Is there enough existing functionality that people have built around the historic lack of constraints that the horse has left the barn, and the ability to toggle this behavior off should be provided, absent some practical compulsion otherwise (e.g., actual index incompatibility/corruption, as opposed to simply sanity-checking input)?

Perhaps a bit off-topic, but ideally I could see:

  1. indexed positionLength
  2. strict ordering of tokens by increasing position (this is the case by definition, I think?), and for a given position, order by increasing positionLength
  3. offsets compatible with positions such that the above position-based ordering would also result in ordering of tokens by increasing startOffset (perhaps even adding the constraint that for a given startOffset, endOffset would never decrease?)

[Legacy Jira: Michael Gibney (@magibney) on Aug 14 2020]

mikemccand commented 4 years ago

The light-emitting-diode token is repeated a 2nd time (at position 3) so that the span/phrase query "light-emitting-diode glows" works correctly?

This is the way I understood the original example at least?

But then what about the span/phrase query "organic light-emitting-diode glows", which ought to match but I think even for your workaround (double-indexing light-emitting-diode) will not then work?

I think you're correct.

Yet, Lucene already offers an accurate way to solve all of this, at query time, by properly consuming the token graph output after tokenizing a query (including positionLength of the tokens) and creating a correct query such that all of the above examples would work correctly, without producing two or more light-emitting-diode tokens

I never played with such complex graphs (redface). How would this work in indexing/ at query time? Can you write up a test case for the above, Mike?

[Legacy Jira: Dawid Weiss (@dweiss) on Aug 14 2020]

mikemccand commented 4 years ago

@mikemccand The solution we had worked in our case and satisfied several thousand power users for their daily needs, because they did not care about the three-word adjacency issue.

I wonder whether we are over-complicating this issue.

At index time we and others generate hidden tokens. Our purpose is to match certain span queries. Others (like [~roman.chyla@gmail.com]) may do it for multi-word synonyms. There are so many use cases for hidden tokens that we all know. When we generate such hidden tokens, we have to make sure that highlighting still works correctly. To this effect, we need to have full control over setting the offsets. The "offsets must not decrease" rule took away our power/control over highlighting.

Are you sure you cannot use the query-time solution, so you get correct positional queries? You are asking us to remove the IndexWriter offset checks so your workaround by double-tokenizing (which is not always correct?) can work again?

 

If you scroll above, you can see that we actually discussed this a year ago. It gets complicated because of wildcards and adjacency. An "impossibility" criterion is always difficult to prove.

[Legacy Jira: Ram Venkat on Aug 14 2020]

mikemccand commented 4 years ago

@magibney I'll dig for more concrete examples, but when you pointed out https://issues.apache.org/jira/browse/LUCENE-4312 - that position length is not indexed – then it becomes a little bit of an academic exercise. Because if position length is not indexed, then there is no hope of reconstructing the token stream; the highlighter has no information about what tokens were part of the multi-token synonym. The position length could substitute for offsets, but I'm also showing in the last example where it fails (it doesn't have access to the stop words that got removed).

 

So relying on position length, one cannot know the span of the tokens that made up the resulting multi-token synonym (in other words: position length cannot substitute for offsets, even though it is proposed to solve the same thing).

 

Check this place in the indexing chain – the previous (invertState) start and end offsets play a role in the checks - multi-token synonyms always span more words, so in their case (to pass here) their offsets must be trimmed (or the offsets of the words that made them enlarged – but that would also be weird):

https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/index/DefaultIndexingChain.java#L917

 

Finally, in my experience query-time synonym expansion is not enough - we have to do both. Yes, it is very subtle, but I have tried hard not to (but I can be proven wrong, if somebody tries harder :))

 

here is the analyzer chain: https://github.com/romanchyla/montysolr/blob/master/contrib/examples/adsabs/server/solr/collection1/conf/schema.xml#L406

 

and the accompanying unittest: https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/test/org/apache/solr/analysis/TestAdsabsTypeFulltextParsing.java#L62

 

on that line you'll see the reasoning why we are doing both query and index-time expansion

 

we also do query optimization - that some might find crazy - to avoid score inflation (maybe you have noticed a situation when a query expansion emits multiple synonyms; the score by default is a product of the many synonyms – you could call that score inflation). Because we can't rely on index-time expansion fully, we'll also expand at query time and pick from synonyms the one that is more frequent or less frequent (based on the user query); this optimization is only possible because we have indexed them all – I don't think you could do that in query time only

https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/test/org/apache/solr/analysis/TestAdsabsTypeFulltextParsing.java#L293

 https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/test/org/apache/solr/analysis/TestAdsabsTypeFulltextParsing.java#L377

 

But I seem to be digressing, these things are only important if we were inclined to believe that query-time synonym expansion can solve it for all.

[Legacy Jira: Roman Chyla on Aug 14 2020 [updated: Oct 26 2021]]

mikemccand commented 4 years ago

(from my message to the dev list last week): Where I work, I've seen an interesting approach to mixed language text analysis in which a sophisticated Tokenizer effectively re-tokenizes an input multiple ways by producing a token stream that is a concatenation of different interpretations of the input. On a Lucene upgrade, we had to "coarsen" the offsets to the point of having highlights that point to a whole sentence instead of the words in that sentence :-(

I'll make up an example in a generic way. Our actual tests are unintelligible to me as they use a bunch of CJK characters that are foreign to me :-)

Imagine input chars "abcdef" and two tokenizers, each emitting simple linear chains (no graphs). Tokenizer1 emits: "abc", "def". Tokenizer2 emits: "ab", "cd", "ef". I believe it's impossible to combine both tokenizers in such a way that preserves both position and offset constraints. One of them has to break. If Lucene let offsets go backwards (as it used to), we would simply concatenate the streams – tokenizer1's tokens then tokenizer2's tokens. It's important that someone writing in the "language" of tokenizer1 can do position-sensitive queries for, say, "abcdef", and likewise for someone writing in the language of tokenizer2 to do queries of "abcd" or "cdef" and have them work.

[Legacy Jira: David Smiley (@dsmiley) on Aug 14 2020]

mikemccand commented 4 years ago

@dsmiley  I am pretty sure your example would actually work today.

You would just need your tokenizer to interleave the tokens.

Remember that there is no requirement that all tokens in one sub-path be enumerated before all tokens in another sub-path.  So your tokenizer would emit something like this: abc, ab, cd, def, ef.  This permutation should also work: ab, abc, cd, def, ef.
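
Concretely, something like this (a sketch with one possible assignment of positions and offsets; Token / CannedTokenStream from the test framework are used only to keep the example short):

```java
import org.apache.lucene.analysis.CannedTokenStream;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;

public class InterleavedTokenizersExample {
  // Tokenizer1 sees "abc"/"def", Tokenizer2 sees "ab"/"cd"/"ef"; interleaving them by
  // startOffset keeps offsets non-decreasing, so IndexWriter accepts the stream.
  public static TokenStream interleaved() {
    Token abc = new Token("abc", 0, 3);
    Token ab  = new Token("ab", 0, 2);
    ab.setPositionIncrement(0);   // stacked on "abc"
    Token cd  = new Token("cd", 2, 4);
    Token def = new Token("def", 3, 6);
    Token ef  = new Token("ef", 4, 6);
    ef.setPositionIncrement(0);   // stacked on "def"
    return new CannedTokenStream(abc, ab, cd, def, ef);
  }
}
```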

Of course, since Lucene does not record position length in the index, in order to get something close to reasonable for positional queries, you would need to use FlattenGraphFilter or something equivalent to squash the tokens from the two side paths on top of each other.

I think there is an issue where someone made a prototype that indexed position length into payloads and then did the right thing at search time.  There was also a talk about it at Buzzwords last summer, I think.  Does anyone know this issue?

If you ask git to enumerate all commits, sorted by commit timestamp, how does it serialize its DAG?  I think in general it is always possible for git to do this (correctly sorted), regardless of how complex the commit DAG turns out, and should then also work for Lucene token streams, as long as the token stream is free to assign positions, similarly to how git is free to assign commit hashes.  And Lucene's offsets are like git's timestamps.  I think?

How would this work in indexing/ at query time? Can you write up a test case for the above, Mike?

Actually, @jimczi is the expert on how our queryparsers respect the full token graph and produce correct queries!

A test case should not be hard to make – create a query parser with an analyzer that returns a CannedTokenStream producing the full token graph, then parse any text and see what query it produces?

[Legacy Jira: Michael McCandless (@mikemccand) on Aug 14 2020]

mikemccand commented 4 years ago

I think there is an issue where someone made a prototype that indexed position length into payloads and then did the right thing at search time. There was also a talk about it at Buzzwords last summer, I think.

@mikemccand I think you're talking about the work I put in on LUCENE-7398 (and the related talk at Buzzwords). Quick update there: that work has been ported forward through 8.3 (due to be ported forward again soon I hope), and it's been solidly running in production for several years. That branch has a bunch of test cases added to TestSpanCollection.java that are relevant to the discussion here (particularly wrt the problem that the double-emitted "light-emitting-diode" token workaround seeks to address).

I'm curious about the multi-tokenizer use case; off the top of my head I can see that it would save some terms dictionary space (for tokenizers that generate a lot of tokens in common), and could help reduce overall field count, if that's an issue. Single-language positional queries (e.g. spans, intervals) should work, but mixed-language positional queries would not (not a problem, but could be useful in some circumstances?). Interleaved tokens (as opposed to concatenated streams) at query-time would probably work, and would probably "work" at index-time in the sense that they could be indexed within the constraints in DefaultIndexingChain – but index-time interleaved tokens would I think hit unexpected behavior pretty quickly for positional queries (spans, intervals) unless the graph structure (i.e., at least positionLength) were indexed ...

position length cannot substitute offsets; even though it is proposed to solve the same thing

Indeed, positionLength is not a general-purpose substitute for offsets; I never intended to suggest otherwise. But LUCENE-4312 and the LUCENE-7398 work is relevant to this issue because it presents a more robust alternative to Ram's double-token-emission workaround for positional queries.

Roman, I gather that your case is different, but unfortunately I still can't really pinpoint the problem. I see the checks in the DefaultIndexingChain code, I see that the previous invertState startPosition is part of validation, I acknowledge the need for both index-time and query-time expansion (this was the main motivation for my work on LUCENE-7398!)... but the startOffsets in the examples you sent don't decrease, and the endOffset of each token is greater than its startOffset, so I don't see anything that would violate the constraints in DefaultIndexingChain. I was hoping that seeing your actual analysis chain would help reproduce your issue locally, but it's not as straightforward as I'd hoped, given the number of custom analysis components (I also note, without thinking through possible implications, the lack of FlattenGraphFilter, and the fact that your word-delimiter and synonym filters seem (?) to be monkey-patched variants of "pre-graph" WordDelimiterFilter and SynonymFilter).

[Legacy Jira: Michael Gibney (@magibney) on Aug 17 2020]

mikemccand commented 4 years ago

@magibney the original issue in this ticket is about the ability to run span queries and highlight them properly. As for positions, the posLen attribute solves the search part (well, if it were indexed - which it is not); but it cannot solve the highlight part - at least I don't see how it could. And I have shown that in this example:

 

term=massachusetts posInc=1 posLen=1 type=word offsetStart=0 offsetEnd=12
 term=syn::massachusetts institute of technology posInc=0 posLen=4 type=SYNONYM offsetStart=0 offsetEnd=36
 term=syn::mit posInc=0 posLen=4 type=SYNONYM offsetStart=0 offsetEnd=36
 term=acr::mit posInc=0 posLen=4 type=ACRONYM offsetStart=0 offsetEnd=36
 term=institute posInc=1 posLen=1 type=word offsetStart=13 offsetEnd=22
 term=technology posInc=1 posLen=1 type=word offsetStart=26 offsetEnd=36
 term=antidesitter posInc=1 posLen=1 type=word offsetStart=41 offsetEnd=53
 term=syn::ads posInc=0 posLen=2 type=SYNONYM offsetStart=41 offsetEnd=59
 term=acr::ads posInc=0 posLen=2 type=ACRONYM offsetStart=41 offsetEnd=59
 term=syn::anti de sitter space posInc=0 posLen=2 type=SYNONYM offsetStart=41 offsetEnd=59
 term=syn::antidesitter spacetime posInc=0 posLen=2 type=SYNONYM offsetStart=41 offsetEnd=59
 term=syn::antidesitter space posInc=0 posLen=2 type=SYNONYM offsetStart=41 offsetEnd=59
 term=space posInc=1 posLen=1 type=word offsetStart=54 offsetEnd=59
 term=syn::universe posInc=0 posLen=1 type=SYNONYM offsetStart=54 offsetEnd=59
 term=time posInc=1 posLen=1 type=word offsetStart=60 offsetEnd=64
 term=spacetime posInc=0 posLen=1 type=word offsetStart=54 offsetEnd=64

 

Notice the posLen=4 of MIT; the position length attribute is **wrong** because of stopwords. It would cover the tokens massachusetts institute technology antidesitter while the offsets are still correct.

 

That our chain is complex is true, but I don't think it matters at all, because the tokenizer is producing the stream of tokens listed above – and it is "broken" only in the sense that backward offsets are disallowed (none of our filters are broken in the sense of being 'buggy' - they are broken in the sense of producing what is now deemed "illegal"). And yet the "illegal" in this case, I and some other people would posit, is the "correct" behavior for expert use cases. The discussion will hopefully help revisit the problem.

 

If I were to add the flatten filter, I'd be able to index the tokens because the filter trims offsets (makes them only grow, never go back). But if you think about it, that is an impossible proposition with multi-token synonyms if at the same time you want to index the individual tokens that make them up.

 

1: Massachusetts | Massachusetts Institute of Technology

2: institute

3: technology

 

'MIT' spans three indexed-token positions – but it also spans *four original token positions* ("Massachusetts Institute of Technology").

And if there were two parallel filters inside one tokenizer chain, each of them picking different tokens and paying attention to different stopwords – as @dsmiley described - I don't see how their output could be interlaced to progress in a linear fashion (to not trip offset asserts).

 

BTW: The output example is produced by Lucene 6.x; when ported to Lucene 7.x it produces exactly the same stream - but it trips https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/index/DefaultIndexingChain.java#L917

[Legacy Jira: Roman Chyla on Aug 17 2020 [updated: Oct 26 2021]]

mikemccand commented 4 years ago

Roman, I think we're on the same page re: positionLength. I read the original case on this issue (re: double-emission of tokens) as a sympathetic type of "XY problem", and I'm suggesting that indexing positionLength (LUCENE-4312 – as opposed to simply using unindexed positionLength) would be a better fundamental way to address the "Y" case (working positional queries) than accommodating the "X" case (double-emission of tokens, which may be "less" broken, but afaict is still broken in its own way).

I also need to apologize, I was indeed overlooking something: although the asterisked terms in the examples you shared above still don't seem problematic to me (and I still see no problem with the "THE HUBBLE constant: a summary of the hubble space telescope program" example), I see that the latter two examples ("MIT and anti de sitter space-time" and "Massachusetts Institute of Technology and antidesitter space-time") each have one (and only one?) problem: a backward startOffset on the last token. A couple of random thoughts on that:

  1. because the positionIncrement of each of these is "0", it would be possible in principle to swap with the preceding token to satisfy the constraints enforced by DefaultIndexingChain. This isn't an argument that the issue is irrelevant; rather, it's a wish for another example that can't in principle be "solved" in such a way.
  2. In the "gut feeling" department: I'm a little wary of this being on the last token (tokenStream components can exhibit unusual behavior at the beginning/end). If I were troubleshooting, I'd probably first add an extra term at the end of each input field value and see how this affects things, just as a sanity check before digging deeper.

FWIW, I was asking about the analysis chain for context; it's not so much the complexity of the analysis chain that prevents me from trying to reproduce locally as the fact that it uses several custom components (and some of those based on deprecated implementations) ...

[Legacy Jira: Michael Gibney (@magibney) on Aug 17 2020]

mikemccand commented 4 years ago

Michael, I appreciate your thinking about the issue - it's always a pleasure to meet somebody who changes his/her mind based on evidence. As far as I can tell, the backward-going offsets are indeed the 'only' problem – but unfortunately of some 'fundamental' category. It doesn't square nicely with the desire to efficiently store offsets using deltas.

I can reassure you that the problem doesn't appear only on first or last positions. The unit tests linked from my post have specific examples for that (even searching for phrases embedded in/surrounded by other tokens).

The swapping - unless I'm missing something important - will not work (for the multi-token situation), because the next token is going to be checked against the offsets of the last emitted token. So wherever the multi-token gets emitted (first, last, or even in the middle), it will trip the wire for the surrounding tokens (or the already-emitted tokens will trip the wire for the multi-word token).

 

Funnily enough, swapping the 0th position could trigger another issue (but only for the very first token in the stream): https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/index/DefaultIndexingChain.java#L897

[Legacy Jira: Roman Chyla on Aug 18 2020]

mikemccand commented 4 years ago

@magibney now it's my turn to apologize; all this time I have missed that the WordDelimiter factory was involved. Well, the problems shown in my two examples above are real (the positionLength spans 4 tokens that don't belong together), but I now have to question my understanding of everything else. See my email to @mikemccand from a few moments ago (the word-delimiter factory is involved; to my slight relief, that's a stock Solr version).

 

from the email:

 

Hi Mike,

I'm sorry, the problem all along is related to a word-delimiter filter factory. This is embarrassing but I have to admit it publicly and self-flagellate.

A word-delimiter filter is used to split tokens; these are then used to find multi-token synonyms (hence the connection). In my desire to simplify, I omitted that detail while writing my first email.

I went to generate the stack trace:

```java
assertU(adoc("id", "603", "bibcode", "xxxxxxxxxx603",
        "title", "THE HUBBLE constant: a summary of the HUBBLE SPACE TELESCOPE program"));
```

 

stage:indexer term=xxxxxxxxxx603 pos=1 type=word offsetStart=0 offsetEnd=13
stage:indexer term=acr::the pos=1 type=ACRONYM offsetStart=0 offsetEnd=3
stage:indexer term=hubble pos=1 type=word offsetStart=4 offsetEnd=10
stage:indexer term=acr::hubble pos=0 type=ACRONYM offsetStart=4 offsetEnd=10
stage:indexer term=constant pos=1 type=word offsetStart=11 offsetEnd=20
stage:indexer term=summary pos=1 type=word offsetStart=23 offsetEnd=30
stage:indexer term=hubble pos=1 type=word offsetStart=38 offsetEnd=44
stage:indexer term=syn::hubble space telescope pos=0 type=SYNONYM offsetStart=38 offsetEnd=60
stage:indexer term=syn::hst pos=0 type=SYNONYM offsetStart=38 offsetEnd=60
stage:indexer term=space pos=1 type=word offsetStart=45 offsetEnd=50
stage:indexer term=telescope pos=1 type=word offsetStart=51 offsetEnd=60
stage:indexer term=program pos=1 type=word offsetStart=61 offsetEnd=68

That worked; only the next one failed:

```java
assertU(adoc("id", "605", "bibcode", "xxxxxxxxxx604",
        "title", "MIT and anti de sitter space-time"));
```

stage:indexer term=xxxxxxxxxx604 pos=1 type=word offsetStart=0 offsetEnd=13
stage:indexer term=mit pos=1 type=word offsetStart=0 offsetEnd=3
stage:indexer term=acr::mit pos=0 type=ACRONYM offsetStart=0 offsetEnd=3
stage:indexer term=syn::massachusetts institute of technology pos=0 type=SYNONYM offsetStart=0 offsetEnd=3
stage:indexer term=syn::mit pos=0 type=SYNONYM offsetStart=0 offsetEnd=3
stage:indexer term=anti pos=1 type=word offsetStart=8 offsetEnd=12
stage:indexer term=syn::ads pos=0 type=SYNONYM offsetStart=8 offsetEnd=28
stage:indexer term=syn::anti de sitter space pos=0 type=SYNONYM offsetStart=8 offsetEnd=28
stage:indexer term=syn::antidesitter spacetime pos=0 type=SYNONYM offsetStart=8 offsetEnd=28
stage:indexer term=de pos=1 type=word offsetStart=13 offsetEnd=15
stage:indexer term=sitter pos=1 type=word offsetStart=16 offsetEnd=22
stage:indexer term=space pos=1 type=word offsetStart=23 offsetEnd=28
stage:indexer term=time pos=1 type=word offsetStart=29 offsetEnd=33
stage:indexer term=spacetime pos=0 type=word offsetStart=23 offsetEnd=33
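
Running the start offsets from the trace above through that same rule (a hedged, stand-alone sketch; the offsets are copied from the trace, and only the "must not go backwards" check from the exception message is applied) pinpoints the failure: `time` leaves lastStartOffset=29, and the following `spacetime` token arrives with startOffset=23:

```java
// Hedged sketch: start offsets in emission order, copied from the indexer trace above;
// the final value (23) belongs to the posInc=0 "spacetime" token.
public class TraceOffsets {
    public static void main(String[] args) {
        int[] starts = {0, 0, 0, 0, 0, 8, 8, 8, 8, 13, 16, 23, 29, 23};
        int lastStart = -1;
        for (int s : starts) {
            if (s < lastStart) {
                // Prints: offsets go backwards: startOffset=23 lastStartOffset=29
                // -- the same values reported in the stack trace below.
                System.out.println("offsets go backwards: startOffset=" + s
                        + " lastStartOffset=" + lastStart);
                return;
            }
            lastStart = s;
        }
    }
}
```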

Stack trace:

325677 ERROR (TEST-TestAdsabsTypeFulltextParsing.testNoSynChain-seed#[ADFAB495DA8F6F40]) [    ] o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: Exception writing document id 605 to the index; possible analysis error: startOffset must be non-negative, and endOffset must be >= startOffset, and offsets must not go backwards startOffset=23,endOffset=33,lastStartOffset=29 for field 'title'
at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:242)
at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:67)
at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:55)
at org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:1002)
at org.apache.solr.update.processor.DistributedUpdateProcessor.doVersionAdd(DistributedUpdateProcessor.java:1233)
at org.apache.solr.update.processor.DistributedUpdateProcessor.lambda$2(DistributedUpdateProcessor.java:1082)
at org.apache.solr.update.VersionBucket.runWithLock(VersionBucket.java:50)
at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:1082)
at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:694)
at org.apache.solr.update.processor.LogUpdateProcessorFactory$LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:103)
at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:261)
at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:188)
at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:97)
at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:68)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:199)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:2551)
at org.apache.solr.servlet.DirectSolrConnection.request(DirectSolrConnection.java:125)
at org.apache.solr.util.TestHarness.update(TestHarness.java:285)
at org.apache.solr.util.BaseTestHarness.checkUpdateStatus(BaseTestHarness.java:274)
at org.apache.solr.util.BaseTestHarness.validateUpdate(BaseTestHarness.java:244)
at org.apache.solr.SolrTestCaseJ4.checkUpdateU(SolrTestCaseJ4.java:874)
at org.apache.solr.SolrTestCaseJ4.assertU(SolrTestCaseJ4.java:853)
at org.apache.solr.SolrTestCaseJ4.assertU(SolrTestCaseJ4.java:847)
at org.apache.solr.analysis.TestAdsabsTypeFulltextParsing.setUp(TestAdsabsTypeFulltextParsing.java:223)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1750)
at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:972)
at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:988)
at com.carrotsearch.randomizedtesting.rules.SystemPropertiesRestoreRule$1.evaluate(SystemPropertiesRestoreRule.java:57)
at org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:49)
at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
at org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:48)
at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:817)
at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:468)
at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:947)
at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:832)
at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:883)
at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:894)
at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
at com.carrotsearch.randomizedtesting.rules.SystemPropertiesRestoreRule$1.evaluate(SystemPropertiesRestoreRule.java:57)
at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
at org.apache.lucene.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:41)
at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
at org.apache.lucene.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
at org.apache.lucene.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:54)
at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.IllegalArgumentException: startOffset must be non-negative, and endOffset must be >= startOffset, and offsets must not go backwards startOffset=23,endOffset=33,lastStartOffset=29 for field 'title'
at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:823)
at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:430)
at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:394)
at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:251)
at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:494)
at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1616)
at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1608)
at org.apache.solr.update.DirectUpdateHandler2.updateDocOrDocValues(DirectUpdateHandler2.java:969)
at org.apache.solr.update.DirectUpdateHandler2.doNormalUpdate(DirectUpdateHandler2.java:341)
at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:288)
at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:235)
... 61 more

[Legacy Jira: Roman Chyla on Aug 19 2020 [updated: Oct 26 2021]]