Duplicates created from search matches when layer uses link feature

antonyscerri commented 3 years ago

Describe the bug When creating multiple annotations from a search (based on words), if the layer used for the selected annotation is configured with a link feature, two annotations are created per span in the search results. The span of the selected annotation also gets two new instances. If you enable "Override existing Annotations", you get a single new instance per span and the selected annotation gets no additional ones.

When you try to delete them, you have to disable "Delete only matching feature values"; otherwise it fails to match the newly created annotations.

To Reproduce Steps to reproduce the behavior:

1. Create a layer using a link feature
2. Open a document with multiple instances of a word
3. Select the layer with the link feature
4. Mark the first instance of the word (all values can be left as-is)
5. Open search and search on the chosen word
6. Click "create"
7. Clicking create additional times will create 2 more instances per match

Expected behavior The same behaviour as when you do not use a link feature

Please complete the following information:

Version and build ID: 0.19.3 (2021-04-25 09:58:11, build 238b0e91)

Additional context I suspect it's some kind of equality check on the object when using a link feature, although this doesn't quite explain why you get double every time, unless it's causing some issue with the search and matching results.

Todo

[ ] Do not consider link features when comparing annotations for bulk operations
[ ] Do not fill in link features in bulk-created annotations

antonyscerri commented 3 years ago

It should also be noted that if you create several sets of annotations (for example, by clicking create twice), then search and click on one of the newly created annotations (for example, the bottom one in the stack), disable the "matching feature values" option and click delete, a number of them will be removed per match, but not all. You can then click a remaining one and click delete again to remove another set.

antonyscerri commented 3 years ago

I've set up a local dev env and tried using the latest code. I was able to pin it down more precisely whilst setting things up: the issue occurs when creating annotations where the layer has a link feature WITH a default slot value defined. The annotation editor then displays "Click to activate" against the default slot role, without you having to select anything for it. So step one above also requires defining the tag set and picking a default slot value.

Also, the indexing seemed to fail and get stuck whilst I was testing. This only occurred once out of the three startups and testing runs I went through to pin down the issue.

org.apache.lucene.store.LockObtainFailedException: Lock held by this virtual machine: /Users/scerria/.inception/repository/project/0/indexMtas/write.lock
    at org.apache.lucene.store.NativeFSLockFactory.obtainFSLock(NativeFSLockFactory.java:139) ~[lucene-core-7.7.3.jar!/:7.7.3 1a0d2a901dfec93676b0fe8be425101ceb754b85 - noble - 2020-04-21 10:31:55]
    at org.apache.lucene.store.FSLockFactory.obtainLock(FSLockFactory.java:41) ~[lucene-core-7.7.3.jar!/:7.7.3 1a0d2a901dfec93676b0fe8be425101ceb754b85 - noble - 2020-04-21 10:31:55]
    at org.apache.lucene.store.BaseDirectory.obtainLock(BaseDirectory.java:45) ~[lucene-core-7.7.3.jar!/:7.7.3 1a0d2a901dfec93676b0fe8be425101ceb754b85 - noble - 2020-04-21 10:31:55]
    at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:728) ~[lucene-core-7.7.3.jar!/:7.7.3 1a0d2a901dfec93676b0fe8be425101ceb754b85 - noble - 2020-04-21 10:31:55]
    at de.tudarmstadt.ukp.inception.search.index.mtas.MtasDocumentIndex.getIndexWriter(MtasDocumentIndex.java:225) ~[inception-search-mtas-0.21.0-SNAPSHOT.jar!/:?]

antonyscerri commented 3 years ago

I've only observed this with the latest build from source, but whilst testing this issue further I ran into duplicates in the search results list. I had been getting 13 results for a given search, and it suddenly jumped to 15 after deleting some annotations. The search itself was only looking for words/tokens, so it was not dependent on any other annotations. From what I can tell, it looks like one document has been indexed twice without its old copy being removed. Issuing a reindex removed these extra duplicates. There were no errors in the log when this occurred.

reckart commented 3 years ago

I think you are hitting this issue: https://github.com/inception-project/inception/issues/2252

antonyscerri commented 3 years ago

I think you are hitting this issue: #2252

I had not issued a reindex myself; I'd simply done a delete annotations from the search results. I just hit the problem again: I had done a word search and found 13 results, then deleted all matching annotations, which reported 21 deletions because of the duplicates I had on the spans I wanted to remove. After waiting for the indexing to finish, I reran the same search and had 14 results. Is this occurring when a document has multiple matches and each removal triggers an index separately?

From a brief look at the indexing code, it doesn't look like indexing a document and removing its old instance is an atomic action. Some index calls look like they do a lookup before adding their new copy, but if another task indexes its copy in between, that copy might not be deleted because it will have a different timestamp. If this is the case, it also may not be guaranteed that the latest copy of the document is the one left after multiple calls. If the timestamp is based upon when the document's CAS is updated, then hopefully checking for any instances and removing all but the one with the latest timestamp (which could include the document that's just been added) would address this.
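The cleanup idea above ("remove all but the latest timestamp") can be sketched roughly as follows. This is only an illustration under stated assumptions: `IndexedCopy` and `StaleCopyCleanup` are hypothetical names, not INCEpTION or Lucene types, and the real index would need to run the equivalent logic against its stored documents.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for one indexed copy of a source document.
record IndexedCopy(long documentId, long timestamp) {}

class StaleCopyCleanup {
    // Returns the copies that should be deleted: for each document id,
    // every copy except the one with the highest timestamp.
    static List<IndexedCopy> staleCopies(List<IndexedCopy> copies) {
        Map<Long, IndexedCopy> latest = new HashMap<>();
        for (IndexedCopy copy : copies) {
            // Keep whichever copy is newer; on a tie the first one seen wins.
            latest.merge(copy.documentId(), copy,
                    (a, b) -> a.timestamp() >= b.timestamp() ? a : b);
        }
        List<IndexedCopy> stale = new ArrayList<>();
        for (IndexedCopy copy : copies) {
            // Identity check so that exact duplicates of the winner are also removed.
            if (latest.get(copy.documentId()) != copy) {
                stale.add(copy);
            }
        }
        return stale;
    }
}
```

Running such a pass after each add would leave exactly one copy per document regardless of the order in which concurrent tasks wrote theirs, provided the timestamp reflects the CAS update time rather than the indexing time.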

reckart commented 3 years ago

The indexing tasks should be set up such that they do not run in parallel on the same index via the de.tudarmstadt.ukp.inception.scheduling.MatchableTask.matches() which they implement. But it seems like there is maybe a conceptual issue or at least a bug here which causes semantically overlapping indexing tasks to be run in parallel. The previous code didn't have these issues because it only used a single indexing task - and possibly because the matching was handled better. Right now, there should be "eventual consistency", but it seems we have a significant period of inconsistency on the way there.
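As a rough illustration of the matching idea, a queue could refuse a new task when an equivalent one is already pending. Everything in this sketch is an assumption: `IndexTask`, its `matches` method and `CoalescingQueue` are hypothetical and do not reproduce the actual `MatchableTask` contract or the INCEpTION scheduler.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical task that re-indexes one document of one project.
record IndexTask(long projectId, long documentId) {
    // Two tasks "match" when they would touch the same document in the same index.
    boolean matches(IndexTask other) {
        return projectId == other.projectId && documentId == other.documentId;
    }
}

class CoalescingQueue {
    private final Deque<IndexTask> pending = new ArrayDeque<>();

    // Adds the task unless a matching task is already waiting; returns whether it was added.
    synchronized boolean enqueue(IndexTask task) {
        for (IndexTask queued : pending) {
            if (queued.matches(task)) {
                return false;
            }
        }
        pending.addLast(task);
        return true;
    }

    synchronized int size() {
        return pending.size();
    }
}
```

Note that this only covers tasks that are still queued. A task that is already running on the same document would also have to be taken into account to rule out the overlap described above.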

antonyscerri commented 3 years ago

So back to the original issue. It appears the search results provider and page caching may be the cause. It looks like it's creating a cache entry for all results, and then when displaying the results it grabs each pagination list and caches that. Later, when creating (or deleting) the annotations, it gets a grouped list across all cached pages, which includes the original set of all results. When combined with use of the default slot for the link feature, this makes them appear as distinct annotations and creates multiple per span. The use of FSArray for the link feature is also causing a problem because it doesn't support an effective equals method; FSArrayList might be a better choice for that. Of course any differences in order will cause them not to match, but for this purpose, assuming things are copied in order, it should work. Of course, whether annotations with (active) links should be created in bulk at all is another question; it may not be a valid behaviour.
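A minimal sketch of how hits collected from several cached pages could be de-duplicated before a bulk operation. The `Hit` record and `HitDeduplication` class are hypothetical and only model a hit by document and offsets; the actual search result types in the code base differ.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

// Hypothetical search hit, identified by its document and character offsets.
record Hit(long documentId, int begin, int end) {}

class HitDeduplication {
    // Flattens all cached pages into one list that contains each hit only once,
    // keeping the order in which hits were first encountered.
    static List<Hit> distinctHits(List<List<Hit>> cachedPages) {
        LinkedHashSet<Hit> seen = new LinkedHashSet<>();
        for (List<Hit> page : cachedPages) {
            seen.addAll(page);
        }
        return new ArrayList<>(seen);
    }
}
```

Records compare by value, so the same span appearing in both the "all results" cache entry and a page entry collapses into one hit.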

reckart commented 3 years ago

If you have time, maybe you could see how you fare with this PR: https://github.com/inception-project/inception/pull/2460

reckart commented 3 years ago

The use of FSArray for the link feature is also causing a problem because it doesnt support an effective equals method, FSArrayList might be a better choice for that. Of course any differences in order will cause them not to match, but for this purpose assuming things are copied in order it should work.

Where did you hit the FSArray? As far as I can see, when featureValuesMatchCurrentState() is called, it obtains the feature value for the slot feature via a SpanAdapter which in turn delegates to a SlotFeatureSupport - and that converts the FSArray to an ArrayList<LinkWithRoleModel>.

antonyscerri commented 3 years ago

I can confirm the adapter is changing the objects to ArrayList, I had only looked at the raw objects before and had overlooked the adapter. So it looks like it might be when it gets the link feature from the result hit annotation to compare to the model object, it gets an empty list, but the model object has an array with an item which reflects the default link role. So when the default role is not populated the UI does appear to contain an object reflecting that in the link feature list. The stored annotations in the CAS though do not store a corresponding item. So that's why the match fails.

And because of the other issue regarding the caching of results and there being two instances of the same hit, they both fail the check resulting in a new item being added.
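The mismatch described above (an unfilled default slot in the UI model versus an empty list in the CAS) could be smoothed over by dropping unfilled placeholders before comparing. This is a sketch under assumptions: the `Link` record and its `-1` marker for an unfilled target are hypothetical stand-ins, not the real `LinkWithRoleModel`.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical link entry: a role plus the address of the target annotation.
record Link(String role, int targetAddr) {
    // Assumed marker for a slot that has a role but no target yet.
    static final int UNFILLED = -1;

    boolean isFilled() {
        return targetAddr != UNFILLED;
    }
}

class SlotComparison {
    // Removes placeholder entries such as a pre-populated default role.
    static List<Link> filledOnly(List<Link> links) {
        return links.stream().filter(Link::isFilled).collect(Collectors.toList());
    }

    // Compares two slot lists while ignoring unfilled placeholders on either side.
    static boolean slotsMatch(List<Link> fromUiModel, List<Link> fromCas) {
        return filledOnly(fromUiModel).equals(filledOnly(fromCas));
    }
}
```

With this normalisation, a model list containing only the default-role placeholder compares equal to the empty list read back from a stored annotation.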

antonyscerri commented 3 years ago

I wondered if it could use the selected annotation directly from the CAS document, rather than the UI model? Given you wouldn't be able to change the UI feature field values without them being persisted, the annotation from the CAS would be up to date and would avoid the need to work out any difference from the model representation. I gave it a quick try, retrieving the AnnotationFS in actionApplyToSelectedResults for the selected annotation and then passing it in the bulk operation. Then in featureValuesMatchCurrentState it was just a case of retrieving the feature from the pair of annotations (based on the model to get the feature name). It looked like it worked for the most part. I had some count differences between the results and the number created, but I think this was due to duplicate (and maybe also missing) indexing issues. After a reindex I was able to do a few more across a corpus and it seemed to work.

reckart commented 3 years ago

TBH I think that slot features should be completely ignored when comparing annotations for the purpose of bulk-annotation. And not only for comparing, but also for applying. I think the case that a user really wants to annotate all matches with the same slot fillers as the original is pretty rare, if at all existent.
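A rough sketch of what ignoring slot features during comparison could look like, assuming feature values are available as a simple name-to-value map. `BulkComparison`, the map representation and the set of link feature names are all hypothetical and not taken from the code base.

```java
import java.util.Map;
import java.util.Objects;
import java.util.Set;

class BulkComparison {
    // Returns true when every non-link feature of the selected annotation
    // has the same value on the candidate annotation.
    static boolean matchesIgnoring(Map<String, Object> selected,
            Map<String, Object> candidate, Set<String> linkFeatures) {
        for (Map.Entry<String, Object> entry : selected.entrySet()) {
            if (linkFeatures.contains(entry.getKey())) {
                continue; // link/slot features do not take part in the comparison
            }
            if (!Objects.equals(entry.getValue(), candidate.get(entry.getKey()))) {
                return false;
            }
        }
        return true;
    }
}
```

The same set of feature names could be used to skip link features when copying values onto bulk-created annotations.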

antonyscerri commented 3 years ago

It might be the case you need some control as to whether to ignore them or not, so you could filter down, if that's possible with the query to match those with or without the link feature it could be done that way. But yes in general ignoring them might be useful.

In terms of ensuring the feature matching works in general, would a switch to using the CAS AnnotationFS rather than the UI model be worth considering? (I don't know whether there might be other layer setups or conditions where the model may differ in a similar way.) I could put a PR together for the change I made if it's of interest.

reckart commented 3 years ago

I believe that optimally, semantic operations (such as comparisons) should be performed on the converted values from the adapters and not directly on the CAS objects. This indirection allows us to smooth over details such as order of elements in arrays if desired - also as you have pointed out, checking equality directly on the FSArrays isn't necessarily working well.
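As one possible illustration of smoothing over element order, converted values could be compared as multisets rather than as ordered lists. This is a generic sketch, not code from the adapters; `UnorderedComparison` is a hypothetical helper.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class UnorderedComparison {
    // Counts how often each value occurs in the list.
    static <T> Map<T, Integer> counts(List<T> values) {
        Map<T, Integer> result = new HashMap<>();
        for (T value : values) {
            result.merge(value, 1, Integer::sum);
        }
        return result;
    }

    // True when both lists hold the same elements with the same multiplicities,
    // regardless of order.
    static <T> boolean sameElements(List<T> a, List<T> b) {
        return counts(a).equals(counts(b));
    }
}
```

This relies on the element type having a value-based `equals` and `hashCode`, which is the case for converted model objects that define them but not for raw feature structures compared by identity.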

antonyscerri commented 3 years ago

Array item order has the potential to need sorting which like you say the adapter can do. In this case though the problem was the UI model had the extra unfilled slot in the array as an item, but this is not going through the adapter either, so it might need some transformation as well to match behaviours.

reckart commented 3 years ago

Your PR resolves the duplicates and I opened another issue for excluding slot features from comparison. If we need a flag to include them again, then we should have yet another issue for that flag - but let's wait until anybody actually needs it. I think that, for the time being, this should resolve the issue. Thanks for the PR!

inception-project / inception

Duplicates created from search matches when layer uses link feature #2442