mchaput / whoosh

Pure-Python full-text search library

NestedParent query is missing valid results #31

Open hlbnet opened 2 years ago

hlbnet commented 2 years ago

In some circumstances, a NestedParent query does not return the results it should (see the attached code to reproduce; change the file extension to .py).

First, run the code as-is. The output "Found 1 document(s)" is correct. Then uncomment line 35 and run it again. The output "Found 0 document(s)" is wrong: the query should still find 1 document.

Additionally, you can experiment with this example. If the content of line 35 is moved after line 40, the correct result is returned again. The same happens if you leave the line where it is but change the first name 'Philibert' to something else.

Using Whoosh 2.7.4 on Windows 10 with Python 3.10.4.

The bug occurs when a document matches the 'children' criteria of the NestedParent query but is itself not in any group, or is in a group that does not match the 'parent' criteria. This document has no reason to be returned, and it is not (good). However, the mere existence of this document prevents the query from returning any subsequent result, even if other documents exist that should be returned by the query.

nestedparentbug.txt
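
The condition described above can be illustrated with a toy model in plain Python. This is not Whoosh code and not the attached file: the document fields (`kind`, `name`, `firstname`), the function `find_parents`, and the `stop_on_orphan` switch are all hypothetical names chosen for the sketch. It assumes that a matching child is attributed to the nearest parent at or before it in document order, and that the faulty behaviour amounts to giving up as soon as a matching child has no such parent. That second assumption is only one way to reproduce the three observations reported here, not a confirmed diagnosis.

```python
from typing import Callable, Dict, List

Doc = Dict[str, str]


def find_parents(
    docs: List[Doc],
    is_parent: Callable[[Doc], bool],
    is_child: Callable[[Doc], bool],
    stop_on_orphan: bool,
) -> List[int]:
    """Return positions of parents that have at least one matching child.

    Assumption of this toy model: a child belongs to the nearest parent
    at or before it in document order.
    """
    parent_positions = [i for i, d in enumerate(docs) if is_parent(d)]
    found: List[int] = []
    for i, doc in enumerate(docs):
        if not is_child(doc):
            continue
        preceding = [p for p in parent_positions if p <= i]
        if not preceding:
            # Orphan: a matching child with no parent before it.
            if stop_on_orphan:
                break      # models the reported symptom: nothing further is returned
            continue       # expected behaviour: ignore the orphan and carry on
        if preceding[-1] not in found:
            found.append(preceding[-1])
    return found


def is_parent(doc: Doc) -> bool:
    return doc.get("kind") == "class"


def is_child(doc: Doc) -> bool:
    return doc.get("firstname") == "Philibert"


# A group (parent followed by its child) and a lone document outside any group.
group = [
    {"kind": "class", "name": "Index"},
    {"kind": "person", "firstname": "Philibert"},
]
orphan = {"kind": "person", "firstname": "Philibert"}
renamed = {"kind": "person", "firstname": "Anselme"}

docs_before = [orphan] + group    # lone document indexed before the group
docs_after = group + [orphan]     # lone document indexed after the group
docs_renamed = [renamed] + group  # lone document no longer matches the child criteria

for label, docs in [("before", docs_before), ("after", docs_after), ("renamed", docs_renamed)]:
    hits = find_parents(docs, is_parent, is_child, stop_on_orphan=True)
    print(f"{label}: Found {len(hits)} document(s)")
# before: Found 0 document(s)
# after: Found 1 document(s)
# renamed: Found 1 document(s)

# With the orphan skipped instead, the "before" layout also finds the parent.
print(len(find_parents(docs_before, is_parent, is_child, stop_on_orphan=False)))  # 1
```

In this model the lone document is never returned in either mode. The only difference is whether it ends the search for everything that comes after it.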

hlbnet commented 1 year ago

While trying to understand this bug, I realised that the methods 'start_group' and 'end_group' are not implemented. Both are defined in the class writing.IndexWriter with a single-line body, 'pass', and they are not overridden in the subclass writing.SegmentWriter or in any other subclass.

Given that, we can deduce that the parent-child relation is not stored in the index. The conclusion is that the NestedParent and NestedChildren queries cannot work properly, because the information they require is not present in the index.
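
For anyone who wants to repeat the check on 'start_group' and 'end_group' without reading the source, a small helper can report whether a subclass redefines a method inherited from a base class. The helper `overrides` and the three stand-in classes below are hypothetical and exist only to make the sketch self-contained; they are not part of Whoosh.

```python
def overrides(cls: type, base: type, name: str) -> bool:
    """Return True if `cls` resolves `name` to something other than `base`'s version."""
    return getattr(cls, name) is not getattr(base, name)


# Hypothetical stand-ins, used only to exercise the helper.
class BaseWriter:
    def start_group(self):
        pass

    def end_group(self):
        pass


class PlainWriter(BaseWriter):
    # Inherits both no-op methods unchanged.
    pass


class GroupingWriter(BaseWriter):
    # Redefines only start_group.
    def start_group(self):
        self.grouping = True


print(overrides(PlainWriter, BaseWriter, "start_group"))     # False
print(overrides(GroupingWriter, BaseWriter, "start_group"))  # True
print(overrides(GroupingWriter, BaseWriter, "end_group"))    # False
```

With Whoosh installed, the same call could be made with writing.IndexWriter as `base` and writing.SegmentWriter (or another writer class) as `cls`. No result for those classes is claimed here.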