openzim / libzim

Reference implementation of the ZIM specification
https://download.openzim.org/release/libzim/
GNU General Public License v2.0

`SearchIterator::getTitle()` gives unaccented title #584

Closed: maneeshpm closed this issue 3 years ago

maneeshpm commented 3 years ago

SearchIterator::getTitle() gives the unaccented title instead of the actual title. That is, even if the title is "DeLorean", we get "delorean" as the output.

When we index the title, the value stored in the title:0 slot of the valuesmap is the unaccented title. This happens because zim::removeAccents() is called in the constructor of DefaultIndexData.

Our unit tests missed this for two reasons: some of them call getTitle() on the dereferenced entry rather than on the search iterator, and the tests that do call it on the search iterator use titles without a mix of upper and lower case. An easy fix is to drop this behavior from the constructor, since we already call zim::removeAccents() explicitly where it is really required, in XapianIndexer::indexTitle(). Suggestions?
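To make the symptom concrete, here is a minimal standalone sketch of the behavior described above. It does not use libzim or Xapian. `normalizeTitle`, `IndexedDoc`, `storeTitleValue` and `getTitleFromSlot` are hypothetical names invented for illustration, and `normalizeTitle` only lowercases ASCII as a stand-in for zim::removeAccents().

```cpp
#include <algorithm>
#include <cctype>
#include <map>
#include <string>

// Hypothetical stand-in for zim::removeAccents(): it only lowercases ASCII,
// which is enough to reproduce the "DeLorean" -> "delorean" symptom.
std::string normalizeTitle(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
        return static_cast<char>(std::tolower(c));
    });
    return s;
}

// Hypothetical model of an indexed document: a set of numbered value slots.
struct IndexedDoc {
    std::map<int, std::string> values;
};

// Mirrors the "title:0" entry of the valuesmap mentioned in this issue.
constexpr int TITLE_SLOT = 0;

// Models the reported behavior: the title is normalized before it is stored.
IndexedDoc storeTitleValue(const std::string& title) {
    IndexedDoc doc;
    doc.values[TITLE_SLOT] = normalizeTitle(title);
    return doc;
}

// Models a getTitle() that reads straight from the value slot,
// so the original casing can no longer be recovered.
std::string getTitleFromSlot(const IndexedDoc& doc) {
    return doc.values.at(TITLE_SLOT);
}
```

In this model, `getTitleFromSlot(storeTitleValue("DeLorean"))` yields "delorean", which is the behavior reported here. A test title that is already all lowercase would pass unchanged and hide the problem.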

maneeshpm commented 3 years ago

ping @mgautierfr

kelson42 commented 3 years ago

The ticket is unclear to me. What concretely is the bug, and what is its impact?

maneeshpm commented 3 years ago

The bug is that SearchIterator::getTitle() returns the unaccented title. That is, if the actual title is "DeLorean", we get "delorean" as the output instead of the title as it is.

mgautierfr commented 3 years ago

I'm not sure what to do here.

See https://getting-started-with-xapian.readthedocs.io/en/latest/concepts/indexing/values.html for background. As explained there, values are meant for querying and sorting, not for storing user information on the document.

We use the title value to sort the results by title when we are in suggestion mode, and we probably want this sort to be unaccented and case insensitive.

We could :

maneeshpm commented 3 years ago

@mgautierfr I think we should go ahead and make it return the title of the entry. This way, we won't break anything that is already working and make the function behave in an expected manner.
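A rough sketch of what I mean, as a standalone model rather than libzim code. `Result`, `makeResult`, `sortKey` and `sortByTitle` are hypothetical names, and `sortKey` only lowercases ASCII as a stand-in for zim::removeAccents(). The normalized value is still used for sorting, while getTitle() returns the entry's own title.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <vector>

// Hypothetical stand-in for zim::removeAccents(): lowercases ASCII only.
std::string sortKey(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
        return static_cast<char>(std::tolower(c));
    });
    return s;
}

// Hypothetical search result: keeps the entry's title and the normalized
// value side by side instead of exposing the normalized value as the title.
struct Result {
    std::string entryTitle;  // title exactly as stored in the entry
    std::string titleValue;  // normalized value, used only for sorting

    const std::string& getTitle() const { return entryTitle; }
};

Result makeResult(const std::string& title) {
    return Result{title, sortKey(title)};
}

// Sorting still uses the normalized value, so the order stays
// case insensitive as it is today.
void sortByTitle(std::vector<Result>& results) {
    std::sort(results.begin(), results.end(),
              [](const Result& a, const Result& b) {
                  return a.titleValue < b.titleValue;
              });
}
```

With the titles "banana", "DeLorean" and "apple", sorting gives apple, banana, DeLorean, and getTitle() on the last result still returns "DeLorean" with its original casing.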