uvacw / inca

24 stars 6 forks source link

_type is deprecated, change ES-queries #288

Closed damian0604 closed 6 years ago

damian0604 commented 6 years ago

We recently changed querying _doctype to _type in order to solve the issue that the queries returned non-exact matches, which is unacceptable for functions like delete_doctype().

However, it seems that this is not the perfect solution, as ES will phase out _type. Thanks to Erik for pointing this out:

ik kwam op de inca issue-tracker de volgende (opgeloste) bug tegen: https://github.com/uvacw/inca/issues/250. De uiteindelijke fix voor deze bug was het gebruik van de _type variabele ipv doctype. Echter gaat ES het _type-veld uitfaseren (zie https://www.elastic.co/guide/en/elasticsearch/reference/current/removal-of-types.html). Ook staat er hier (https://github.com/uvacw/inca/pull/256) dat [inca.core.database.delete_doctype('trouw (www)')] niet werkt, maar [inca.core.database.delete_doctype('geenstijl')] wel. Het verschil tussen deze twee queries is dat de eerste bestaat uit twee woorden (trouw en www), terwijl de tweede query maar één woord bevat (geenstijl). De oplossing hiervoor (zonder gebruik van de depricated _type variabele) is naar mijn idee om alle "match" clauses te vervangen door "term" clauses. Volgens de ES documentatie is het verschil als volgt:

The term query finds documents that contain the exact term specified in the inverted index. https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-term-query.html The match query is of type boolean. It means that the text provided is analyzed and the analysis process constructs a boolean query from the provided text. The operator flag can be set to or or and to control the boolean clauses (defaults to or). https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query.html

Omdat "match" default naar [or], levert een "match" query dus resultaten op waarbij doctype "trouw" of "(www)" bevat, terwijl een "term" query alleen resultaten oplevert waarbij doctype exact "trouw (www)" is.

Hoop dat jullie hier iets aan hebben/mee kunnen.

Erik

We should therefore check out whether Erik's proposed solution works and if so, change our code accordibly.

damian0604 commented 6 years ago

@FeLoe as you have already worked with these things, this might be something you could take up. Feel free to say no if you are busy with other stuff, though! (if you take it, just self-assign the issue to yourself).

bobvdvelde commented 6 years ago

Good comment. We should be mindfull that deployments that did not use the right mappings might show unexpected results due to analysis even with term matching. More on this later...

FeLoe commented 6 years ago

Yes, I can take a look at this - I think that Erik's solution should work, I will check that the next days and adjust the code if it works.

FeLoe commented 6 years ago

Just a short question: Did I understand it correctly that Erik's solution implied that using the "doctype" field in combination with term queries could solve the problem as this is a sort of explicit type field as mentioned as solution for the deprecated _type (https://www.elastic.co/guide/en/elasticsearch/reference/current/removal-of-types.html)? In that case, the problem is that "doctype" seems to be mapped as type "text" and not as type "keyword" in the database entries. This leads to problems with term queries as only individual words can be searched and not specific combinations of words (see also here https://www.elastic.co/guide/en/elasticsearch/reference/6.0/query-dsl-term-query.html).

So if I understood the solution correctly we will likely run into problems with it...

damian0604 commented 6 years ago

I think that was what @bobvdvelde was hinting at... But given that there are not hundreds of computers running inca at the moment ;-), but only a few (all of us owned by us) we might still want to do it, given that we can convert the existing field in the existing running deployments

bobvdvelde commented 6 years ago

Hi @damian0604 and @FeLoe , that was indeed what I was hinting at (sorry, stuff came up and I had to run). In the new schema, the doctype field is mapped as keyword and not analyzed, which should do exact matching on the entire string when using term queries (basically, where the query is itself not analyzed). Because the schema.json was not forwarded to Elasticsearch in early implementations of the ES5 migration, this may mean that older instances will not have the right mapping and therefore run into the issues you describe @FeLoe .

FeLoe commented 6 years ago

Would then a temporary solution combining _type and term queries be best for now? Problem is that I cannot test the combination of doctype and term queries as I apparently have an older instance on my laptop thus it won't work correctly. Or how do I get the newer implementation? I also found some more "missing" and "and" queries, so I could fix this all in one.

bobvdvelde commented 6 years ago

Hi @FeLoe ,

The mapping is automatically applied when the 'inca' index is missing. In older versions, the current implementation should work because the _type field is still supported. If you switch to running an ES database where the _type is deprecated, you probably start without an inca index and thus get the new mapping. If you want to update your mapping, than I believe you will need to re-index all your records. I'm hoping to finish some simple importers/exporters today so you can easily dump records and load them later.

If you want to know how mappings are set, have a look at the core/database.py file and the mapping specification in the schema.json file.

FeLoe commented 6 years ago

Thank you for the explanation @bobvdvelde - the _type field is still supported in my mappings - should I now make the changes (use term queries but still work with _type) as this solution will not work with using the "doctype" field? It will then be temporary but still work better than the current solution.

damian0604 commented 6 years ago

@bobvdvelde is on holiday, but I think the best solution would be to basically check whether the doctype field is mapped as keyword and not analyzed (there must be a function to check this), and if so, use the new approach. Else use _type. In the latter case, we could additionally trigger a warning like this one: logger.warning('Deprecation warning: You still seem to be using an INCA-index in Elsticsearch in without the correct schema. Please migrate, as we will remove support for these databases in the future.'

damian0604 commented 6 years ago

Thanks to @FeLoe , we have now a solution that supports both installations with and without the correct database scheme (see PR #313 )