Problem accessing the document ID

sdauletau / elasticsearch-position-similarity

Elasticsearch term position similarity plugin

Apache License 2.0

70 stars, 22 forks

Problem accessing the document ID #1

Open mhmen opened 7 years ago

mhmen commented 7 years ago

Hi, I used your plugin to develop my own similarity plugin. Everything works well except for one thing: in the score() method I can score each matching term of a document, but Lucene itself sums the scores of the matching terms for each document. I want to return the maximum score of the matching terms as the document's score instead. After many attempts, I haven't found a way to tell that two matching terms belong to the same document. Is there any way to do so? Thanks in advance, MHM

sdauletau commented 7 years ago

You have access to all terms in the score method. I sum up all the scores, but you can select the max instead.

mhmen commented 7 years ago

Thanks for your reply. My problem is that the score method is called once for each matching term, and each time termStats contains only that one term. So I can score each matching term individually, but not all the matching terms of a document together. Lucene itself then sums the scores of the matching terms and returns that sum as the score of the document.

Using context.reader().document(doc).getField("_uid").stringValue(); in the score method, I can get the uid of the document each matching term belongs to. To make use of it, though, I would need something like a HashMap, which takes a huge amount of Java heap memory when there are many matching terms and is not efficient at all. The other problem with this approach is that I would need to clear the HashMap for every new query, but I haven't found any method that is called only when a new query arrives.
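For readers following along, the HashMap idea described above could look roughly like the sketch below. It is plain Java with no Lucene types, and the class and method names (PerQueryMaxScores, record, reset) are hypothetical, not part of Lucene or the plugin. It assumes the caller already has the document uid for each per-term score and has some way to signal the start of a new query, which is exactly the hook the comment says is missing.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper (not a Lucene or plugin class): remembers the best
// per-term score seen so far for each document uid within one query.
final class PerQueryMaxScores {
    private final Map<String, Float> bestByUid = new HashMap<>();

    // Record one per-term score and return the running maximum for that uid.
    float record(String uid, float termScore) {
        return bestByUid.merge(uid, termScore, Float::max);
    }

    // Assumption: something outside this class knows when a new query starts
    // and calls reset(); the thread does not identify such a callback.
    void reset() {
        bestByUid.clear();
    }

    // One entry is held per matching document, which is the memory cost
    // raised in the comment above.
    int trackedDocuments() {
        return bestByUid.size();
    }
}
```

The map grows with the number of matching documents, so the sketch illustrates the memory concern and does not solve it.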

mhmen commented 7 years ago

Let me put it more clearly. I want a custom similarity that scores documents based on their edit distance to the query terms, and I want to use the maximum of the term scores, not their sum. But Lucene sums them up, and keeping the maximum for each document requires something like a HashMap, which is not memory efficient at all. We also cannot clear it, because there is no method that is called only when a new query arrives.
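To make the goal concrete, here is a minimal standalone sketch of edit-distance scoring that takes the maximum over a document's matching terms. It is plain Java with no Lucene types and does not show how to hook into the similarity API. The class name EditDistanceMaxScore, the single-query-term signature, and the normalisation (1 minus distance divided by the longer length) are all assumptions made for illustration.

```java
// Hypothetical illustration (not a Lucene or plugin class).
final class EditDistanceMaxScore {

    // Levenshtein distance computed with two rolling rows.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // distance to the empty prefix of b
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Assumed normalisation: 1.0 for identical strings, 0.0 when every
    // character differs.
    static float similarity(String queryTerm, String docTerm) {
        int longer = Math.max(queryTerm.length(), docTerm.length());
        if (longer == 0) {
            return 1.0f;
        }
        return 1.0f - (float) levenshtein(queryTerm, docTerm) / longer;
    }

    // Document score is the best similarity among its matching terms,
    // rather than the sum of the per-term scores.
    static float maxScore(String queryTerm, String[] matchingDocTerms) {
        float best = 0.0f;
        for (String term : matchingDocTerms) {
            best = Math.max(best, similarity(queryTerm, term));
        }
        return best;
    }
}
```

This only works when all matching terms of a document are available in one place, which is the part the thread has not yet resolved.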

mhmen commented 7 years ago

The other problem, which is degrading performance, is that I don't have access to the query terms. As a workaround, I decided to use a function score query to pass in a parameter, and I wrote a custom script engine service to read that parameter. Is there any way to access the query terms directly?