panosc-eu / panosc-search-scoring


`diffraction` vs `diff` #12

Open antolinos opened 2 years ago

antolinos commented 2 years ago

Since the implementation does stemming, described as:

applying stemming to all words and create terms

I was expecting query=diff to match more items than query=diffraction.

However, my tests show the opposite: the query diffraction returns 239 results, whilst the query diff returns 0 results.

For the record, this is the list of terms that contain the string diff:

JSON.stringify(terms.filter(t => t.term.search("diff") != -1))
[
{"term":"ndiffract","numberOfItems":1,"numberOfGroups":1},
{"term":"micodiffract","numberOfItems":1,"numberOfGroups":1},
{"term":"4864differ","numberOfItems":1,"numberOfGroups":1},
{"term":"diffraction10","numberOfItems":7,"numberOfGroups":1},
{"term":"interdiffus","numberOfItems":1,"numberOfGroups":1},
{"term":"difficil","numberOfItems":1,"numberOfGroups":1},
{"term":"diffract","numberOfItems":239,"numberOfGroups":1},
{"term":"206diffract","numberOfItems":1,"numberOfGroups":1},
{"term":"nanodiffraction10","numberOfItems":1,"numberOfGroups":1},
{"term":"difficulti","numberOfItems":5,"numberOfGroups":1},
{"term":"differenti","numberOfItems":4,"numberOfGroups":1},
{"term":"microdiffraction10","numberOfItems":3,"numberOfGroups":1},
{"term":"microdiffract","numberOfItems":3,"numberOfGroups":1},
{"term":"nanodiffract","numberOfItems":2,"numberOfGroups":1},
{"term":"diffus","numberOfItems":15,"numberOfGroups":1},
{"term":"difficult","numberOfItems":38,"numberOfGroups":1},
{"term":"differ","numberOfItems":322,"numberOfGroups":1},
{"term":"diffractomet","numberOfItems":2,"numberOfGroups":1}]

My questions are:

  1. Shouldn't we expect a term called diff, since it is the root?

  2. Is it a problem if a user queries for diff and gets no results?

nitrosx commented 2 years ago

@antolinos: at the moment, the scoring does not apply partial matching to the terms, so the case you highlighted in this issue can indeed happen.
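To make the exact-matching behaviour concrete, here is a minimal sketch. It is not the actual panosc-search-scoring code: `toyStem` is a hypothetical stand-in that strips only a few suffixes, and `exactMatchCount` is an illustrative helper. The counts are copied from the term list posted above.

```javascript
// Subset of the term list posted earlier in this issue.
const terms = [
  { term: "diffract", numberOfItems: 239 },
  { term: "differ", numberOfItems: 322 },
  { term: "diffus", numberOfItems: 15 },
  { term: "difficult", numberOfItems: 38 },
];

// Hypothetical toy stemmer: strips a few common suffixes only.
// It is NOT the stemmer used by the scoring service.
function toyStem(word) {
  const lower = word.toLowerCase();
  for (const suffix of ["ion", "ing", "ed", "s"]) {
    if (lower.length > suffix.length + 2 && lower.endsWith(suffix)) {
      return lower.slice(0, -suffix.length);
    }
  }
  return lower;
}

// Exact matching: the stemmed query only hits an identical stored term.
function exactMatchCount(query) {
  const stem = toyStem(query);
  const hit = terms.find((t) => t.term === stem);
  return hit ? hit.numberOfItems : 0;
}

console.log(exactMatchCount("diffraction")); // 239
console.log(exactMatchCount("diff"));        // 0
```

In this sketch, diffraction is reduced to diffract, which exists as a term, while diff has no suffix to strip and no stored term equals it, so nothing is returned.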

To answer your questions:

  1. If you search for diff, only the items that contain exactly the term generated by applying lemmatization to the word diff will be scored and returned. If we want to return all the items that contain a term which includes diff, we should discuss this topic further.
  2. At the moment, I do not see that as a problem, but I'm open to discussing the topic and how to implement the necessary changes if they are approved.
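If partial matching were approved, one possible shape for it is prefix matching on the stored terms. The sketch below is only an assumption about how that could look, not an agreed design: `prefixMatches` is a hypothetical helper, and the data is a subset of the term list posted above.

```javascript
// Subset of the term list posted earlier in this issue.
const terms = [
  { term: "diffract", numberOfItems: 239 },
  { term: "diffractomet", numberOfItems: 2 },
  { term: "differ", numberOfItems: 322 },
  { term: "diffus", numberOfItems: 15 },
  { term: "microdiffract", numberOfItems: 3 },
];

// Hypothetical partial matching: keep every stored term that starts
// with the query term, instead of requiring an identical term.
function prefixMatches(queryTerm) {
  return terms.filter((t) => t.term.startsWith(queryTerm));
}

console.log(prefixMatches("diff").map((t) => t.term));
// [ 'diffract', 'diffractomet', 'differ', 'diffus' ]
console.log(prefixMatches("diffract").map((t) => t.term));
// [ 'diffract', 'diffractomet' ]
```

Note that a query for diff would then also pull in unrelated terms such as differ and diffus, and that microdiffract is missed by a prefix rule. Both points would need to be weighed in the discussion.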