nextapps-de / flexsearch

Next-Generation full text search library for Browser and Node.js
Apache License 2.0

@0.6.32 : index incorrectly sorts items that are repeated #234

Closed haysclark closed 3 years ago

haysclark commented 3 years ago

I ran into this issue when adding the filter option to create(). Basically, when you repeat the same word over and over, the number of times it was used is lost if it's the only word in the string, which results in incorrect sorting when performing a search. Maybe "Tora! Tora! Tora!" would be a real-world example.

I put together an interactive demo of the issue on CodeSandbox.

import FlexSearch from "flexsearch"

type IndexableDocument = Record<string, unknown>

const index = FlexSearch.create<IndexableDocument>({
  filter: value => value.length > 1,
})
const documents: IndexableDocument[] = [
  { id: `one`, body: `bird cat dog` },
  { id: `two`, body: `cat dog cat` },
  { id: `none`, body: `bird fish dog` },
  { id: `five`, body: `cat fish cat fish cat cat cat` },
  { id: `four`, body: `cat bird cat cat cat` },
  { id: `three`, body: `cat cat cat` },
]
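documents.forEach(({ id, ...indexFields }) => {
  // Index every field except the id, serialized as one JSON string.
  const serializedDoc = JSON.stringify(indexFields)
  // Note: the ids above are strings; `as number` is only a compile-time cast.
  index.add(id as number, serializedDoc)
})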

const results = index.search("cat")
console.log(results) 

// outputs: [ 'five', 'four', 'one', 'two', 'three' ]
// expected: [ 'five', 'four', 'three', 'two', 'one' ]

haysclark commented 3 years ago

Hmmm... I think I will close this because the example I put together was too contrived. Using a 'filter' does not add sorting; it was just a coincidence that items were sorted in the order I was hoping for.
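
For readers who do want the ordering from the "expected" line above, one option is to re-rank the returned ids in application code. The following is only a minimal sketch under that assumption: `countTerm` and `rankByTermCount` are hypothetical helpers, not part of FlexSearch, and they do not describe how FlexSearch scores results. The sample data is copied from the example in this issue.

```typescript
type Doc = { id: string; body: string }

// Hypothetical helper (not a FlexSearch API): count whole-word occurrences of `term`.
function countTerm(body: string, term: string): number {
  return body.split(/\s+/).filter(word => word === term).length
}

// Hypothetical helper: return a copy of `ids` ordered by descending term count.
function rankByTermCount(docs: Doc[], ids: string[], term: string): string[] {
  const counts = new Map<string, number>()
  for (const doc of docs) {
    counts.set(doc.id, countTerm(doc.body, term))
  }
  return ids.slice().sort((a, b) => (counts.get(b) || 0) - (counts.get(a) || 0))
}

const docs: Doc[] = [
  { id: "one", body: "bird cat dog" },
  { id: "two", body: "cat dog cat" },
  { id: "none", body: "bird fish dog" },
  { id: "five", body: "cat fish cat fish cat cat cat" },
  { id: "four", body: "cat bird cat cat cat" },
  { id: "three", body: "cat cat cat" },
]

// The ids in the order reported in the issue.
const reported = ["five", "four", "one", "two", "three"]
const ranked = rankByTermCount(docs, reported, "cat")
console.log(ranked) // order: five, four, three, two, one
```

This keeps the index untouched and only reorders the ids it returns, so it costs one extra pass over the matched documents per query.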

ts-thomas commented 3 years ago

Your example does not cover a real-world use case. The entropy of your pseudo-content is just too low. FlexSearch is optimized for real full-text search. The result from your example isn't wrong. You just expect a different sorting, which you will get when you add real text content and use real queries (which probably consist of more than one term).

ts-thomas commented 3 years ago

A solution could be found here: #236