Closed NBA707 closed 1 month ago
It sounds like the issue is related to how TNTSearch handles hyphens with its default tokenizer, which likely splits `ABC123-1` into separate tokens like `ABC123` and `1`. To fix this, you can customize the tokenizer so it does not split at hyphens. After changing the tokenizer, you'll need to rebuild your index for the new setting to take effect.
To check how the tokens are being stored and associated with your documents, you can directly query the SQLite database used by TNTSearch. Look into the `wordlist` and `hitlist` tables: `wordlist` stores the tokens and their document frequency, while `hitlist` maps which token appears in which document.
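The inspection step above can be sketched directly in PHP. This builds an in-memory mock of the `wordlist` table (the column names follow my understanding of TNTSearch's SQLite schema; verify them against your own index file, which lives under the configured `storage` path) and runs the kind of query you would point at the real index:

```php
<?php
// Sketch: a TNTSearch index is a plain SQLite file, so PDO can inspect it.
// We mock the wordlist table in memory; for a real index, change the DSN to
// 'sqlite:' . storage_path('your-index.index') (the filename is a placeholder).
$db = new PDO('sqlite::memory:');
$db->exec('CREATE TABLE wordlist (id INTEGER PRIMARY KEY, term TEXT, num_hits INTEGER, num_docs INTEGER)');
$db->exec("INSERT INTO wordlist (term, num_hits, num_docs) VALUES ('abc123', 1505, 1505), ('1', 900, 900)");

// If 'ABC123-1' was split at the hyphen, only 'abc123' and '1' will be
// present as terms, and 'abc123-1' itself will be missing:
$stmt = $db->query("SELECT term, num_docs FROM wordlist WHERE term LIKE 'abc123%'");
foreach ($stmt as $row) {
    echo $row['term'] . ' => ' . $row['num_docs'] . " docs\n";
}
```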
If you adjust the tokenizer and are still facing issues, make sure to reindex your data so that the changes take effect.
So I tried adding the `-` to `./vendor/teamtnt/tntsearch/src/Support/Tokenizer.php`:

```php
static protected $pattern = '/[^\p{L}\p{N}\p{Pc}\p{Pd}\-@]+/u';
```

Rebuilding it now.
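To sanity-check the change without rebuilding anything, you can run that pattern through `preg_split` yourself: splitting on the "non-token" pattern, after lowercasing, approximates what the tokenizer does. This is a standalone check of the regex only, not of TNTSearch itself:

```php
<?php
// The modified pattern: everything NOT in this character class is treated as
// a token separator. \p{Pd} (dash punctuation) plus the explicit \- keep
// hyphens inside tokens instead of splitting on them.
$pattern = '/[^\p{L}\p{N}\p{Pc}\p{Pd}\-@]+/u';

$tokens = preg_split($pattern, mb_strtolower('Part ABC123-1 in stock'), -1, PREG_SPLIT_NO_EMPTY);
echo implode(' | ', $tokens) . "\n";
// prints: part | abc123-1 | in | stock
```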
Instead of this being hardcoded, in the future could it be in a config file?
TNTSearch comes with a bunch of predefined tokenizers; all of them are here. You might be looking for the `ProductTokenizer`. You can also write your own tokenizer that extends `AbstractTokenizer` and implements `TokenizerInterface`, and then in the config you specify the tokenizer you want to use.
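Rather than patching the vendor file, a custom tokenizer can live in your own code. A minimal sketch, assuming the `AbstractTokenizer` / `TokenizerInterface` pair mentioned above; the class name, namespace, and the exact `tokenize()` signature are illustrative and should be checked against the TNTSearch version you have installed:

```php
<?php

namespace App\Tokenizers;

use TeamTNT\TNTSearch\Support\AbstractTokenizer;
use TeamTNT\TNTSearch\Support\TokenizerInterface;

class PartNumberTokenizer extends AbstractTokenizer implements TokenizerInterface
{
    // \p{Pd} keeps dash punctuation (including '-') inside tokens,
    // so 'ABC123-1' survives as a single token.
    static protected $pattern = '/[^\p{L}\p{N}\p{Pc}\p{Pd}@]+/u';

    public function tokenize($text, $stopwords = [])
    {
        return preg_split(static::$pattern, mb_strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    }
}
```

You would then point the `tokenizer` key in `config/scout.php` at `\App\Tokenizers\PartNumberTokenizer::class` and rebuild the index.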
Ah, so I could use the ProductTokenizer and it would be better?
Yes :)
In `config/scout.php`, how would I set the tokenizer?
```php
'tntsearch' => [
    'storage' => storage_path(), // place where the index files will be stored
    'fuzziness' => env('TNTSEARCH_FUZZINESS', false),
    'fuzzy' => [
        'prefix_length' => 2,
        'max_expansions' => 50,
        'distance' => 2,
        'no_limit' => true,
    ],
    'asYouType' => false,
    'searchBoolean' => env('TNTSEARCH_BOOLEAN', false),
    'maxDocs' => env('TNTSEARCH_MAX_DOCS', 500),
],
```
In the config file try it like this:

```php
'tokenizer' => \TeamTNT\TNTSearch\Support\ProductTokenizer::class,
```
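For context, that key sits alongside the other options inside the existing `tntsearch` array in `config/scout.php`. A sketch, with the surrounding values abbreviated from the config shown earlier in the thread:

```php
'tntsearch' => [
    'storage'   => storage_path(),
    'tokenizer' => \TeamTNT\TNTSearch\Support\ProductTokenizer::class,
    // ...fuzziness, asYouType, searchBoolean, maxDocs, etc. as before...
],
```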
Ah, it looks like if I create a command to build the index, the tokenizer is saved in that index, so search will honor whatever was used at index creation?
Yes, that is correct; you only have to make sure the class exists if you're using a custom tokenizer.
Loving the speed of TNTSearch. All my searches are for part numbers with `-` in them.
One issue I have seen is that searching for `ABC123-1` returns 0 results, while `ABC123` finds about 1,505 records.
How does one go about troubleshooting why searches are failing when the data is in the field being indexed?