Closed NBA707 closed 1 month ago
It sounds like the issue is related to how TNTSearch handles hyphens with its default tokenizer, which likely splits `ABC123-1` into separate tokens like `ABC123` and `1`. To fix this, you can customize the tokenizer so it does not split at hyphens. After changing the tokenizer, you'll need to rebuild your index for the new setting to take effect.
To check how the tokens are being stored and associated with your documents, you can directly query the SQLite database used by TNTSearch. Look into the `wordlist` and `hitlist` tables: `wordlist` stores the tokens and their document frequency, while `hitlist` maps which token appears in which document.
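The inspection step above can be sketched directly in PHP. This builds an in-memory mock of the `wordlist` table (the column names follow my understanding of TNTSearch's SQLite schema; verify them against your own index file, which lives under the configured `storage` path) and runs the kind of query you would point at the real index:

```php
<?php
// Sketch: a TNTSearch index is a plain SQLite file, so PDO can inspect it.
// We mock the wordlist table in memory; for a real index, change the DSN to
// 'sqlite:' . storage_path('your-index.index') (the filename is a placeholder).
$db = new PDO('sqlite::memory:');
$db->exec('CREATE TABLE wordlist (id INTEGER PRIMARY KEY, term TEXT, num_hits INTEGER, num_docs INTEGER)');
$db->exec("INSERT INTO wordlist (term, num_hits, num_docs) VALUES ('abc123', 1505, 1505), ('1', 900, 900)");

// If 'ABC123-1' was split at the hyphen, only 'abc123' and '1' will be
// present as terms, and 'abc123-1' itself will be missing:
$stmt = $db->query("SELECT term, num_docs FROM wordlist WHERE term LIKE 'abc123%'");
foreach ($stmt as $row) {
    echo $row['term'] . ' => ' . $row['num_docs'] . " docs\n";
}
```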
If you adjust the tokenizer and are still facing issues, make sure to reindex your data so that the changes take effect.
So I tried adding the `-` to `./vendor/teamtnt/tntsearch/src/Support/Tokenizer.php`:

```php
static protected $pattern = '/[^\p{L}\p{N}\p{Pc}\p{Pd}\-@]+/u';
```

Rebuilding it now.
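To sanity-check the change without rebuilding anything, you can run that pattern through `preg_split` yourself: splitting on the "non-token" pattern, after lowercasing, approximates what the tokenizer does. This is a standalone check of the regex only, not of TNTSearch itself:

```php
<?php
// The modified pattern: everything NOT in this character class is treated as
// a token separator. \p{Pd} (dash punctuation) plus the explicit \- keep
// hyphens inside tokens instead of splitting on them.
$pattern = '/[^\p{L}\p{N}\p{Pc}\p{Pd}\-@]+/u';

$tokens = preg_split($pattern, mb_strtolower('Part ABC123-1 in stock'), -1, PREG_SPLIT_NO_EMPTY);
echo implode(' | ', $tokens) . "\n";
// prints: part | abc123-1 | in | stock
```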
Instead of this being hardcoded, in the future could it be in a config file?
TNTSearch comes with a bunch of predefined tokenizers; all of them are here. You might be looking for the `ProductTokenizer`. You can also write your own tokenizer that extends `AbstractTokenizer` and implements `TokenizerInterface`, and then in the config you specify the tokenizer you want to use.
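Rather than patching the vendor file, a custom tokenizer can live in your own code. A minimal sketch, assuming the `AbstractTokenizer` / `TokenizerInterface` pair mentioned above; the class name, namespace, and the exact `tokenize()` signature are illustrative and should be checked against the TNTSearch version you have installed:

```php
<?php

namespace App\Tokenizers;

use TeamTNT\TNTSearch\Support\AbstractTokenizer;
use TeamTNT\TNTSearch\Support\TokenizerInterface;

class PartNumberTokenizer extends AbstractTokenizer implements TokenizerInterface
{
    // \p{Pd} keeps dash punctuation (including '-') inside tokens,
    // so 'ABC123-1' survives as a single token.
    static protected $pattern = '/[^\p{L}\p{N}\p{Pc}\p{Pd}@]+/u';

    public function tokenize($text, $stopwords = [])
    {
        return preg_split(static::$pattern, mb_strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    }
}
```

You would then point the `tokenizer` key in `config/scout.php` at `\App\Tokenizers\PartNumberTokenizer::class` and rebuild the index.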
Ah, so I could use the ProductTokenizer and it would be better?
Yes :)
In `config/scout.php`, how would I set the tokenizer?
```php
'tntsearch' => [
    'storage' => storage_path(), // place where the index files will be stored
    'fuzziness' => env('TNTSEARCH_FUZZINESS', false),
    'fuzzy' => [
        'prefix_length' => 2,
        'max_expansions' => 50,
        'distance' => 2,
        'no_limit' => true,
    ],
    'asYouType' => false,
    'searchBoolean' => env('TNTSEARCH_BOOLEAN', false),
    'maxDocs' => env('TNTSEARCH_MAX_DOCS', 500),
],
```
In the config file try it like this:

```php
'tokenizer' => \TeamTNT\TNTSearch\Support\ProductTokenizer::class,
```
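For context, that key sits alongside the other options inside the existing `tntsearch` array in `config/scout.php`. A sketch, with the surrounding values abbreviated from the config shown earlier in the thread:

```php
'tntsearch' => [
    'storage'   => storage_path(),
    'tokenizer' => \TeamTNT\TNTSearch\Support\ProductTokenizer::class,
    // ...fuzziness, asYouType, searchBoolean, maxDocs, etc. as before...
],
```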
Ah, it looks like if I create a command to build the index, the tokenizer is saved in that index, so search will honor whatever was used at index creation?
Yes, that is correct; you only have to make sure the class exists if you're using a custom tokenizer.
Loving the speed of TNTSearch. All my searches are for part numbers with `-` in them.
One issue I have seen is that searching for `ABC123-1` returns 0 results, while `ABC123` finds about 1,505 records.
How does one go about troubleshooting why searches are failing when the data is in the field being indexed?