Open stell opened 3 years ago
Hello,
I noticed the same kind of issue on YunoHost's documentation. If I search for "revers", I get 5 pages with "reverse" or "Reverse" results, but if I add an "e" (for "reverse"), I get only one result.
Everything is up to date and freshly indexed. I have two pages containing the word "geokoordinaten" which is German and stands for Geocoordinates.
When searching for "geokoordinaten" I get 2 results, which is correct.
"geokoordina" outputs 0 results. "geokoo" outputs 2 results again.
On the other hand, if I search for "geoko" I get 2 results again, but different pages. Shouldn't this output 4 pages?
I don't know if this is an issue or something is badly set up, but I don't see what the problem is here.
You can try it out here: https://www.jfewo.de/docs/de/suche?q=geokoordinaten
This is the config:
enabled: true
search_route: /suche
query_route: /s
built_in_css: true
built_in_js: true
built_in_search_page: true
enable_admin_page_events: true
search_type: auto
fuzzy: false
phrases: true
stemmer: german
display_route: true
display_hits: true
display_time: true
live_uri_update: true
limit: "20"
min: "4"
snippet: "300"
index_page_by_default: true
scheduled_index:
  enabled: false
  at: "* 2 * * 1-7"
  logs: logs/tntsearch-index.out
filter:
  items:
    - root@.descendants
powered_by: true
search_object_type: Grav
Try disabling the stemmer and rebuilding the index after that. See if that helps.
Already tried that. Same irregular results.
Just tried your link and it shows the same 2 results for "geokoordina" and "geokoo".
Cannot confirm.
"geokoordina" outputs 0 results
"geokoo" outputs 2 results
"geoko" outputs 2 (different) results
In fact, all of these should output more than 2 results.
I don't know enough about the German language, but the difference between "geokoordina" and "geokoordinat" really looks like a stemmer issue. With the stemmer set to 'no', both queries should return the same results.
Yeah, I thought the same.
Without stemmer:
[]# php -r " include('Stemmer.php'); require_once ('NoStemmer.php'); echo TeamTNT\TNTSearch\Stemmer\NoStemmer::stem('geokoordina').PHP_EOL;"
geokoordina
[]# php -r " include('Stemmer.php'); require_once ('NoStemmer.php'); echo TeamTNT\TNTSearch\Stemmer\NoStemmer::stem('geokoordinat').PHP_EOL;"
geokoordinat
[]# php -r " include('Stemmer.php'); require_once ('NoStemmer.php'); echo TeamTNT\TNTSearch\Stemmer\NoStemmer::stem('geokoordinaten').PHP_EOL;"
geokoordinaten
With stemmer:
[]# php -r " include('Stemmer.php'); require_once ('GermanStemmer.php'); echo TeamTNT\TNTSearch\Stemmer\GermanStemmer::stem('geokoordina').PHP_EOL;"
geokoordina
[]# php -r " include('Stemmer.php'); require_once ('GermanStemmer.php'); echo TeamTNT\TNTSearch\Stemmer\GermanStemmer::stem('geokoordinat').PHP_EOL;"
geokoordinat
[]# php -r " include('Stemmer.php'); require_once ('GermanStemmer.php'); echo TeamTNT\TNTSearch\Stemmer\GermanStemmer::stem('geokoordinaten').PHP_EOL;"
geokoordinat
Check whether you really disabled the stemmer, i.e. set it to 'no', because at least the last difference is a stemmer issue. Stemming is a complex beast, so to debug this issue properly I would disable it for now.
Like I said, I tested it without the stemmer at the beginning and two more times afterwards. The results are the same.
I tested setting stemmer: no and stemmer: "no", which is saved from the admin backend.
Same here:
But if I add the final "e", I get no results:
My config:
enabled: true
search_route: /search
query_route: /s
built_in_css: true
built_in_js: true
built_in_search_page: true
enable_admin_page_events: true
search_type: auto
fuzzy: true
phrases: true
stemmer: "no"
display_route: true
display_hits: true
display_time: true
live_uri_update: true
limit: '20'
min: '3'
snippet: '300'
index_page_by_default: true
scheduled_index:
  enabled: false
  at: '30 3 * * *'
  logs: logs/tntsearch-index.out
filter:
  items:
    - root@.descendants
published: true
powered_by: false
search_object_type: Grav
PHP:
PHP 7.4.21 (cli) (built: Jun 29 2021 15:17:15) ( NTS )
Copyright (c) The PHP Group
Zend Engine v3.4.0, Copyright (c) Zend Technologies
with Zend OPcache v7.4.21, Copyright (c), by Zend Technologies
Grav and TNTSearch are up to date.
Did you rebuild the index after disabling the stemmer? Delete the old index file completely.
Again this looks like a German stemmer issue:
[]# php -r " include('Stemmer.php'); require_once ('GermanStemmer.php'); echo TeamTNT\TNTSearch\Stemmer\GermanStemmer::stem('spectr').PHP_EOL;"
spectr
[]# php -r " include('Stemmer.php'); require_once ('GermanStemmer.php'); echo TeamTNT\TNTSearch\Stemmer\GermanStemmer::stem('spectre').PHP_EOL;"
spectr
[]# php -r " include('Stemmer.php'); require_once ('NoStemmer.php'); echo TeamTNT\TNTSearch\Stemmer\NoStemmer::stem('spectr').PHP_EOL;"
spectr
[]# php -r " include('Stemmer.php'); require_once ('NoStemmer.php'); echo TeamTNT\TNTSearch\Stemmer\NoStemmer::stem('spectre').PHP_EOL;"
spectre
Of course, I've rebuilt the index several times. And the German stemmer was never activated at any time. I've even deleted the index files before reindexing to make sure they are built from scratch.
...
Added 335 /de/software/autohotkey/u3helper/tutorial
Added 336 /de/software/autohotkey/u3helper/u3helper-vs-packagefactory
Total rows 336
Indexed in 6.8s
mbirth@server /...> bin/plugin tntsearch query Spectre
{
"number_of_hits": 0,
"execution_time": "1.0729 ms"
}
mbirth@server /...>
EDIT: It looks like the indexing process does something weird, as I can't find the word "Spectre" in the index:
sqlite> select * from wordlist where term="spectre";
sqlite> select * from wordlist where term="spectr";
id|term|num_hits|num_docs
5201|spectr|9|4
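One possible reading of the SQLite output above is a mismatch between the stemmer used at index time and the one used at query time. The following is a minimal sketch of that effect, not TNTSearch's actual code: `$toyStem`, `buildWordlist()` and `lookup()` are hypothetical names, and the toy stemmer (which only strips one trailing "e") is an assumption standing in for a real algorithm such as Porter.

```php
<?php
// Toy stand-in for a stemmer: lowercases and strips one trailing "e".
// Hypothetical; this is NOT the Porter algorithm.
$toyStem = fn(string $w): string => preg_replace('/e$/', '', strtolower($w));

// Stand-in for a disabled stemmer: only lowercases the word.
$noStem = fn(string $w): string => strtolower($w);

// Build a term => hit-count map, applying the given stemmer to every word.
function buildWordlist(array $words, callable $stem): array
{
    $wordlist = [];
    foreach ($words as $w) {
        $term = $stem($w);
        $wordlist[$term] = ($wordlist[$term] ?? 0) + 1;
    }
    return $wordlist;
}

// Exact-match lookup of a query term, after applying the given stemmer.
function lookup(array $wordlist, string $query, callable $stem): int
{
    return $wordlist[$stem($query)] ?? 0;
}

// Index built WITH the toy stemmer: "Spectre" is stored as "spectr".
$index = buildWordlist(['Spectre', 'spectre', 'Meltdown'], $toyStem);

echo lookup($index, 'Spectre', $noStem), PHP_EOL;  // 0: "spectre" is not in the wordlist
echo lookup($index, 'Spectre', $toyStem), PHP_EOL; // 2: "spectr" is
```

If the two sides disagree like this, the full word finds nothing while a shorter form can still match, which would be consistent with the wordlist rows shown above.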
Same problem is still present here.
Okay, I don't know why, but it seems the Indexer still used the PorterStemmer even though I had "no" in my config. Now after changing the value via the Grav Admin interface and then setting it back to "no" via text editor (the same thing I did yesterday), it seems to work correctly and the word "Spectre" is indexed fine.
On a side note: selecting "Disable" in the Grav Admin interface turns the YAML into stemmer: no, which translates into stemmer: false or stemmer: 0 and makes the indexer try to load a class 0Stemmer, which fails. The correct entry has to be stemmer: 'no' for it to work.
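To illustrate why the quoting matters, here is a small sketch that reuses the class-name expression from `TNTIndexer::setLanguage()` as it appears in the diff in this thread. The helper name `stemmerClassFor()` is hypothetical, and the explicit `(string)` cast is an assumption added so the sketch behaves the same under strict typing; the YAML parsing step itself is not reproduced here.

```php
<?php
// Hypothetical helper mirroring the expression used in TNTIndexer::setLanguage():
//   'TeamTNT\\TNTSearch\\Stemmer\\'.ucfirst(strtolower($language)).'Stemmer'
function stemmerClassFor($language): string
{
    return 'TeamTNT\\TNTSearch\\Stemmer\\'
        . ucfirst(strtolower((string) $language))
        . 'Stemmer';
}

// Values an unquoted YAML `no` may arrive as after parsing, plus the quoted string.
foreach ([false, 0, 'no'] as $value) {
    echo var_export($value, true), ' => ', stemmerClassFor($value), PHP_EOL;
}
```

Only the quoted string 'no' produces the NoStemmer class name in this sketch; the boolean and integer values produce names that do not point at NoStemmer.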
Interesting indeed. Could be related to https://github.com/trilbymedia/grav-plugin-tntsearch/pull/116 which is still waiting for merge, unfortunately. Also check https://github.com/teamtnt/tntsearch/pull/243/files . Not sure which Grav version you are using and how up-to-date TNTSearch library it includes.
I can also confirm, on v3.3.1, Grav v1.7.18. Stemmer does make a difference, but disabling it doesn't fix the problem.
fuzzy is false.
With stemmer set to English (porter):
  enc: finds encode, doesn't find encryption and encrypted
  encr: finds encryption and encrypted
  encry: doesn't find anything
  encryp and encrypt: find both encryption and encrypted
  encrypte and encrypted: find encrypted
  encrypti and encryptio: don't find anything
  encryption: finds encryption
When it finds something, it finds both posts.
With stemmer set to 'no' or default:
  encrypt: finds encryption and encrypted, but only one article
  encrypti to encryption: finds encryption in two articles
I don't know if I'm setting something wrong, but this is too unreliable.
@bgdnlp try with patches in https://github.com/teamtnt/tntsearch/pull/243/files and #116
This fixed my problem. Thanks.
I've updated to the latest Grav 1.7.23 and applied the changes noted in https://github.com/trilbymedia/grav-plugin-tntsearch/issues/114#issuecomment-890494951 but I'm still not getting the desired results.
Test case is a search for "spk", which should return "spk1000" and "spk7457", but only the first appears:
A search for "spk7" returns "spk7457", which should also appear in the previous search:
I don't believe I've missed anything, but here is a diff showing the changes I've applied:
diff --git a/user/config/plugins/tntsearch.yaml b/user/config/plugins/tntsearch.yaml
index a1ea9789..05a15902 100644
--- a/user/config/plugins/tntsearch.yaml
+++ b/user/config/plugins/tntsearch.yaml
@@ -8,7 +8,7 @@ enable_admin_page_events: true
search_type: auto
fuzzy: false
phrases: true
-stemmer: 'default'
+stemmer: 'no'
display_route: true
display_hits: true
display_time: true
diff --git a/user/plugins/tntsearch/classes/GravTNTSearch.php b/user/plugins/tntsearch/classes/GravTNTSearch.php
index f5a1082d..9e9a75ac 100644
--- a/user/plugins/tntsearch/classes/GravTNTSearch.php
+++ b/user/plugins/tntsearch/classes/GravTNTSearch.php
@@ -42,7 +42,7 @@ class GravTNTSearch
$locator = Grav::instance()['locator'];
$search_type = $config->get('plugins.tntsearch.search_type', 'auto');
- $stemmer = $config->get('plugins.tntsearch.stemmer', 'default');
+ $stemmer = $config->get('plugins.tntsearch.stemmer', 'no');
$limit = $config->get('plugins.tntsearch.limit', 20);
$snippet = $config->get('plugins.tntsearch.snippet', 300);
$data_path = $locator->findResource('user://data', true) . '/tntsearch';
@@ -225,8 +225,10 @@ class GravTNTSearch
$this->tnt->setDatabaseHandle(new GravConnector);
$indexer = $this->tnt->createIndex($this->index);
- // Set the stemmer language if set
- if ($this->options['stemmer'] !== 'default') {
+ // Disable stemmer for users with older configuration.
+ if ($this->options['stemmer'] == 'default') {
+ $indexer->setLanguage('no');
+ } else {
$indexer->setLanguage($this->options['stemmer']);
}
@@ -340,4 +342,4 @@ class GravTNTSearch
return $fields;
}
-}
+}
\ No newline at end of file
diff --git a/user/plugins/tntsearch/vendor/teamtnt/tntsearch/src/Classifier/TNTClassifier.php b/user/plugins/tntsearch/vendor/teamtnt/tntsearch/src/Classifier/TNTClassifier.php
index b96e6dd1..50096aad 100644
--- a/user/plugins/tntsearch/vendor/teamtnt/tntsearch/src/Classifier/TNTClassifier.php
+++ b/user/plugins/tntsearch/vendor/teamtnt/tntsearch/src/Classifier/TNTClassifier.php
@@ -2,7 +2,7 @@
namespace TeamTNT\TNTSearch\Classifier;
-use TeamTNT\TNTSearch\Stemmer\PorterStemmer;
+use TeamTNT\TNTSearch\Stemmer\NoStemmer;
use TeamTNT\TNTSearch\Support\Tokenizer;
class TNTClassifier
@@ -18,7 +18,7 @@ class TNTClassifier
public function __construct()
{
$this->tokenizer = new Tokenizer;
- $this->stemmer = new PorterStemmer;
+ $this->stemmer = new NoStemmer;
}
public function predict($statement)
@@ -128,4 +128,4 @@ class TNTClassifier
$this->tokenizer = $classifier->tokenizer;
$this->stemmer = $classifier->stemmer;
}
-}
+}
\ No newline at end of file
diff --git a/user/plugins/tntsearch/vendor/teamtnt/tntsearch/src/Indexer/TNTIndexer.php b/user/plugins/tntsearch/vendor/teamtnt/tntsearch/src/Indexer/TNTIndexer.php
index 1742d3ae..8182d4aa 100644
--- a/user/plugins/tntsearch/vendor/teamtnt/tntsearch/src/Indexer/TNTIndexer.php
+++ b/user/plugins/tntsearch/vendor/teamtnt/tntsearch/src/Indexer/TNTIndexer.php
@@ -13,7 +13,7 @@ use TeamTNT\TNTSearch\Connectors\SQLiteConnector;
use TeamTNT\TNTSearch\Connectors\SqlServerConnector;
use TeamTNT\TNTSearch\FileReaders\TextFileReader;
use TeamTNT\TNTSearch\Stemmer\CroatianStemmer;
-use TeamTNT\TNTSearch\Stemmer\PorterStemmer;
+use TeamTNT\TNTSearch\Stemmer\NoStemmer;
use TeamTNT\TNTSearch\Support\Collection;
use TeamTNT\TNTSearch\Support\Tokenizer;
use TeamTNT\TNTSearch\Support\TokenizerInterface;
@@ -41,7 +41,7 @@ class TNTIndexer
public function __construct()
{
- $this->stemmer = new PorterStemmer;
+ $this->stemmer = new NoStemmer;
$this->tokenizer = new Tokenizer;
$this->filereader = new TextFileReader;
}
@@ -71,7 +71,7 @@ class TNTIndexer
if (!isset($this->config['driver'])) {
$this->config['driver'] = "";
}
-
+
if (!isset($this->config['wal'])) {
$this->config['wal'] = true;
}
@@ -131,9 +131,9 @@ class TNTIndexer
}
/**
- * @param string $language - one of: arabic, croatian, german, italian, porter, russian, ukrainian
+ * @param string $language - one of: no, arabic, croatian, german, italian, porter, portuguese, russian, ukrainian
*/
- public function setLanguage($language = 'porter')
+ public function setLanguage($language = 'no')
{
$class = 'TeamTNT\\TNTSearch\\Stemmer\\'.ucfirst(strtolower($language)).'Stemmer';
$this->setStemmer(new $class);
@@ -178,7 +178,7 @@ class TNTIndexer
$this->index = new PDO('sqlite:'.$this->config['storage'].$indexName);
$this->index->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
- if($this->config['wal']) {
+ if ($this->config['wal']) {
$this->index->exec("PRAGMA journal_mode=wal;");
}
@@ -306,7 +306,7 @@ class TNTIndexer
if ($counter % 10000 == 0) {
$this->index->commit();
$this->index->beginTransaction();
- $this->info("Commited");
+ $this->info("Committed");
}
}
$this->index->commit();
@@ -692,4 +692,4 @@ class TNTIndexer
echo $text.PHP_EOL;
}
}
-}
+}
\ No newline at end of file
@thekenshow your case is different from this issue. This issue deals with the stemmer, which operates only on normal words. If numbers are involved, you should create a separate issue ticket.
Ah, good to know, thanks. Filed a new issue.