Closed derry43 closed 1 year ago
After a bit of digging, this issue appears to be cause by lines 1438 to 1448 in src/KnowbaseItem.php which replaces all of the Boolean Full-Text Search (BOOLEAN MODE) operators with spaces, resulting in them being ignored.
// Replace all non word characters with spaces (see: https://stackoverflow.com/a/26537463)
$search_wilcard = preg_replace('/[^\p{L}\p{N}_]+/u', ' ', $search);
// Remove last space to avoid illegal syntax with " *"
$search_wilcard = trim($search_wilcard);
// Merge spaces since we are using them to split the string later
$search_wilcard = preg_replace('!\s+!', ' ', $search_wilcard);
$search_wilcard = explode(' ', $search_wilcard);
$search_wilcard = implode('* ', $search_wilcard) . '*';
I believe that this should be:
// Replace all non word/operator characters with spaces (see: https://stackoverflow.com/a/26537463)
$search_wilcard = preg_replace('/[^\p{L}\p{N}_)(~+\<>*"-]+/u', ' ', $search);
// Remove last space to avoid illegal syntax with " *"
$search_wilcard = trim($search_wilcard);
// Merge spaces since we are using them to split the string later
$search_wilcard = preg_replace('!\s+!', ' ', $search_wilcard);
$search_wilcard = explode(' ', $search_wilcard);
$search_wilcard = implode('* ', $search_wilcard);
Similar lines in 9.5.7 inc/knowbaseitem_class.php
can be updated to fix for current release.
We also need to account for url encoding for the operators in the pager code between lines 1685 and 1687 in src/KnowbaseItem.php
, where ...
$parameters = "start=" . $params["start"] . "&knowbaseitemcategories_id=" .
$params['knowbaseitemcategories_id'] . "&contains=" .
$params["contains"] . "&is_faq=" . $params['faq'];
becomes ...
$parameters = "start=" . $params["start"] . "&knowbaseitemcategories_id=" .
$params['knowbaseitemcategories_id'] . "&contains=" .
rawurlencode($params["contains"]) . "&is_faq=" . $params['faq'];
Again, Similar lines in 9.5.7 file inc/knowbaseitem_class.php
can be updated to fix for current release.
The proposed change seems to cause the initial issue to return that #9784 was trying to fix.
I also cannot get a query like +Convert -Windows
to work correctly as it shows an article named Convert Windows Server from Desktop to Core
.
There seems to be more wrong here. The resulting query seems to correctly select a "score" for the text search but then in the WHERE statement it looks for any records with the title or content containing Convert Windows
(to use my previous example). If using boolean mode operators, it shouldn't have those clauses and should enforce that only results with a score more than 0 are returned. If not using boolean operators, then the existing behavior is OK as the fulltext search score should only be relevant for the ORDER to show the most relevant articles first.
Took another look at this, and you are correct there is still an issue with how the boolean full-text search operators are handled. I have done a little more work on this and, although what I have produced does not handle correcting all syntax issues, hopefully it cleans up the majority of offending syntax and allows the search operators (including grouping and quoting) to be used successfully. As a bonus, this allows searching for other "special" characters, such as !
, as long as they are included in double quotes e.g. "!error"
.
Here is what I propose:
Original code:
// Replace all non word characters with spaces (see: https://stackoverflow.com/a/26537463)
$search_wilcard = preg_replace('/[^\p{L}\p{N}_]+/u', ' ', $search);
// Remove last space to avoid illegal syntax with " *"
$search_wilcard = trim($search_wilcard);
// Merge spaces since we are using them to split the string later
$search_wilcard = preg_replace('!\s+!', ' ', $search_wilcard);
$search_wilcard = explode(' ', $search_wilcard);
$search_wilcard = implode('* ', $search_wilcard) . '*';
New code:
// Replace all non word/operator characters with spaces (see: https://stackoverflow.com/a/26537463), ignore those inside quotes
$search_wilcard = preg_replace('/[^\p{L}\p{N}_)(~+\<>*\\\\"-]+(?=([^"]*"[^"]*")*[^"]*$)/u', ' ', $search);
// Remove escape character from quotes
$search_wilcard = preg_replace('![\\\\]["]!u', '"', $search_wilcard);
// Remove all isolated groups of operators, ignore those inside quotes
// i.e. those operators which are neither a prefix nor a suffix for a word, group or phrase
$search_wilcard = preg_replace('!(?<=\s)[~\<>*+-]+(?=\s)(?=([^"]*"[^"]*")*[^"]*$)!u', '', $search_wilcard);
// Remove all * operators immediately following parentheses or quoted terms, ignore those inside quotes
$search_wilcard = preg_replace('!(?<=[)("])[*](?=([^"]*"[^"]*")*[^"]*$)!u', '', $search_wilcard);
// Insert space between closing parentheses and immediately following operator, ignore those inside quotes
$search_wilcard = preg_replace('!(?<=[)])([~\<>+-])(?=([^"]*"[^"]*")*[^"]*$)!u', ' \1', $search_wilcard);
// Replace any * operators immediately following a space, ignore those inside quotes
$search_wilcard = preg_replace('!(?<=\s)[*](?=([^"]*"[^"]*")*[^"]*$)!u', ' ', $search_wilcard);
// Replace groups of ~, >, <, + or - operators with rightmost operator, ignore those inside quotes
$search_wilcard = preg_replace('![~\<>+-]+(?=[~\<>+-])(?=([^"]*"[^"]*")*[^"]*$)!u', '', $search_wilcard);
// Merge spaces, ignore those inside quotes
$search_wilcard = preg_replace('!\s+(?=([^"]*"[^"]*")*[^"]*$)!u', ' ', $search_wilcard);
The above handles the search term test* /* test* (* test3* )*
(initial fix attempt in #9784) and will effectively tidy this up to be test* test* ( test3* )
. It also handles quite a few other syntactically incorrect searches, such as ~+test
which becomes +test
, *test
which becomes test
, and *+
etc.
One problem still remains, handling of >
and <
characters in the search. The sanitization of >
generates this &#62;
which un-sanitizes as >
which is incorrect. The sanitization of <
generates this &#60;
which un-sanitizes as <
which is also incorrect. Looks like the code in src/Toolbox/Sanitizer.php
which handles the sanitization of the original search term appears to convert the >
to >
and then convert the &
of the >
to &
, resulting in something that is irreversible.
To detect and alert on syntax errors for boolean full-text search, the following function can be added to src/KnowbaseItem.php
and the showList
function updated at line 1688 to include a call to the checkBoolFtSyntax
function.
/**
* Check Boolean Full-Text Search Syntax
*
* @param $searchterm raw text version of search term e.g. "+word -search"
* @param $pos return the position of first syntax issue, default 0
**/
public static function checkBoolFtSyntax($searchterm, &$pos = 0)
{
$matches = null;
$returnValue = preg_match_all('/\s*([~+-]?\(\s*((?>(\s*[><~+-]?([A-Za-z_]+|"[^\\"]*(\\"[^\\"]*)*")[*]?(\s|$|(?=[)])))+)\s*|\s*(?R)\s*)*\s*\)|(?>(\s*[><~+-]?([A-Za-z_]+|"[^\\"]*(\\"[^\\"]*)*")[*]?(\s|$))+))\s*/', $searchterm, $matches);
foreach ($matches as $key => $value)
{
if (is_array($value)) {
foreach ($value as $ikey => $ivalue) {
$searchterm=str_replace($ivalue,str_repeat(" ",strlen($ivalue)),$searchterm);
}
} else {
$searchterm=str_replace($value,str_repeat(" ",strlen($value)),$searchterm);
}
}
if (preg_match('/\S/',$searchterm,$matches)) {
$pos=strpos($searchterm,ltrim($searchterm));
return true;
} else {
return false;
}
}
if (self::checkBoolFtSyntax(Sanitizer::unsanitize($params["contains"]), $errorpos)) {
$errorchar=substr($params["contains"],$errorpos,10);
echo '<div style="color:red;" class="d-flex justify-content-center">'.__('Search syntax error at ').$errorpos.", '".$errorchar."...'</div><br/><br/>";
return false;
}
Again, Similar lines in 9.5.7 file inc/knowbaseitem_class.php
can be updated to fix for current release.
Can you open a pull request with your changes?
Can you open a pull request with your changes?
Yes will try. Can you advise on the correct way to generate the error message for the search term syntax check above. Not sure if the method I used is really ideal and can't seem to use Session::addMessageAfterRedirect
as there is no-redirect after the check and so this does not show until after a further search.
addMessageAfterRedirect should work since searching the knowledgebase refreshes the page. The name of the function is a bit wrong as it adds a message that will get displayed the next time any page is loaded.
If you open a PR with all the changes, I can test and suggest changes.
Hello @derry43, were you able to combine your patches for a pull request? If not, could you let me know which changes from this thread as still relevant so we have a starting point?
Hello @derry43, were you able to combine your patches for a pull request? If not, could you let me know which changes from this thread as still relevant so we have a starting point?
Eventually managed to create a PR, not able to perform tests as I do not have a full grasp of that part of development here.
Fixed by #14301.
Code of Conduct
Is there an existing issue for this?
Version
10.0 RC3
Bug description
The Boolean Full-Text Search operators (+, -, ~, *, <, >, () and "") are ignored in the Knowledge base search and therefor do not work as per the documentation (doc excerpt below). A possible fix is provided in the additional comments made after raising this bug report.
My experience has shown that searching for articles with lines that must contain both the words "tribbles" and "with" by using the search expression
+tribbles +with
will return all articles containing either "with" or "tribbles", the same as simply using the search termtribbles with
. Effectively ignoring the+
operator.And searching for articles that must contain the literal sequence "with tribbles" by using the search expression
"with tribbles"
returns all articles containing either "with" or "tribbles", the same as simply using the search termtribbles with
. Effectively ignoring the"
quotes.Worse still, searching for articles that must contain "tribbles" but must not contain "kirk" by using the search expression
+tribbles -kirk
also returns all articles containing either "tribbles" or "kirk", the same as simply using the search termtribbles kirk
. Effectively ignoring the-
operator.Documentation on Knowledge Base Search Operators
Relevant log output
No response
Page URL
front/knowbaseitem.php
Steps To reproduce
+tribbles +with
"with tribbles"
+tribbles -kirk
Your GLPI setup information
Information about system installation and configuration
Server
GLPI constants
Libraries
LDAP directories
SQL replicas
Notifications
Plugins list
Anything else?
No response