ihsn / nada

National Data Archive (NADA) is an open source data cataloging system that serves as a portal for researchers to browse, search, compare, apply for access, and download relevant census or survey information. It was originally developed to support the establishment of national survey data archives.
http://nada.ihsn.org
MIT License
38 stars 10 forks source link

Boolean OR operator in catalog search (MySQL) #78

Open RogerGee opened 1 year ago

RogerGee commented 1 year ago

Hello!

I'd like to ask about the catalog search functionality, specifically regarding Boolean full-text searches using MySQL. In Catalog_search_mysql.php, it appears that keywords are being fixed so that the default behavior is a conjunction of search terms. Since a disjunction of search terms (i.e. operator OR) is the default in MySQL fulltext search, the keyword modifications make it syntactically impossible to build an expression with Boolean OR functionality.

My question is whether I'm parsing the behavior correctly and whether there is a capability to search with OR that I may be missing. Thank you for any insights.

mah0001 commented 1 year ago

@RogerGee - I am sorry it has taken so long to reply. The full-text search in NADA does not support OR, it always performs an AND query combining all keywords as you have guessed. We will add support for OR in the next release of NADA.

For a better full-text search, NADA supports Solr (https://solr.apache.org/). Solr offers a much better search and lots of options to configure and tune the search according to your needs.

RogerGee commented 1 year ago

@mah0001 Thank you very much for your response. I will look into Solr integration and appreciate the suggestion.

Kindly let me know if you'd like me to close this issue or leave it open pending support for operator OR in the next release of NADA.

RogerGee commented 1 year ago

I configured an instance of NADA to connect to a Solr server. After getting everything indexed in Solr, I tried a few searches with Solr's Boolean operator syntax. The syntax didn't work. It appears, as with MySQL FULLTEXT, that the operators are getting stripped out of the search string. (See this snippet of the relevant source code that escapes terms.)

Please let me know if you have any recommendations or if I am missing something obvious with Solr. Otherwise, I will wait on future releases of NADA that may support more complex search expressions. Thank you.

mah0001 commented 1 year ago

@RogerGee - The terms are not stripped, the method $this->escapeTerm escapes special characters such as spaces to be passed to SOLR. Without escaping, search queries will fail.

Try this: 1 - Enable debug mode for SOLR so you can see the actual queries run by SOLR. Edit application/config/solr and change the value for the setting 'solr_debug' to 'true':

$config['solr_debug']=true;

2 - Now try a search on the catalog search page, it will print the raw query executed by SOLR. You should be able to see if anything has been removed.

3- Copy the raw query and try directly in SOLR UI to see if you get same results.

Change mm (minimum match) options (for more info on mm, see https://solr.apache.org/guide/7_7/the-extended-dismax-query-parser.html).

4 - In the 'application/config/solr.php' file, change the value for 'mm' to '0%'.

'mm'='0%',

Here is screenshot with the Debug mode enabled:

image
RogerGee commented 12 months ago

@mah0001 Thanks so much for your detailed response. I'm not well-versed in Solr, so it was very helpful!

I walked through the steps you outlined and noticed something peculiar. When I execute a query similar to yours (i.e. using the Boolean operator NOT), the spaces in my query are escaped whereas in your query they remain unescaped. When I run the query (with escaped spaces) in the Solr UI, I get the same unexpected results as in NADA. If I take out the backslashes, the results are correct. I presume this is because an escaped space is not treated as a delimiter for parsing a Boolean operator.

For example, when I search oklahoma NOT health, the q query parameter gets set to oklahoma\ NOT\ health. When you ran population NOT albania, the query parameter was set correctly to population NOT albania (i.e. without backslashes). I'm not sure why it would work one way for you and differently for me. See screenshot below.

Screen Shot 2023-11-14 at 11 24 09 AM

(As you can see, the first search result contains the word health, which is incorrect.)

I verified that this is happening due to Solarium\Core\Query\Helper::escapeTerm. When I run the snippet of code below, I get the value with the escaped spaces:

<?php

require_once 'vendor/autoload.php';

$config = [
    'endpoint' => [
        'localhost' => [
            'host' => '127.0.0.1',
            'port' => 8983,
            'path' => '/',
            'core' => 'nada',
        ],
    ],
];

$adapter = new Solarium\Core\Client\Adapter\Curl;
$event_dispatcher = new Symfony\Component\EventDispatcher\EventDispatcher;
$solr_client =  new Solarium\Client($adapter,$event_dispatcher,$config);

$query = $solr_client->createSelect();
$helper = $query->getHelper();

$search = 'oklahoma NOT state';
echo $helper->escapeTerm($search) . PHP_EOL;

$search = 'population NOT albania';
echo $helper->escapeTerm($search) . PHP_EOL;

The result of running this code in my environment is:

oklahoma\ NOT\ state
population\ NOT\ albania

I'm using the version of Solarium that is bundled with NADA's codebase (i.e. 6.2.7).

It seems that escapeTerm is designed to escape a single term. I don't know if it is supposed to be used to prepare the entire search string. I tried to alter the implementation in Catalog_search_solr.php to parse out the terms and escape each individually. This is pretty rudimentary and doesn't handle the case where you'd want a space included in a search term.

$query_keywords = preg_split('/\s+/',$this->study_keywords);
array_walk($query_keywords,function (string &$term) use($helper) {
    $BOOLEAN_OPS = ['AND','OR','NOT'];
    if (!in_array(strtoupper($term),$BOOLEAN_OPS)) {
        $term = $helper->escapeTerm($term);
    }
});
$query->setQuery(implode(' ',$query_keywords));

I'll keep looking into this to see if there's possibly some quirk of my environment that explains this. I appreciate any comments you might have. Thanks!

mah0001 commented 12 months ago

@RogerGee Excellent debugging. You are right, escapeTerms usage was incorrect. For our NADA instance with Solr, we are not escaping the terms at all. I have pushed the changes to nada-5.3 branch. This branch has lots of other changes, it might be easier to replace your Catalog_search_solr.php with this: (https://github.com/ihsn/nada/blob/nada-5.3/application/libraries/Catalog_search_solr.php)

With the escapeTerms removed, you should be able to get the correct results for Boolean queries and you can also use various other search options (https://solr.apache.org/guide/6_6/the-standard-query-parser.html):

Here is an example of searching using specific fields:

title:"country survey" AND years:2021

Let me know if you notice any issues.