manticoresoftware / manticoresearch

Easy to use open source fast database for search | Good alternative to Elasticsearch now | Drop-in replacement for E in the ELK soon
https://manticoresearch.com
GNU General Public License v3.0
9.05k stars 507 forks source link

Negative Assertion Full-Text Operator #4

Closed barryhunter closed 6 years ago

barryhunter commented 7 years ago

Match a keyword, but NOT when preceded/followed by some word. Not sure exact syntax, perhaps "Church -Street" (in the quotes). Would match all instances of Church, but NOT one in "Church Street".

Query: "Church -Street" Doc1: St. Mary's Church - Match (has Church) Doc2: Cars on Church Street - NOT Match (has "Church Street") Doc3: St. Mary's Church at end of Church Street - Match (has Church on its own)

The last document exemplifies why Church -"Church Street" almost but not quite works. The negative term would exclude the result despite the other Church match.

Alternatively it could be a 'Not Near' operator somehow. As in Church NOT NEAR/2 Street

airolg commented 7 years ago

Thank you for issue, I've put it to our backlog.

tomatolog commented 6 years ago

to make things clear - this will be only 2 keywords operator?

I mean in case of phrase operator modification 3 and more keywords phrase with negative assertion "Church -Street saint" set uncertain for keyword positions. NOT NEAR seems more solid here, however I'd prefer 1 token operator - maybe NEAR/NOT ?

or you need to set distance too in case NEAR will be base of this implementation?

There is also option to fix NEAR to work with following NOT term.

barryhunter commented 6 years ago

Well, "Church -Street saint" doesnt seem particularly useful, but it would be nice if worked. But ulitimate guess its kinda abigiuous. Would it be match Church saint or only Church * saint, like Church box saint (although its clear it shouldnt match Church Street saint :)

But could want multiple word phrases like "-under road bridge" (road bridges, but not under such bridges ;)

NEAR/NOT would be fine as such, franky not expecting users to type these queries

To clarify the usecase, want to create a search function, that shows 'common' phrases around the term, and then allow the user to exclude them. eg a plain keyword search for 'Church' includes photos of actual church (the building), but also contains photos mentioning 'Church Street' (not churches) or even things like 'near St Marys Church"

But also a consideration, how can 'chain' multiple assertions, like

 church NOT/NEAR street NOT/NEAR inside

ie dont want to to match "church streeet", nor "inside church" but other church matches are fine. Suppose

church NOT/NEAR (street | inside | road)

gives flexiblity...

"road bridge" NOT/NEAR under

possibly? NOT/NEAR is also a slight issue, that it not 'strict order', so "street church" also excluded, but overall not too concerned.

tomatolog commented 6 years ago

I've look at NEAR with following negative term deprecated for now and see the Church NEAR/1 -Street or Church NEAR/3 -Street might work well in case it got implemented

it also could work for complex things like "church * saint" NEAR/1 -street - ie church saint phrase with atom delta=2 and NOT street with atom delta=1 from atom position of NEAR left node. As NEAR allows complex nodes not just simple terms.

However still not clear do you need to build complex phrase and exclude single term at particular position at that phrase?

barryhunter commented 6 years ago

Church NEAR/1 -Street or Church NEAR/3 -Street might work well in case it got implemented

thats possibly not the most intuitive, but if that was implemented think would be happy.

However still not clear do you need to build complex phrase and exclude single term at particular position at that phrase?

Not really. See it most useful to exclude words before/after a given phrase, rather than in the middle of a phrase.

Although

church NEAR -"near st *"

could be useful as in Church but not "near st marys church" nor "near st georges church"

but maybe

 church NEAR/3 -near

woudl be enough

tomatolog commented 6 years ago

I've just pushed this feature to master branch. That is new full-text operator NOTNEAR works as Church NOTNEAR/2 street

Left argument of operator is MUST term, operator or group of operators that should be exists at document to match.

Right argument of operator is NOT term, operator or group of operators. There is no need to add minus to right argument.

You have to set positive non zero distance for operator like NOTNEAR/dist.

ie document a b c d e g f i k

queries 'a NOTNEAR/1 b' - not mach, as positions a+1==b 'a NOTNEAR/1 c' - mach, as positions a+1<c 'a NOTNEAR/2 c' - not mach, as positions a+2==c

complex queries '"a b" NOTNEAR/1 c' - not mach, as positions b+1==c '"a b c" NOTNEAR/3 f' - mach, as positions c+3<f

I've added description of this operator to Extended query syntax topic of documentation

You also might find examples at 349 test at source code.

barryhunter commented 6 years ago

Only just got round to testing this. Not sure if working

sphinxQL>select * from tags where match('@title church NOTNEAR/1 street') and user_id = 86708 limit 10; show meta; show plan;
+---------+---------+---------------------------------------------------+--------------------------+
| id      | user_id | title                                             | tags                     |
+---------+---------+---------------------------------------------------+--------------------------+
| 5599363 |   86708 | Church Street, Sutton on Hull, Kingston upon Hull | 891,919,1040,3274,150437 |
+---------+---------+---------------------------------------------------+--------------------------+
1 row in set (0.00 sec)

+---------------+--------+
| Variable_name | Value  |
+---------------+--------+
| total         | 1      |
| total_found   | 1      |
| time          | 0.001  |
| keyword[0]    | church |
| docs[0]       | 407    |
| hits[0]       | 411    |
| keyword[1]    | street |
| docs[1]       | 549    |
| hits[1]       | 555    |
+---------------+--------+
9 rows in set (0.00 sec)

+------------------+-------------------------------------------------------------------------------------------------------------------+
| Variable         | Value                                                                                                             |
+------------------+-------------------------------------------------------------------------------------------------------------------+
| transformed_tree | NOTNEAR(
  AND(fields=(title), KEYWORD(church, querypos=1)),
  AND(fields=(title), KEYWORD(street, querypos=2))) |
+------------------+-------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

Server version: 2.6.2 5fcf016@180207 dev

Its doing something, because

sphinxQL>select * from tags where match('@title church NEAR/1 street') and user_id = 86708 limit 10; show meta; show plan;
+---------+---------+---------------------------------------------------+--------------------------------------+
| id      | user_id | title                                             | tags                                 |
+---------+---------+---------------------------------------------------+--------------------------------------+
| 5595804 |   86708 | Church Street, Sutton on Hull, Kingston upon Hull | 868,891,1025,1279,3274,150437,180819 |
| 5597549 |   86708 | Church Street, Sutton on Hull, Kingston upon Hull | 891,919,1040,3274,150437             |
| 5597554 |   86708 | Church Street, Sutton on Hull, Kingston upon Hull | 891,919,1040,150437                  |
| 5597564 |   86708 | Church Street, Sutton on Hull, Kingston upon Hull | 891,919,1040,3274,150437             |
| 5597568 |   86708 | Church Street, Sutton on Hull, Kingston upon Hull | 891,919,1040,3274,150437             |
| 5597573 |   86708 | Church Street, Sutton on Hull, Kingston upon Hull | 891,919,1040,3274,150437             |
| 5597575 |   86708 | Church Street, Sutton on Hull, Kingston upon Hull | 891,919,1040,3274,150437             |
| 5597580 |   86708 | Church Street, Sutton on Hull, Kingston upon Hull | 786,891,919,1040,3274,150437         |
| 5597583 |   86708 | Church Street, Sutton on Hull, Kingston upon Hull | 891,919,1040,3274,150437             |
| 5598056 |   86708 | Church Street, Sutton on Hull, Kingston upon Hull | 786,891,919,1040,3274,150437         |
+---------+---------+---------------------------------------------------+--------------------------------------+
10 rows in set (0.01 sec)

+---------------+--------+
| Variable_name | Value  |
+---------------+--------+
| total         | 13     |
tomatolog commented 6 years ago

seems related to data and might be due tail hits as I made simple test

source src1
{
    type            = mysql

    sql_host        = localhost
    sql_user        = test
    sql_pass        =
    sql_db          = test
    sql_port        = 3306  # optional, default is 3306

    sql_query_pre = SET NAMES utf8
    sql_query       = SELECT 1 id, 11 idd, 'Church Street, Sutton on Hull, Kingston upon Hull' title, 'dudes' pex UNION SELECT 2 id, 12 idd, 'Church on the Street, Sutton on Hull, Kingston upon Hull' title, 'dudes' pex
    sql_attr_uint       = idd
    sql_field_string = title
}

index im1
{
    source          = src1
    path            = data/im1
}

and got correct result set

mysql> select * from im1 where match('@title church NOTNEAR/1 street');
+------+------+----------------------------------------------------------+
| id   | idd  | title                                                    |
+------+------+----------------------------------------------------------+
|    2 |   12 | Church on the Street, Sutton on Hull, Kingston upon Hull |
+------+------+----------------------------------------------------------+
1 row in set (0.00 sec)

mysql> select * from im1 where match('@title church NOTNEAR/3 street');
Empty set (0.01 sec)

Going to create index with 400 - 500 docs and hits and check there. However it helps in case you provide your data or reproducible case. You might upload your index files to our write only FTP

ftp: 88.99.251.227
user: manticorebugs
pass: shithappens
barryhunter commented 6 years ago

Not sure what tail hits are. Thats the only field in the index (it was a test index to test Group by N mva.

That FTP doesnt seem to work?

NcFTP 3.2.4 (Apr 07, 2010) by Mike Gleason (http://www.NcFTP.com/contact/).
Connecting to 88.99.251.227...
-02-09 11:15:16,977 dev proftpd[1845]: error: no valid servers configured
Unexpected response: -02-09 11:15:16,977 dev proftpd[1845]: fatal: error processing configuration file '/etc/proftpd/proftpd.conf'

dont have proftp on my dev server, so that must be coming from target.

The data is all 'public' anyway (from www.geograph.org.uk), so can just share it here: http://data.geograph.org.uk/manticore-test4.tar.gz (index and conf file)

Thanks for contining to look at this

tomatolog commented 6 years ago

I've reproduced issue with data your provided and going to fix it.

Do you have more different queries for tags index with NOTNEAR operator to test more cases?

barryhunter commented 6 years ago

Another contrived example:

sphinxQL>select id,title,user_id from tags where match('beach NOTNEAR/1 huts') and user_id IN (1601,34311) limit 10;
+---------+---------------------------------+---------+
| id      | title                           | user_id |
+---------+---------------------------------+---------+
| 5598322 | Beer Beach and Beach Huts       |   34311 |  -- CORRECT 
| 5598341 | Beer Beach and cliffs           |   34311 |
| 5598389 | Charmouth Beach                 |   34311 |
| 5602561 | Beach Hut Hire, Sandown         |    1601 |  -- NOT Correct
| 5602657 | Pebble Beach Near Horseshoe Bay |    1601 |
| 5602959 | Lion and Cowes Beach            |    1601 |
| 5603007 | Steps to Gurnard Beach          |    1601 |
| 5603011 | Gurnard Beach                   |    1601 |
+---------+---------------------------------+---------+

sphinxQL>select id,title,user_id from tags where match('beach NEAR/1 huts') and user_id IN (1601,34311) limit 10;
+---------+---------------------------+---------+
| id      | title                     | user_id |
+---------+---------------------------+---------+
| 5598322 | Beer Beach and Beach Huts |   34311 |
| 5602561 | Beach Hut Hire, Sandown   |    1601 |
| 5602564 | Beach Huts                |    1601 |
+---------+---------------------------+---------+

Has partly worked, as say 5602564 (Beach Huts) has been excluded from the first query, but 5602561 is still there. The idnex has stem_en to deal with hut/huts (5598322 is correct in the first query, as has standalone beach too)

Also this:

sphinxQL>select id,user_id,title from tags where match('"road junction" (Road NOTNEAR/1 junction)') limit 30;
+---------+---------+------------------------------------------------------+
| id      | user_id | title                                                |
+---------+---------+------------------------------------------------------+
| 5598445 |   25416 | A573 Warrington Road at Cemetery Road junction       |
| 5601651 |    7056 | The Tullymore Road junction on the Ballynahinch Road |
| 5599012 |    9905 | Road junction with a view                            |
| 5599263 |    9905 | Road junction by Two Brewers                         |
| 5601442 |   42541 | Road junction in Maugersbury                         |
| 5602248 |   43729 | Road Junction between Tewitfield and Priest Hutton   |
| 5603251 |    7629 | Mowtie road junction                                 |
| 5605301 |    8800 | Road Junction at Cobbler's Plain                     |
+---------+---------+------------------------------------------------------+
8 rows in set (0.00 sec)

seems to show not working, if functional, that should return only 5598445.

sphinxQL>select id,user_id,title from tags where match('(Road NOTNEAR/1 junction)') and user_id IN (25416,42541) limit 30;
+---------+---------+---------------------------------------------------+
| id      | user_id | title                                             |
+---------+---------+---------------------------------------------------+
| 5598445 |   25416 | A573 Warrington Road at Cemetery Road junction    |  -- CORRECT
| 5598201 |   25416 | Coach Road entering Knowsley Estate from the A580 |
| 5598238 |   25416 | Carr Mill Road, Billinge                          |
| 5598243 |   25416 | Newton Road near Lower Billinge Lane Farm         |
| 5598254 |   25416 | Billinge Road nearing Smock Lane junction         |
| 5598263 |   25416 | Downall Green Road crossing the M6                |
| 5598317 |   25416 | Whitecroft Road/Bleaklow Close junction           |  -- CORRECT
| 5598457 |   25416 | A573 Warrington Road at Texaco filling station    |
| 5601442 |   42541 | Road junction in Maugersbury                      |  --NOT Correct
+---------+---------+---------------------------------------------------+
tomatolog commented 6 years ago

I've just fixed matching issue at 208c6908 at checked additional queries you provided produce correct result sets. However just in case you get another regression please provide index along with queries or simplified example that reproduce case locally. As matching issues depend on docs\hits order for terms in query. That is why it's hard to start investigation without full data that reproduces issue.

FTP also got fixed now.

barryhunter commented 6 years ago

Thanks. Seems to be working. tried various other queries and so far seems to be working :)