Closed barryhunter closed 6 years ago
Thank you for issue, I've put it to our backlog.
to make things clear - this will be only 2 keywords operator?
I mean in case of phrase operator modification 3 and more keywords phrase with negative assertion "Church -Street saint"
set uncertain for keyword positions.
NOT NEAR
seems more solid here, however I'd prefer 1 token operator - maybe NEAR/NOT
?
or you need to set distance too in case NEAR
will be base of this implementation?
There is also option to fix NEAR
to work with following NOT
term.
Well, "Church -Street saint" doesnt seem particularly useful, but it would be nice if worked. But ulitimate guess its kinda abigiuous. Would it be match Church saint
or only Church * saint
, like Church box saint
(although its clear it shouldnt match Church Street saint
:)
But could want multiple word phrases like "-under road bridge" (road bridges, but not under such bridges ;)
NEAR/NOT would be fine as such, franky not expecting users to type these queries
To clarify the usecase, want to create a search function, that shows 'common' phrases around the term, and then allow the user to exclude them. eg a plain keyword search for 'Church' includes photos of actual church (the building), but also contains photos mentioning 'Church Street' (not churches) or even things like 'near St Marys Church"
But also a consideration, how can 'chain' multiple assertions, like
church NOT/NEAR street NOT/NEAR inside
ie dont want to to match "church streeet", nor "inside church" but other church
matches are fine. Suppose
church NOT/NEAR (street | inside | road)
gives flexiblity...
"road bridge" NOT/NEAR under
possibly? NOT/NEAR is also a slight issue, that it not 'strict order', so "street church" also excluded, but overall not too concerned.
I've look at NEAR with following negative term deprecated for now and see the
Church NEAR/1 -Street
or Church NEAR/3 -Street
might work well in case it got implemented
it also could work for complex things like
"church * saint" NEAR/1 -street
- ie
church saint
phrase with atom delta=2 and NOT street with atom delta=1 from atom position of NEAR left node. As NEAR
allows complex nodes not just simple terms.
However still not clear do you need to build complex phrase and exclude single term at particular position at that phrase?
Church NEAR/1 -Street or Church NEAR/3 -Street might work well in case it got implemented
thats possibly not the most intuitive, but if that was implemented think would be happy.
However still not clear do you need to build complex phrase and exclude single term at particular position at that phrase?
Not really. See it most useful to exclude words before/after a given phrase, rather than in the middle of a phrase.
Although
church NEAR -"near st *"
could be useful as in Church but not "near st marys church" nor "near st georges church"
but maybe
church NEAR/3 -near
woudl be enough
I've just pushed this feature to master branch. That is new full-text operator NOTNEAR
works as
Church NOTNEAR/2 street
Left argument of operator is MUST
term, operator or group of operators that should be exists at document to match.
Right argument of operator is NOT term, operator or group of operators. There is no need to add minus to right argument.
You have to set positive non zero distance for operator like NOTNEAR/dist
.
ie document a b c d e g f i k
queries 'a NOTNEAR/1 b' - not mach, as positions a+1==b 'a NOTNEAR/1 c' - mach, as positions a+1<c 'a NOTNEAR/2 c' - not mach, as positions a+2==c
complex queries '"a b" NOTNEAR/1 c' - not mach, as positions b+1==c '"a b c" NOTNEAR/3 f' - mach, as positions c+3<f
I've added description of this operator to Extended query syntax
topic of documentation
You also might find examples at 349 test at source code.
Only just got round to testing this. Not sure if working
sphinxQL>select * from tags where match('@title church NOTNEAR/1 street') and user_id = 86708 limit 10; show meta; show plan;
+---------+---------+---------------------------------------------------+--------------------------+
| id | user_id | title | tags |
+---------+---------+---------------------------------------------------+--------------------------+
| 5599363 | 86708 | Church Street, Sutton on Hull, Kingston upon Hull | 891,919,1040,3274,150437 |
+---------+---------+---------------------------------------------------+--------------------------+
1 row in set (0.00 sec)
+---------------+--------+
| Variable_name | Value |
+---------------+--------+
| total | 1 |
| total_found | 1 |
| time | 0.001 |
| keyword[0] | church |
| docs[0] | 407 |
| hits[0] | 411 |
| keyword[1] | street |
| docs[1] | 549 |
| hits[1] | 555 |
+---------------+--------+
9 rows in set (0.00 sec)
+------------------+-------------------------------------------------------------------------------------------------------------------+
| Variable | Value |
+------------------+-------------------------------------------------------------------------------------------------------------------+
| transformed_tree | NOTNEAR(
AND(fields=(title), KEYWORD(church, querypos=1)),
AND(fields=(title), KEYWORD(street, querypos=2))) |
+------------------+-------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)
Server version: 2.6.2 5fcf016@180207 dev
Its doing something, because
sphinxQL>select * from tags where match('@title church NEAR/1 street') and user_id = 86708 limit 10; show meta; show plan;
+---------+---------+---------------------------------------------------+--------------------------------------+
| id | user_id | title | tags |
+---------+---------+---------------------------------------------------+--------------------------------------+
| 5595804 | 86708 | Church Street, Sutton on Hull, Kingston upon Hull | 868,891,1025,1279,3274,150437,180819 |
| 5597549 | 86708 | Church Street, Sutton on Hull, Kingston upon Hull | 891,919,1040,3274,150437 |
| 5597554 | 86708 | Church Street, Sutton on Hull, Kingston upon Hull | 891,919,1040,150437 |
| 5597564 | 86708 | Church Street, Sutton on Hull, Kingston upon Hull | 891,919,1040,3274,150437 |
| 5597568 | 86708 | Church Street, Sutton on Hull, Kingston upon Hull | 891,919,1040,3274,150437 |
| 5597573 | 86708 | Church Street, Sutton on Hull, Kingston upon Hull | 891,919,1040,3274,150437 |
| 5597575 | 86708 | Church Street, Sutton on Hull, Kingston upon Hull | 891,919,1040,3274,150437 |
| 5597580 | 86708 | Church Street, Sutton on Hull, Kingston upon Hull | 786,891,919,1040,3274,150437 |
| 5597583 | 86708 | Church Street, Sutton on Hull, Kingston upon Hull | 891,919,1040,3274,150437 |
| 5598056 | 86708 | Church Street, Sutton on Hull, Kingston upon Hull | 786,891,919,1040,3274,150437 |
+---------+---------+---------------------------------------------------+--------------------------------------+
10 rows in set (0.01 sec)
+---------------+--------+
| Variable_name | Value |
+---------------+--------+
| total | 13 |
seems related to data and might be due tail hits as I made simple test
source src1
{
type = mysql
sql_host = localhost
sql_user = test
sql_pass =
sql_db = test
sql_port = 3306 # optional, default is 3306
sql_query_pre = SET NAMES utf8
sql_query = SELECT 1 id, 11 idd, 'Church Street, Sutton on Hull, Kingston upon Hull' title, 'dudes' pex UNION SELECT 2 id, 12 idd, 'Church on the Street, Sutton on Hull, Kingston upon Hull' title, 'dudes' pex
sql_attr_uint = idd
sql_field_string = title
}
index im1
{
source = src1
path = data/im1
}
and got correct result set
mysql> select * from im1 where match('@title church NOTNEAR/1 street');
+------+------+----------------------------------------------------------+
| id | idd | title |
+------+------+----------------------------------------------------------+
| 2 | 12 | Church on the Street, Sutton on Hull, Kingston upon Hull |
+------+------+----------------------------------------------------------+
1 row in set (0.00 sec)
mysql> select * from im1 where match('@title church NOTNEAR/3 street');
Empty set (0.01 sec)
Going to create index with 400 - 500 docs and hits and check there. However it helps in case you provide your data or reproducible case. You might upload your index files to our write only FTP
ftp: 88.99.251.227
user: manticorebugs
pass: shithappens
Not sure what tail hits are. Thats the only field in the index (it was a test index to test Group by N mva.
That FTP doesnt seem to work?
NcFTP 3.2.4 (Apr 07, 2010) by Mike Gleason (http://www.NcFTP.com/contact/).
Connecting to 88.99.251.227...
-02-09 11:15:16,977 dev proftpd[1845]: error: no valid servers configured
Unexpected response: -02-09 11:15:16,977 dev proftpd[1845]: fatal: error processing configuration file '/etc/proftpd/proftpd.conf'
dont have proftp on my dev server, so that must be coming from target.
The data is all 'public' anyway (from www.geograph.org.uk), so can just share it here: http://data.geograph.org.uk/manticore-test4.tar.gz (index and conf file)
Thanks for contining to look at this
I've reproduced issue with data your provided and going to fix it.
Do you have more different queries for tags
index with NOTNEAR
operator to test more cases?
Another contrived example:
sphinxQL>select id,title,user_id from tags where match('beach NOTNEAR/1 huts') and user_id IN (1601,34311) limit 10;
+---------+---------------------------------+---------+
| id | title | user_id |
+---------+---------------------------------+---------+
| 5598322 | Beer Beach and Beach Huts | 34311 | -- CORRECT
| 5598341 | Beer Beach and cliffs | 34311 |
| 5598389 | Charmouth Beach | 34311 |
| 5602561 | Beach Hut Hire, Sandown | 1601 | -- NOT Correct
| 5602657 | Pebble Beach Near Horseshoe Bay | 1601 |
| 5602959 | Lion and Cowes Beach | 1601 |
| 5603007 | Steps to Gurnard Beach | 1601 |
| 5603011 | Gurnard Beach | 1601 |
+---------+---------------------------------+---------+
sphinxQL>select id,title,user_id from tags where match('beach NEAR/1 huts') and user_id IN (1601,34311) limit 10;
+---------+---------------------------+---------+
| id | title | user_id |
+---------+---------------------------+---------+
| 5598322 | Beer Beach and Beach Huts | 34311 |
| 5602561 | Beach Hut Hire, Sandown | 1601 |
| 5602564 | Beach Huts | 1601 |
+---------+---------------------------+---------+
Has partly worked, as say 5602564 (Beach Huts) has been excluded from the first query, but 5602561 is still there. The idnex has stem_en to deal with hut/huts (5598322 is correct in the first query, as has standalone beach too)
Also this:
sphinxQL>select id,user_id,title from tags where match('"road junction" (Road NOTNEAR/1 junction)') limit 30;
+---------+---------+------------------------------------------------------+
| id | user_id | title |
+---------+---------+------------------------------------------------------+
| 5598445 | 25416 | A573 Warrington Road at Cemetery Road junction |
| 5601651 | 7056 | The Tullymore Road junction on the Ballynahinch Road |
| 5599012 | 9905 | Road junction with a view |
| 5599263 | 9905 | Road junction by Two Brewers |
| 5601442 | 42541 | Road junction in Maugersbury |
| 5602248 | 43729 | Road Junction between Tewitfield and Priest Hutton |
| 5603251 | 7629 | Mowtie road junction |
| 5605301 | 8800 | Road Junction at Cobbler's Plain |
+---------+---------+------------------------------------------------------+
8 rows in set (0.00 sec)
seems to show not working, if functional, that should return only 5598445.
sphinxQL>select id,user_id,title from tags where match('(Road NOTNEAR/1 junction)') and user_id IN (25416,42541) limit 30;
+---------+---------+---------------------------------------------------+
| id | user_id | title |
+---------+---------+---------------------------------------------------+
| 5598445 | 25416 | A573 Warrington Road at Cemetery Road junction | -- CORRECT
| 5598201 | 25416 | Coach Road entering Knowsley Estate from the A580 |
| 5598238 | 25416 | Carr Mill Road, Billinge |
| 5598243 | 25416 | Newton Road near Lower Billinge Lane Farm |
| 5598254 | 25416 | Billinge Road nearing Smock Lane junction |
| 5598263 | 25416 | Downall Green Road crossing the M6 |
| 5598317 | 25416 | Whitecroft Road/Bleaklow Close junction | -- CORRECT
| 5598457 | 25416 | A573 Warrington Road at Texaco filling station |
| 5601442 | 42541 | Road junction in Maugersbury | --NOT Correct
+---------+---------+---------------------------------------------------+
I've just fixed matching issue at 208c6908 at checked additional queries you provided produce correct result sets. However just in case you get another regression please provide index along with queries or simplified example that reproduce case locally. As matching issues depend on docs\hits order for terms in query. That is why it's hard to start investigation without full data that reproduces issue.
FTP also got fixed now.
Thanks. Seems to be working. tried various other queries and so far seems to be working :)
Match a keyword, but NOT when preceded/followed by some word. Not sure exact syntax, perhaps
"Church -Street"
(in the quotes). Would match all instances of Church, but NOT one in "Church Street".Query:
"Church -Street"
Doc1:St. Mary's Church
- Match (has Church) Doc2:Cars on Church Street
- NOT Match (has "Church Street") Doc3:St. Mary's Church at end of Church Street
- Match (has Church on its own)The last document exemplifies why
Church -"Church Street"
almost but not quite works. The negative term would exclude the result despite the other Church match.Alternatively it could be a 'Not Near' operator somehow. As in
Church NOT NEAR/2 Street