ufal / lindat-kontext

An alternative web front-end for the Manatee corpus search engine
GNU General Public License v2.0

meet operator behaving strangely #164

Closed Ansa211 closed 6 years ago

Ansa211 commented 6 years ago

This issue is very confusing to me; any explanation would be welcome.

(meet 1:[mwe_id="(.*;.*)"] 2:[] -5 -1) & 1.mwe_id=2.mwe_id within <s/> has 3 results, and as expected, all three are to the left of the main word because meet has parameters -5 -1

(meet 1:[mwe_id="(.*;.*)"] 2:[] 1 5) & 1.mwe_id=2.mwe_id within <s/> has 4 results and as expected, all 4 are to the right of the KWIC word because meet has parameters 1 5

Question 1: the only condition on nodes 1 and 2 is one of equality; in other words, the second query should match the same sentences as the first, but the two words should swap roles (in the second query, the left one of them should be KWIC and the right one should be in context). Why is it not so?

Question 2: (meet 1:[mwe_id="(.*;.*)"] 2:[] -5 5) & 1.mwe_id=2.mwe_id within <s/> should match both to the left and to the right of the KWIC word because of the parameters -5 5; however, it gives the same result as setting the parameters to -5 -1. Why?

Ansa211 commented 6 years ago

Same issue with simplified queries, although the output is harder to survey:

(meet 1:[mwe_id!="_"] 2:[mwe_id!="_"] 1 5) & 1.mwe_id=2.mwe_id & 1.word!=2.word within <s/> ---> apply negative filter (meet 1:[mwe_id!="_"] 2:[mwe_id!="_"] -5 -1) & 1.mwe_id=2.mwe_id & 1.word!=2.word within <s/>

Again, I expected that the second query matches exactly the same nodes as the first one but with roles swapped, so after the application of the negative filter, there should be nothing left; so why are there 22 lines in the output?

Ansa211 commented 6 years ago

I tested this on an instance of NoSketchEngine; the results are the same, so the problem must be in Manatee. I will report it to SketchEngine people.

Ansa211 commented 6 years ago

Answer from SketchEngine:

Dear Anna,

unfortunately I don't have any good news for you -- after much deliberation, the gurus told me that when label positions are ambiguous, the result is unspecified. Currently, only one of the possibilities is propagated through the evaluation tree. Only the position of the KWIC is what differentiates between different result rows.

Therefore, queries like this are not well-formed and should be avoided. The query can possibly be formulated in a different way or perhaps emulated using the filtering functionality on concordances.

Best Regards, Ondrej Herman

Sketch Engine Team


Previous communication

URL: https://the.sketchengine.co.uk/corpus/first?corpname=preloaded%2Fsusanne&reload=&iquery=&queryselector=cqlrow&lemma=&lpos=&phrase=&word=&wpos=&char=&cql=%28meet+1%3A%22his%22+2%3A%5B%5D+1+5%29+%261.word%3D2.word&default_attr=word&fc_lemword_window_type=both&fc_lemword_wsize=5&fc_lemword=&fc_lemword_type=all&fc_pos_window_type=both&fc_pos_wsize=5&fc_pos_type=all

I do not understand why this query has empty output, while http://ske.li/e6x has 18 results. My expectation was that this query matches exactly the same sentences, but with the first of the two words being the KWIC (instead of the second which is the KWIC in http://ske.li/e6).

I have described another example of a similar problem with the meet operator and global conditions at https://github.com/ufal/lindat-kontext/issues/164 . The same queries as mentioned there were tested in a NoSke instance on http://corpora.phil.hhu.de/bonito/parseme.cgi/first?corpname=parseme_de_a&reload=1&iquery=&queryselector=cqlrow&lemma=&phrase=&word=&char=&cql=&default_attr=word&fc_lemword_window_type=both&fc_lemword_wsize=5&fc_lemword=&fc_lemword_type=all, so I believe the unexpected behaviour is due to Manatee and not due to the front-end.

Ansa211 commented 6 years ago

At least the following two queries, in which the conditions on node 2 have been more fully specified, have the same number of results (30):

(meet 1:[mwe_id="(.*;.*)"] 2:[mwe_id="(.*;.*)"] -5 -1) & 1.mwe_id=2.mwe_id within <s/>

(meet 1:[mwe_id="(.*;.*)"] 2:[mwe_id="(.*;.*)"] 1 5) & 1.mwe_id=2.mwe_id within <s/>

But this version has 122 results, and in some of them only one node is highlighted: (meet 1:[mwe_id="(.*;.*)"] 2:[mwe_id="(.*;.*)"] -5 5) & 1.mwe_id=2.mwe_id within <s/>

Ansa211 commented 6 years ago

Further correspondence with Ondrej Herman has clarified the issue even further.

From my message:

Could you please be more specific about what you mean by ambiguous label positions? Is it the case that any query of the form (meet 1:[conditions1] 2:[conditions2] -num1 num2) & 1.attribute1 = 2.attribute2 is malformed? Or even any (meet 1:[conditions1] 2:[conditions2] -num1 num2)? (From a tiny bit of experimentation, I suspect the latter.) Also, does the same issue concern any other query types that you can think of?

I tried to emulate such queries (with the condition on some parameters being equal between the two words, so that I really need the labels) through the use of filters, but I found no way to do it: the labels of the positions (such as 1 and 2) are not carried over from the original query to the application of the filter.

Of course, one could go back to (1:[conditions1] []{0,num2} 2:[conditions2] | 2:[conditions2] []{0,num1} 1:[conditions1]) & 1.attribute1 = 2.attribute2 which should work (is that correct?), but that means losing the functionality of meet (the fact that only the two relevant words are highlighted and only one of them is the KWIC).

I would be grateful if you have any further ideas for reformulating/emulating this type of query. But more importantly, I would like to understand better which queries I should avoid.


From Ondrej Herman's reply:

The operation (meet A B x y) tries to find every A that has some occurrence of B within the window given by the parameters x and y. The global condition then filters out the rows of this result that do not satisfy the required condition. This means that the result will never contain more than one occurrence of A at the same position. In general, meet is not even commutative.

Your first example can give valid results, but only if the corpus contains exactly one B for each A. The second query is fine, but the result is again all A, and the label for B is more of an informative nature for the individual occurrences.
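The behaviour described in this reply can be illustrated with a small toy model. This is only a sketch and not Manatee's implementation: `toy_meet`, `global_condition`, the three-token sentence and its `mwe_id` values are all made up for illustration, and the rule "the first B in the window wins" is an assumption (the reply only says that which candidate survives is unspecified).

```python
from typing import Callable, Dict, List, Tuple

Token = Dict[str, str]
Cond = Callable[[Token], bool]


def toy_meet(tokens: List[Token], cond_a: Cond, cond_b: Cond,
             lo: int, hi: int) -> List[Tuple[int, int]]:
    """Toy model: at most one (kwic_position, labelled_b_position) row per A."""
    rows = []
    for i, tok in enumerate(tokens):
        if not cond_a(tok):
            continue
        # Scan the window [i + lo, i + hi], clipped to the sentence.
        for j in range(max(0, i + lo), min(len(tokens), i + hi + 1)):
            if cond_b(tokens[j]):
                rows.append((i, j))
                break  # ASSUMPTION: the first candidate B wins, the rest are dropped
    return rows


def global_condition(tokens: List[Token], rows: List[Tuple[int, int]],
                     attr: str) -> List[Tuple[int, int]]:
    """Keep only rows whose two labelled tokens agree on attr."""
    return [(i, j) for i, j in rows if tokens[i][attr] == tokens[j][attr]]


# Hypothetical sentence; the mwe_id values are invented.
sentence = [
    {"word": "gave", "mwe_id": "1;VID"},
    {"word": "it", "mwe_id": "_"},
    {"word": "up", "mwe_id": "1;VID"},
]
has_mwe = lambda t: ";" in t["mwe_id"]
any_token = lambda t: True

left_hits = global_condition(
    sentence, toy_meet(sentence, has_mwe, any_token, -5, -1), "mwe_id")
right_hits = global_condition(
    sentence, toy_meet(sentence, has_mwe, any_token, 1, 5), "mwe_id")
print(left_hits)   # the pair is found when "up" is the KWIC
print(right_hits)  # the same pair is lost when "gave" is the KWIC
```

In this model the window to the right of "gave" labels "it" first, so the matching token "up" never reaches the global condition. That is one way the left and right variants of the query can return different numbers of rows.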

The query (1:[conditions1] []{0,num2} 2:[conditions2] | 2:[conditions2] []{0,num1} 1:[conditions1]) & 1.attribute1 = 2.attribute2 has a similar problem. The partial results of the left and right parts around the vertical bar may be identical and differ only in their labels. Only one of them then passes through the vertical bar operator to the evaluation of the global condition.
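The same effect for the vertical bar can be sketched as follows. Again this is a toy model rather than Manatee code: `toy_union`, the two-token example and the choice that the left alternative's labelling survives are assumptions made for illustration only.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]          # (first position, last position) of a match
Labels = Dict[int, int]         # label number -> token position
Row = Tuple[Span, Labels]


def toy_union(left: List[Row], right: List[Row]) -> List[Row]:
    """Toy model of `|`: rows with the same span collapse into one row."""
    seen: Dict[Span, Labels] = {}
    for span, labels in left + right:
        # ASSUMPTION: the first labelling seen for a span is the one kept.
        seen.setdefault(span, labels)
    return sorted(seen.items())


# Hypothetical tokens; the condition below plays the role of 1.lemma = 2.word.
tokens = [
    {"word": "run", "lemma": "run"},
    {"word": "runs", "lemma": "run"},
]


def condition(labels: Labels) -> bool:
    return tokens[labels[1]]["lemma"] == tokens[labels[2]]["word"]


left_rows = [((0, 1), {1: 0, 2: 1})]    # alternative "1:[...] ... 2:[...]"
right_rows = [((0, 1), {1: 1, 2: 0})]   # alternative "2:[...] ... 1:[...]"

merged_hits = [row for row in toy_union(left_rows, right_rows) if condition(row[1])]
separate_hits = [row for row in left_rows + right_rows if condition(row[1])]
print(merged_hits)
print(separate_hits)
```

Both alternatives cover the same span, so only one labelling reaches the global condition. In this sketch the surviving labelling happens to fail the condition, although the discarded one would have passed.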

The query with "meet" differs from this query in one more respect: with meet, A and B can be located at the same position.

I am afraid that we cannot handle these restrictions very well in CQL; you are approaching the limits of the language. I cannot think of any other solution that could be put together by clicking in the interface either, but I will ask around some more.

Personally, I would proceed by modifying the corpquery script distributed with Manatee: I would feed it your query without the global conditions, which I would then evaluate outside CQL.
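A minimal sketch of what evaluating the global condition outside CQL could look like. It deliberately does not call Manatee or corpquery. It assumes the tokens of each sentence are already available as plain dictionaries, and the helper names `pairs_in_window` and `post_filter`, as well as the example sentence, are hypothetical.

```python
from typing import Callable, Dict, Iterator, List, Tuple

Token = Dict[str, str]
Cond = Callable[[Token], bool]


def pairs_in_window(sentence: List[Token], cond_a: Cond, cond_b: Cond,
                    lo: int, hi: int) -> Iterator[Tuple[int, int]]:
    """Yield every (a_position, b_position) pair, not just one B per A."""
    for i, a in enumerate(sentence):
        if not cond_a(a):
            continue
        # Clipping to the sentence plays the role of `within <s/>`.
        for j in range(max(0, i + lo), min(len(sentence), i + hi + 1)):
            if cond_b(sentence[j]):
                yield i, j


def post_filter(sentence: List[Token], pairs, attr: str) -> List[Tuple[int, int]]:
    """Apply the global condition 1.attr = 2.attr to every candidate pair."""
    return [(i, j) for i, j in pairs if sentence[i][attr] == sentence[j][attr]]


# Hypothetical sentence; the mwe_id values are invented.
example = [
    {"word": "gave", "mwe_id": "1;VID"},
    {"word": "it", "mwe_id": "_"},
    {"word": "up", "mwe_id": "1;VID"},
]
in_mwe = lambda t: ";" in t["mwe_id"]
anything = lambda t: True

right_pairs = post_filter(example, pairs_in_window(example, in_mwe, anything, 1, 5), "mwe_id")
left_pairs = post_filter(example, pairs_in_window(example, in_mwe, anything, -5, -1), "mwe_id")
print(right_pairs)
print(left_pairs)
```

Because every candidate pair is checked, the left and right windows return the same pairs with the roles swapped, which is the symmetry expected in the original question. The cost is that the number of candidate pairs grows with the window size.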