korpling / ANNIS

ANNIS is an open source, versatile web browser-based search and visualization architecture for complex multilevel linguistic corpora with diverse types of annotation.
http://corpus-tools.org/annis/
Apache License 2.0

Frequencies fails on disjoint AQL #394

Closed amir-zeldes closed 9 years ago

amir-zeldes commented 9 years ago

When a disjunction is used in AQL, the frequencies interface instructs the user to name nodes with the same name to group results correctly. This is fine, but following the instructions and executing the frequency query returns nothing. Example in any SCRIPTORIUM corpus (e.g. besa.letters):

n1#pos | n1#norm

The interface does not complain, but outputs nothing.

thomaskrause commented 9 years ago

This seems to be a problem with the automatic frequency definition generation.

If you manually remove one of the lines (either "pos" or "norm") you will get a result for the nodes carrying the selected annotations. Since no node has both the "norm" and "pos" annotation you don't get any results.

Unfortunately it's impossible to know beforehand whether a node can have both annotations or not. The same applies if people manually define annotations as output that don't exist. I'm not sure how to solve this for the automatic mode. We might issue a warning if we detect that the same node is referenced with two different annotations, and automatically select only the first one.
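The warn-and-keep-first idea could look roughly like the sketch below. It is only an illustration: `dedupe_frequency_definition` and the `(node_name, annotation_name)` pairs are hypothetical stand-ins for the automatically generated frequency definition, not ANNIS code.

```python
import warnings


def dedupe_frequency_definition(entries):
    """Keep the first annotation per node name and warn about the rest.

    `entries` is a list of (node_name, annotation_name) pairs. This is a
    hypothetical representation of the frequency definition, assumed here
    for illustration only.
    """
    first_anno = {}  # node name -> annotation that was kept for it
    kept = []
    for node, anno in entries:
        if node in first_anno:
            # Same node referenced again with another annotation: skip it.
            warnings.warn(
                f"node {node!r} is already counted by {first_anno[node]!r}; "
                f"ignoring {anno!r}"
            )
            continue
        first_anno[node] = anno
        kept.append((node, anno))
    return kept


# The query from this issue, n1#pos | n1#norm, yields two entries for n1.
definition = dedupe_frequency_definition([("n1", "pos"), ("n1", "norm")])
```

With this approach the frequency query would return results for "pos" only, which avoids the empty output but silently narrows the query apart from the warning.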

amir-zeldes commented 9 years ago

I think for annotations that don't exist, not outputting anything is fine as a first approximation (though Shuo is working on validating that annotations exist in the normal query mode, and maybe someday we'll want that for frequencies too).

For the disjunction case, I think the solution should be generic like this:

For any disjunction, two (or more) frequency queries are created, each one oblivious of the others. Then they are UNION ALLed, so that #1 or #my_anno occupies the same column. If a node has both annotations, then two results will be fetched for that node (as expected) and it will be duplicated in the matrix (the IDs can be used afterwards to notice this). If some (or all) nodes don't have both annotations, we will just get one record for the annotation that does exist. I think this is the expected behavior for disjunction, and I think it's robust both when multiple annotations match and when just one does.
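A minimal sketch of the UNION ALL idea, using an in-memory SQLite database. The `node_anno` table, its columns, and the `disjunction_frequencies` helper are assumptions made up for this example; the real ANNIS schema and query generation are different.

```python
import sqlite3


def disjunction_frequencies(conn, anno_names):
    """Run one sub-query per alternative and UNION ALL them into one column."""
    if not anno_names:
        return []
    # Each branch is independent of the others (hypothetical schema).
    branch = "SELECT node_id, value FROM node_anno WHERE name = ?"
    union = " UNION ALL ".join([branch] * len(anno_names))
    sql = (
        f"SELECT value, COUNT(*) AS freq FROM ({union}) "
        "GROUP BY value ORDER BY freq DESC, value"
    )
    return conn.execute(sql, list(anno_names)).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE node_anno (node_id INTEGER, name TEXT, value TEXT)")
conn.executemany(
    "INSERT INTO node_anno VALUES (?, ?, ?)",
    [
        (1, "pos", "N"),
        (2, "pos", "V"),
        (3, "pos", "N"),
        (4, "norm", "rome"),
        (5, "norm", "rome"),
        (5, "pos", "N"),  # node 5 has both annotations, so it is counted twice
    ],
)

rows = disjunction_frequencies(conn, ["pos", "norm"])
# rows == [("N", 3), ("rome", 2), ("V", 1)]
```

Nodes that carry only one of the annotations contribute one row, and node 5 contributes two, which matches the duplication described above.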

Alternatively, for exporters that automatically give all annos for each match node (Weka), we could run a DISTINCT (or just a non-ALL UNION) and give output for any node that matches, giving all of its annotations with column names as specified by the Weka header.
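The non-ALL variant might be sketched as follows, again with a made-up `node_anno` table and a hypothetical helper name. The plain UNION deduplicates node IDs, so each matching node appears once and is emitted with all of its annotations.

```python
import sqlite3


def matched_nodes_with_all_annos(conn, anno_names):
    """Return {node_id: {anno_name: value}} for nodes matching any alternative."""
    if not anno_names:
        return {}
    branch = "SELECT node_id FROM node_anno WHERE name = ?"
    # Plain UNION (not UNION ALL) removes duplicate node IDs.
    union = " UNION ".join([branch] * len(anno_names))
    sql = (
        "SELECT node_id, name, value FROM node_anno "
        f"WHERE node_id IN ({union}) ORDER BY node_id, name"
    )
    table = {}
    for node_id, name, value in conn.execute(sql, list(anno_names)):
        table.setdefault(node_id, {})[name] = value
    return table


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE node_anno (node_id INTEGER, name TEXT, value TEXT)")
conn.executemany(
    "INSERT INTO node_anno VALUES (?, ?, ?)",
    [
        (1, "pos", "N"),
        (1, "norm", "rome"),   # node 1 matches both alternatives, listed once
        (2, "norm", "paul"),
        (3, "lemma", "be"),    # node 3 matches neither alternative
    ],
)

export = matched_nodes_with_all_annos(conn, ["pos", "norm"])
# export == {1: {"norm": "rome", "pos": "N"}, 2: {"norm": "paul"}}
```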
