Open amir-zeldes opened 9 years ago
Sounds like a good approach. I can imagine that this will reduce conversion, import, and query time. One question: is it a problem for ANNIS to have only untyped edges or, in the case of RST, only typed edges? @thomaskrause: Would that fit into ANNIS?
In theory this should work well with the query SQL generation. I just need to check whether there are any other places (like the corpus explorer/browser) that implicitly assume there is always at least one named component.
Many corpora do not have multiple types of edges (or at least not within a given annotation layer). For example, in the Penn Treebank, WSJ, Switchboard, and also OntoNotes, which is derived from the same underlying corpus, there are no secedges (in fact, this holds for anything coming from PTB brackets).
The relANNIS export by default generates both a NULL component and a typed component (usually 'edge') so that these searches work:
cat="NP" > cat="PP"
cat="NP" >edge cat="PP"
In a PTB corpus, no one will ever use the second query. For this reason, it would be useful to have a special parameter that restricts the export of dominance edges in a certain SLayer to either the NULL component only or the named component only (in this case, all dominance edges in the layer "ptb" should only be exported as NULL). In a corpus with both PTB trees and RST trees, like GUM, it would be useful for the syntax trees to have only NULL components and the RST trees to have only typed "rst" components.
Removing the duplicate component would lead to a massive reduction in corpus size and an increase in performance for PTB-style corpora, of which there are many.
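To make the proposal concrete, here is a minimal Python sketch of how a per-layer policy could filter the components written on export. It is not the exporter's actual code: the names `DomEdge`, `export_components`, and the policy values `both`, `null_only`, and `named_only` are all hypothetical, and the real relANNIS component table has more columns than the tuples used here.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical policy values; the real exporter parameter may look different.
BOTH, NULL_ONLY, NAMED_ONLY = "both", "null_only", "named_only"


@dataclass(frozen=True)
class DomEdge:
    """A dominance edge, reduced to the fields needed for this sketch."""
    source: str
    target: str
    layer: str
    edge_type: Optional[str] = None  # e.g. "edge" for PTB trees, "rst" for RST


def export_components(edges, policy_by_layer, default=BOTH):
    """Return (source, target, layer, component_name) rows to export.

    A component_name of None stands for the NULL component. Layers without
    an entry in policy_by_layer fall back to the default (export both).
    """
    rows = []
    for e in edges:
        policy = policy_by_layer.get(e.layer, default)
        if policy in (BOTH, NULL_ONLY):
            rows.append((e.source, e.target, e.layer, None))
        # An edge without a type cannot produce a named component.
        if policy in (BOTH, NAMED_ONLY) and e.edge_type is not None:
            rows.append((e.source, e.target, e.layer, e.edge_type))
    return rows


edges = [
    DomEdge("NP1", "PP1", "ptb", "edge"),
    DomEdge("PP1", "NP2", "ptb", "edge"),
    DomEdge("rst_root", "edu1", "rst", "rst"),
]

# Current default behaviour: every typed edge is written twice.
default_rows = export_components(edges, {})

# Proposed behaviour for a GUM-like corpus.
filtered_rows = export_components(edges, {"ptb": NULL_ONLY, "rst": NAMED_ONLY})

print(len(default_rows), len(filtered_rows))
```

With this toy input the default export writes each edge twice, while the per-layer policy writes each edge once, which is the kind of halving the size argument above relies on.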