korpling / annatto

Converts linguistic data formats based on the graphANNIS data model as intermediate representation and can apply consistency tests.
Apache License 2.0

CoNLL-U metadata validation/cleansing #251

Open chiarcos opened 1 week ago

chiarcos commented 1 week ago
MartinKl commented 1 week ago

Thank you for the submission. Your request addresses several issues.

First, the dependency visualizer does not work because the graphml export sets the node key incorrectly. This is being fixed in #252.

Regarding support for the data you provided and/or other versions of CoNLL: we suggest sticking to the notation that uses = as the key-value delimiter for sentence annotations, since the delimiter seems easy to replace. We will nevertheless extend the conll module so that sentence-level comments that do not start with key = are imported as bare values, each added as a sentence annotation conll::comment holding that value. See #257 for more details. For your data, this would yield an annotation conll::comment="text: ..." for each sentence.
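The fallback for comments without a key can be illustrated with a minimal Python sketch. This is not annatto's implementation: the function name `parse_sentence_comment`, the rule that a key must not contain whitespace, and the omission of any namespace handling for regular keys are all assumptions made for illustration.

```python
# Hypothetical sketch; names and the exact key rule are assumptions, not annatto's API.
COMMENT_ANNO = "conll::comment"  # annotation name taken from the discussion above


def parse_sentence_comment(line: str) -> tuple[str, str]:
    """Map a '# ...' comment line to an (annotation name, value) pair."""
    body = line.lstrip("#").strip()
    key, sep, value = body.partition("=")
    key = key.strip()
    # Assumption: a line counts as key-value only if a non-empty key
    # without whitespace precedes the first '='.
    if sep and key and not any(c.isspace() for c in key):
        return key, value.strip()
    # Otherwise keep the whole comment body as a bare value.
    return COMMENT_ANNO, body
```

With this rule, a line such as `# sent_id = 1` is read as a key-value pair, while `# text: Hello world` is kept whole as the value of conll::comment.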

Are there any other features of CoNLL-X that you consider necessary?

chiarcos commented 1 week ago

Thank you, #257 is the best way to deal with that IMHO.

As for other features of CoNLL-X, the last two columns have different functions (cf. https://aclanthology.org/W06-2920.pdf). I guess it's not worth supporting them: they were not widely used in the first place, and this only concerns legacy data, which does not seem to be publicly available anymore (at least not from https://ilk.uvt.nl/conll/post_task_data.html). The format is still used by some older parsers, though, and is sometimes required as input for downstream tasks. So, while I would not advise going for full CoNLL-X support, I would suggest being robust against CoNLL-X input, i.e., checking whether CoNLL-X data with PHEAD (9th column) set to an integer would break the CoNLL-U conversion, because CoNLL-U expects only pairs of head IDs and dependency labels in that column.
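To make the suggested check concrete, here is a rough Python sketch that classifies the value of the 9th column. The function name and the category labels are hypothetical, and the pattern for CoNLL-U DEPS entries (head:deprel pairs separated by |, with decimal head IDs allowed for empty nodes) is a simplification rather than a full validator.

```python
import re

# Simplified pattern for one CoNLL-U DEPS entry, e.g. "2:nsubj" or "5.1:nsubj:pass".
DEPS_PAIR = re.compile(r"[0-9]+(\.[0-9]+)?:\S+")


def classify_ninth_column(value: str) -> str:
    """Return a coarse label for the 9th column of a token line."""
    if value == "_":
        return "empty"
    # A bare integer is what CoNLL-X PHEAD would contain.
    if re.fullmatch(r"[0-9]+", value):
        return "conllx_phead"
    # CoNLL-U DEPS: one or more head:deprel pairs separated by '|'.
    if all(DEPS_PAIR.fullmatch(pair) for pair in value.split("|")):
        return "conllu_deps"
    return "invalid"
```

An importer could use such a label to skip or warn about bare integers instead of failing on them; which of the two is preferable is left open here.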

You can synthesize such data from CoNLL-U data by copying the values from the HEAD column into the 9th column and the values from the DEPREL column into the 10th column.
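A small Python sketch of that synthesis, assuming tab-separated token lines with exactly ten columns; the function name `to_conllx_like` is made up for illustration, and comment lines, blank lines, and anything else without ten columns are passed through unchanged.

```python
def to_conllx_like(conllu_text: str) -> str:
    """Copy HEAD/DEPREL (columns 7/8) into columns 9/10 of each token line."""
    out = []
    for line in conllu_text.splitlines():
        cols = line.split("\t")
        # Leave comments, blank lines, and malformed lines untouched.
        if line.startswith("#") or len(cols) != 10:
            out.append(line)
            continue
        cols[8] = cols[6]  # HEAD   -> 9th column (PHEAD in CoNLL-X)
        cols[9] = cols[7]  # DEPREL -> 10th column (PDEPREL in CoNLL-X)
        out.append("\t".join(cols))
    return "\n".join(out)
```

The result is not valid CoNLL-X in every respect (multiword token and empty node lines are kept, for instance), but it should be enough to test whether an integer in the 9th column breaks the import.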