write a CG to CoNLL-U converter

maryszmary commented 7 years ago

The previous version was converting CG to conllx, which had bugs and is not supported in the current version anyway. Necessary for #40.

arademaker commented 7 years ago

Which CG specification was used? I am aware of the niceline format; is there any reference for the CG format?

ftyers commented 7 years ago

It looks like this:

# text = He boued e tebr Mona er gegin.
# text[eng] = Mona eats her food here in the kitchen.
# labels = press_1986 ch_syntax p_197 to_check
"<He>"
    "he" det pos f sp @det #1->2
"<boued>"
    "boued" n m sg @obj #2->4
"<e>"
    "e" vpart obj @aux #3->4
"<tebr>"
    "debriñ" vblex pri p3 sg @root #4->0
"<Mona>"
    "Mona" np ant f sg @nsubj #5->4
"<er>"
    "e" pr @case #6->8
        "an" det def sp @det #7->8
"<gegin>"
    "kegin" n f sg @obl #8->4
"<.>"
    "." sent @punct #9->4

Not sure if there is a specification somewhere; @TinoDidriksen might know.

arademaker commented 7 years ago

Yes, this is one of the formats cited in http://beta.visl.sdu.dk/cg3/single/#streamformats. But I would like to understand if a more formal specification is available.

TinoDidriksen commented 7 years ago

The documentation is as formal as it gets at the moment. A proper BNF for the stream format is on my to-do list. There's probably also a difference between what the CG world as a whole thinks and how CG-3 interprets things.

But it is a very accepting format. As long as the initial conditions are met, it's just space separated tags. How those tags are interpreted depends on which tool or grammar consumes them.

ftyers commented 7 years ago

Note that our numbering of subreadings (e.g. "er" = "e" + "an") is non-standard at the moment. We're hoping that support for this will be added to CG at some point in the future, but for now it's not supported in the main VISL pipeline (although it is in the various ancillary scripts we have for conversion between VISL and other formats).

maryszmary commented 7 years ago

@TinoDidriksen, am I right that there is no JS library for parsing this format? (I've found only a Python one.)

jonorthwash commented 7 years ago

@maryszmary don't forget this python library for parsing CG!

TinoDidriksen commented 7 years ago

I certainly don't know of any JS library for handling the format, but it should be easy enough to make. Just remember that the word/base/surface forms and lemmas can contain spaces and unescaped characters; anything after that is just space-separated.
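As a rough illustration of the point above, here is a minimal Python sketch of splitting CG lines into parts. It assumes the line shapes seen in the example earlier in this thread (a `"<wordform>"` line followed by indented `"lemma" tag tag ...` lines); the function names are hypothetical and this is not taken from any existing library. Because the lemma may contain spaces, it is read as everything between the first and the last double quote, and only the remainder is split on whitespace.

```python
import re

# Assumed line shapes, based on the example earlier in this thread:
#   wordform line:  "<surface form>"
#   reading line:   <indent>"lemma" tag tag ... @function #id->head
WORDFORM_RE = re.compile(r'^"<(.*)>"\s*$')
READING_RE = re.compile(r'^(\s+)"(.*)"(.*)$')


def parse_wordform(line):
    """Return the surface form of a wordform line, or None if it is not one."""
    match = WORDFORM_RE.match(line)
    return match.group(1) if match else None


def parse_reading(line):
    """Return indent width, lemma and tag list of a reading line, or None."""
    match = READING_RE.match(line)
    if match is None:
        return None
    indent, lemma, rest = match.groups()
    # The lemma is kept whole (spaces included); only the rest is split.
    return {"indent": len(indent), "lemma": lemma, "tags": rest.split()}
```

The greedy match on the lemma is a simplification: a tag that itself contains a double quote would be swallowed into the lemma, so a real parser would need a stricter rule.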

maryszmary commented 7 years ago

The converter is now implemented. The only thing that bothers me is the format into which I should convert the grammar features. At the moment, e.g., the CG3 token

"<ашқанда>" "аш" v tv ger_past loc @advcl #3->12

is converted to this CoNLL-U line:

3 ашқанда аш v _ tv|gerpast|loc 12 advcl _

Is it OK or should I do something about it?
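For readers following along, the mapping described in this comment could be sketched in Python roughly as below. This is not the converter's actual code. The function name is hypothetical, the tags are copied verbatim (an assumption), and the output uses the ten tab-separated columns of standard CoNLL-U (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC), so it will not match the line quoted above character for character.

```python
def cg_to_conllu(form, lemma, tags):
    """Build one CoNLL-U line from an unambiguous CG reading (sketch only)."""
    pos, feats, deprel, tok_id, head = "_", [], "_", "_", "_"
    for tag in tags:
        if tag.startswith("@"):
            # Syntactic function tag, e.g. @advcl -> DEPREL column.
            deprel = tag[1:]
        elif tag.startswith("#") and "->" in tag:
            # Dependency tag, e.g. #3->12 -> ID and HEAD columns.
            tok_id, head = tag[1:].split("->")
        elif pos == "_":
            # Assumption: the first plain tag is the part of speech.
            pos = tag
        else:
            feats.append(tag)
    columns = [tok_id, form, lemma, pos, "_",
               "|".join(feats) or "_", head, deprel, "_", "_"]
    return "\t".join(columns)


print(cg_to_conllu("ашқанда", "аш",
                   ["v", "tv", "ger_past", "loc", "@advcl", "#3->12"]))
```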

maryszmary commented 7 years ago

Ah, and of course it breaks on ambiguous analyses. What do you think is the best way to handle this (i.e. how to store the data)?

jonorthwash commented 7 years ago

With ambiguous analyses, just choose one analysis. E.g., just go with the first one.

jonorthwash commented 7 years ago

> Is it OK or should I do something about it?

I assume you're asking about the tags? If so, there's no need to convert them to anything else; just leave them as they are.

jonorthwash commented 7 years ago

It appears that it doesn't yet support subreadings?

maryszmary commented 7 years ago

> I assume you're asking about the tags? If so, there's no need to convert them to anything else; just leave them as they are.

Yes, I mean grammar tags. Ok.

jonorthwash commented 7 years ago

The graph also doesn't seem to update when the CG-format text is updated.

maryszmary commented 7 years ago

> It appears that it doesn't yet support subreadings?

What are subreadings?

maryszmary commented 7 years ago

> The graph also doesn't seem to update when the CG-format text is updated.

Yes, this is because all the editing functionality is built for conllu. I can only fix this by writing a conllu to CG converter, which I'm going to do tomorrow.

jonorthwash commented 7 years ago

See earlier in this issue, and elsewhere where we've discussed this.

It's like

"<er>"
    "e" pr @case #6->8
        "an" det def sp @det #7->8

in the example provided earlier in this issue.

The interface supports it when it's encoded in CoNLL-U format.

maryszmary commented 7 years ago

> subreadings

Ah, yes, it seems to be what I meant by "ambiguous analyses".

maryszmary commented 7 years ago

Wait, how can CoNLL-U support it? I'm probably missing something. Can you give an example?

jonorthwash commented 7 years ago

> all the editing functionality is built for conllu.

Can't it have native support for editing in both CoNLL-U and CG3? Or is it that the GUI can only modify conllu format?

jonorthwash commented 7 years ago

There's an example of subtokens/subreadings here: https://github.com/jonorthwash/ud-annotatrix/issues/53#issuecomment-321082857.

It's very different from ambiguous analyses.

maryszmary commented 7 years ago

> Can't it have native support for editing in both CoNLL-U and CG3?

Hm, I thought that some (long) time ago we discussed that the data should be stored in one format, and that the best choice is CoNLL-U. (That seems to me the right solution: have one format to store the data and to be handled by the GUI, and have converters to all the other formats we want to support.) So I believe there won't be any problems with editing data once I write a CoNLL-U to CG converter. All the editing will still be handled with CoNLL-U, but the user will see (and will be able to edit) CG3.
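To make the "store CoNLL-U, show CG3" idea concrete, the reverse direction might look something like the Python sketch below. It is only an illustration under assumptions: the function name is hypothetical, the input is a single ten-column CoNLL-U token line, the features are plain tags rather than feat=val pairs, and the reading is indented with four spaces to match the examples in this thread.

```python
def conllu_to_cg(line):
    """Render one CoNLL-U token line as a CG cohort with a single reading."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 10:
        raise ValueError("expected 10 tab-separated CoNLL-U columns")
    tok_id, form, lemma, upos, _xpos, feats, head, deprel = cols[:8]
    tags = []
    if upos != "_":
        tags.append(upos)
    if feats != "_":
        # Assumption: features are plain tags such as tv|ger_past|loc.
        tags.extend(feats.split("|"))
    if deprel != "_":
        tags.append("@" + deprel)
    tags.append("#{}->{}".format(tok_id, head))
    return '"<{}>"\n    "{}" {}'.format(form, lemma, " ".join(tags))


print(conllu_to_cg("3\tашқанда\tаш\tv\t_\ttv|ger_past|loc\t12\tadvcl\t_\t_"))
```

Multiword tokens and subreadings are deliberately left out of this sketch.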

maryszmary commented 7 years ago

> There's an example here: #53 (comment).

Ah, I see! I thought these are ambiguous analyses. I'll fix it tomorrow.

jonorthwash commented 7 years ago

> All the editing will still be handled with CoNLL-U, but the user will see (and will be able to edit) CG3.

Okay, I trust your judgement here. Hopefully this won't be too difficult.

jonorthwash commented 7 years ago

> Ah, I see! I thought these are ambiguous analyses. I'll fix it tomorrow.

The extra indent is what indicates that it's a subreading and not an additional possible reading.
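The indentation rule can be sketched in Python as follows. This is a simplified illustration with hypothetical function names, not the project's code: the first reading line of a cohort sets the base indent, a line at that indent starts a new (alternative) reading, and a more deeply indented line is attached to the preceding reading as a subreading.

```python
def group_readings(reading_lines):
    """Group a cohort's reading lines into [main, sub, sub, ...] lists."""
    readings = []
    base_indent = None
    for line in reading_lines:
        indent = len(line) - len(line.lstrip())
        if base_indent is None:
            base_indent = indent
        if indent <= base_indent or not readings:
            # Same indent as the first reading: an alternative reading.
            readings.append([line.strip()])
        else:
            # Deeper indent: a subreading of the previous reading.
            readings[-1].append(line.strip())
    return readings


def is_ambiguous(reading_lines):
    """A cohort is ambiguous if it has more than one top-level reading."""
    return len(group_readings(reading_lines)) > 1


er = ['    "e" pr @case #6->8', '        "an" det def sp @det #7->8']
print(group_readings(er))  # one reading with one subreading
print(is_ambiguous(er))
```

A cohort whose reading lines all share the same indent, such as two `"call"` readings under one wordform, would come out as ambiguous instead.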

ftyers commented 7 years ago

Terms:

ambiguous

"<German>"
        "German" adj SELECT:78
;       "German" n sg SELECT:78
;       "German" np ant m sg SELECT:78

and:

"<called>"
        "call" vblex pp 
        "call" vblex past

subreadings:

"<wasn't>"
    "be" vbser past p1 sg
        "not" adv
    "be" vbser past p3 sg
        "not" adv

Here are my thoughts:

1) If the data in CG format is ambiguous (more than one possible reading per word), then it shouldn't be convertible to CoNLL-U: the box/button for CoNLL-U should be greyed out.
2) The backend data storage should be CoNLL-U; features not available in other formats (e.g. editing the surface segmentation) will just not be available in those windows.
3) If the features in CoNLL-U are feat=val pairs, then only the val should be shown in CG mode, maybe with a tooltip with the feat name.
4) If the features in CoNLL-U are just tags, then having them as "n|m|sg" in CoNLL-U and "n m sg" in CG is fine.
5) If the CG is fully disambiguated (1 main reading per token) and it doesn't have @, then @x should be added automatically. Also, the token indices should be added automatically.

Here is a video of how I do editing in CG mode; ideally the interface shouldn't be slower to use than this:

https://youtu.be/HveQZh178T4
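Points 3) to 5) above could be sketched in Python roughly like this. The function names are hypothetical and this is only one possible reading of the proposal: feat=val pairs are reduced to their values, plain tags pass through unchanged, and `@x` is appended only when a reading has no `@` tag yet. Checking that the data is fully disambiguated, showing tooltips, and adding token indices are left to the caller.

```python
def feats_to_cg_tags(feats):
    """Turn a CoNLL-U FEATS string into a list of CG tags."""
    if feats in ("", "_"):
        return []
    tags = []
    for item in feats.split("|"):
        # feat=val pair: keep only the value; plain tag: keep as it is.
        tags.append(item.split("=", 1)[1] if "=" in item else item)
    return tags


def add_default_function(tags):
    """Append the placeholder @x if the reading has no @ tag yet."""
    if any(tag.startswith("@") for tag in tags):
        return list(tags)
    return list(tags) + ["@x"]


print(" ".join(feats_to_cg_tags("n|m|sg")))       # n m sg
print(add_default_function(["n", "m", "sg"]))     # ['n', 'm', 'sg', '@x']
```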

maryszmary commented 7 years ago

Thank you @ftyers !

maryszmary commented 7 years ago

> if the CG is fully disambiguated (1 main reading per token) and it doesn't have @ then @x should be added automatically. Also the token indices should be added automatically

If I get it right, this is probably not about conversion, but about modifying the data by default. Do you want the interface to add @x automatically as soon as CG data are uploaded?

maryszmary commented 7 years ago

> Here is a video of how I do editing in CG mode; ideally the interface shouldn't be slower to use than this: https://youtu.be/HveQZh178T4

This is relevant for #10.

jonorthwash commented 7 years ago

> the box/button for Conllu should be greyed out.

Or / in addition, it would be good to have a message pop up saying why it won't work, or the convert button can be red instead of grey, or the format detection message can turn red and say something like "CG3 with subreadings - not displayable/convertible!"

The main thing is that it should be made obvious to the user why it won't convert the CG or display it as a graph.

ftyers commented 7 years ago

> If I get it right, this is probably not about conversion, but about modifying the data by default. Do you want the interface to add @x automatically as soon as CG data are uploaded?

Only if the data are unambiguous. Perhaps, instead of doing it automatically, it could be done with a keyboard shortcut, like select all (CTRL+A) and then CTRL+@?

@jonorthwash yes, any of those options are good too.

@maryszmary yes, also relevant to #10, but I don't think we should implement the keyboard shortcuts for disambiguation right now: not enough time, not high priority.

maryszmary commented 7 years ago

Now the converter supports subreadings:

maryszmary commented 7 years ago

It also reacts to ambiguity:

jonorthwash / ud-annotatrix

write a CG to CoNLL-U converter #52