Closed maryszmary closed 7 years ago
What is the CG specification that was used? I am aware of the niceline format; is there any reference for the CG format?
It looks like this:
# text = He boued e tebr Mona er gegin.
# text[eng] = Mona eats her food here in the kitchen.
# labels = press_1986 ch_syntax p_197 to_check
"<He>"
    "he" det pos f sp @det #1->2
"<boued>"
    "boued" n m sg @obj #2->4
"<e>"
    "e" vpart obj @aux #3->4
"<tebr>"
    "debriñ" vblex pri p3 sg @root #4->0
"<Mona>"
    "Mona" np ant f sg @nsubj #5->4
"<er>"
    "e" pr @case #6->8
        "an" det def sp @det #7->8
"<gegin>"
    "kegin" n f sg @obl #8->4
"<.>"
    "." sent @punct #9->4
Not sure if there is a specification somewhere, @TinoDidriksen might know
Yes, this is one of the formats cited in http://beta.visl.sdu.dk/cg3/single/#streamformats. But I would like to understand if a more formal specification is available.
The documentation is as formal as it gets, at the moment. A proper BNF for the stream format is on my ToDo. There's probably also a difference in what the CG world as a whole thinks and how CG-3 interprets things.
But it is a very accepting format. As long as the initial conditions are met, it's just space separated tags. How those tags are interpreted depends on which tool or grammar consumes them.
Note that our numbering of subreadings (e.g. "er" = "e" + "an") is non-standard at the moment. We're hoping that support for this will be added into CG at some point in the future, but for now it's not supported in the main VISL pipeline (although it is in the various ancillary scripts we have for conversion between VISL and other formats).
@TinoDidriksen, am I right that there is no JS library for parsing this format? (I've found only a Python one.)
@maryszmary don't forget this python library for parsing CG!
I certainly don't know of any JS library for handling the format, but it should be easy enough to make. Just remember that the word form (surface form) and the base form (lemma) can contain spaces and unescaped characters; anything after that is just space-separated.
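As a rough illustration of how little is needed, here is a minimal Python sketch of a reader for the stream format shown above. It is not taken from any existing library: `parse_cg`, the regular expressions, and the dictionary layout are all hypothetical names chosen for this example. It assumes that a base form ends at the first `"` that is followed by whitespace or the end of the line, and it silently skips anything it does not recognise, including readings marked as removed with a leading `;`.

```python
import re

# Cohort line: "<wordform>" at the start of the line (the wordform may contain spaces).
COHORT_RE = re.compile(r'^"<(.*)>"')
# Reading line: indentation, a quoted base form, then optional space-separated tags.
# Assumption: the base form ends at the first '"' followed by whitespace or end of line.
READING_RE = re.compile(r'^(\s+)"(.*?)"(?:\s+(.*))?$')


def parse_cg(text):
    """Parse CG stream text into a list of cohorts.

    Each cohort is {"form": str, "readings": [...]}, and each reading is
    {"indent": int, "lemma": str, "tags": [str, ...]}.
    """
    cohorts = []
    for line in text.splitlines():
        # Skip blank lines and "# key = value" comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        m = COHORT_RE.match(line)
        if m:
            cohorts.append({"form": m.group(1), "readings": []})
            continue
        m = READING_RE.match(line)
        if m and cohorts:
            indent, lemma, rest = m.groups()
            cohorts[-1]["readings"].append({
                "indent": len(indent),  # deeper indent than a sibling = subreading
                "lemma": lemma,
                "tags": rest.split() if rest else [],
            })
    return cohorts
```

The indentation width is kept on every reading so that a caller can tell subreadings from alternative readings later on. How the tags themselves are interpreted is left to whatever consumes the result.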
The converter is now implemented. The only thing that bothers me is the format to which I should convert the grammar features. At the moment, e.g., the CG3 token
"<ашқанда>" "аш" v tv ger_past loc @advcl #3->12
is converted to this conllu line:
3 ашқанда аш v _ tv|gerpast|loc 12 advcl _
Is it OK or should I do something about it?
Ah, and of course it breaks on ambiguous analyses. What do you think is the best way to handle this (i.e. how to store the data)?
With ambiguous analyses, just choose one analysis. E.g., just go with the first one.
Is it OK or should I do something about it?
I assume you're asking about the tags? If so, there's no need to convert them to anything else—just leave them as they are.
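To make the "leave the tags as they are" option concrete, here is a small Python sketch that turns one CG reading into one CoNLL-U line. The function name `reading_to_conllu` and the column choices are assumptions made for this illustration, not the converter discussed in this issue: the first plain tag goes into the UPOS column, the remaining plain tags are joined with `|` unchanged, `@label` becomes DEPREL, and `#id->head` supplies ID and HEAD. The sketch emits the ten tab-separated columns of standard CoNLL-U.

```python
import re

# Dependency tag in the style used in this issue: "#3->12" (token 3, head 12).
DEP_RE = re.compile(r"^#(\d+)->(\d+)$")


def reading_to_conllu(form, lemma, tags):
    """Build one CoNLL-U line from a surface form, a lemma and a list of CG tags."""
    token_id, head, deprel, plain = "_", "_", "_", []
    for tag in tags:
        m = DEP_RE.match(tag)
        if m:
            token_id, head = m.groups()
        elif tag.startswith("@"):
            deprel = tag[1:]            # "@advcl" -> "advcl"
        else:
            plain.append(tag)           # grammar tags are kept exactly as written
    upos = plain[0] if plain else "_"
    feats = "|".join(plain[1:]) or "_"
    # Columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
    return "\t".join([token_id, form, lemma, upos, "_", feats, head, deprel, "_", "_"])
```

Called with the token above, `reading_to_conllu("ашқанда", "аш", ["v", "tv", "ger_past", "loc", "@advcl", "#3->12"])` keeps `ger_past` untouched in the FEATS column.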
It appears that it doesn't yet support subreadings?
I assume you're asking about the tags? If so, there's no need to convert them to anything else—just leave them as they are.
Yes, I mean grammar tags. Ok.
The graph also doesn't seem to update when the CG-format text is updated.
It appears that it doesn't yet support subreadings?
What are subreadings?
The graph also doesn't seem to update when the CG-format text is updated.
Yes, this is because all the editing functionality is built for conllu. I can only fix this by writing a conllu to CG converter, which I'm going to do tomorrow.
See earlier in this issue, and elsewhere where we've discussed this.
It's like
"<er>"
    "e" pr @case #6->8
        "an" det def sp @det #7->8
in the example provided earlier in this issue.
The interface supports it when it's encoded in CoNLL-U format.
subreadings
Ah, yes, it seems to be what I meant by "ambiguous analyses".
Wait, how can CoNLL-U support it? I'm probably missing something. Can you give an example?
all the editing functionality is built for conllu.
Can't it have native support for editing in both CoNLL-U and CG3? Or is it that the GUI can only modify conllu format?
There's an example of subtokens/subreadings here: https://github.com/jonorthwash/ud-annotatrix/issues/53#issuecomment-321082857.
It's very different from ambiguous analyses.
Can't it have native support for editing in both CoNLL-U and CG3?
Hm, I thought that some (long) time ago we discussed that the data should be stored in one format, and that the best choice is CoNLL-U. (And it seems to me the right solution -- to have one format to store the data and to handle it by GUI, and have converters to all other formats we want to support). So, I believe, there won't be any problems with editing data when I write a conllu to CG converter. All the editing will still be handled with CoNLL-U, but the user will see (and will be able to edit) CG3.
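For orientation, a CoNLL-U to CG converter of the kind mentioned here could look roughly like the Python sketch below. This is only an illustration under stated assumptions, not the planned implementation: `conllu_to_cg` is a hypothetical name, the input is assumed to be tab-separated CoNLL-U, and a multiword-token range line such as `6-7` is assumed to map to a single cohort whose member tokens become a reading followed by progressively deeper subreadings.

```python
def conllu_to_cg(conllu):
    """Convert tab-separated CoNLL-U text into CG stream text."""
    out = []
    span_end = 0   # last token ID covered by the current multiword token
    depth = 1      # indentation depth (in tabs) of the next reading
    for line in conllu.splitlines():
        if not line.strip():
            continue
        if line.startswith("#"):
            out.append(line)                      # keep comment lines as they are
            continue
        cols = line.split("\t")
        tok_id, form, lemma, upos, _xpos, feats, head, deprel = cols[:8]
        if "-" in tok_id:
            # Multiword token such as "6-7 er": one cohort for the surface form.
            span_end = int(tok_id.split("-")[1])
            depth = 1
            out.append('"<%s>"' % form)
            continue
        if "." in tok_id:
            continue                              # empty nodes are ignored in this sketch
        if int(tok_id) > span_end:
            out.append('"<%s>"' % form)           # ordinary token: its own cohort
            depth = 1
        tags = [t for t in [upos] + feats.split("|") if t != "_"]
        if deprel != "_":
            tags.append("@" + deprel)
        tags.append("#%s->%s" % (tok_id, head))
        out.append("\t" * depth + '"%s" %s' % (lemma, " ".join(tags)))
        if int(tok_id) <= span_end:
            depth += 1                            # next member of the span is a subreading
    return "\n".join(out)
```

With a converter in each direction, the CG view can be regenerated from the stored CoNLL-U after every edit, which matches the "one storage format plus converters" design described above.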
There's an example here: #53 (comment).
Ah, I see! I thought these are ambiguous analyses. I'll fix it tomorrow.
All the editing will still be handled with CoNLL-U, but the user will see (and will be able to edit) CG3.
Okay, I trust your judgement here. Hopefully this won't be too difficult.
Ah, I see! I thought these are ambiguous analyses. I'll fix it tomorrow.
The extra indent is what indicates that it's a subreading and not an additional possible reading.
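The indentation rule can be sketched in a few lines of Python. The helper names `classify_cohort` and `is_ambiguous` are hypothetical, and the sketch assumes that lines starting with `;` are readings already removed by a rule and can be ignored. Readings at the shallowest indentation count as main readings; anything indented more deeply counts as a subreading.

```python
def classify_cohort(reading_lines):
    """Return (main_readings, subreadings) for the reading lines of one cohort."""
    indents = []
    for line in reading_lines:
        if line.startswith(";"):
            continue                      # assumption: ';' marks a removed reading
        stripped = line.lstrip()
        if stripped.startswith('"'):
            indents.append(len(line) - len(stripped))
    if not indents:
        return 0, 0
    base = min(indents)                   # shallowest indent = main readings
    main = sum(1 for i in indents if i == base)
    return main, len(indents) - main


def is_ambiguous(reading_lines):
    """A cohort is ambiguous when it has more than one main reading."""
    return classify_cohort(reading_lines)[0] > 1
```

Under this rule a cohort with one main reading and one subreading is not ambiguous, while two readings at the same indentation are.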
Terms:
ambiguous:
"<German>"
    "German" adj SELECT:78
; "German" n sg SELECT:78
; "German" np ant m sg SELECT:78
and:
"<called>"
    "call" vblex pp
    "call" vblex past
subreadings:
"<wasn't>"
    "be" vbser past p1 sg
        "not" adv
    "be" vbser past p3 sg
        "not" adv
Here are my thoughts:
1) If the data in CG format is ambiguous (more than one possible reading per word), then it shouldn't be convertible to CoNLL-U ... the box/button for Conllu should be greyed out.
2) The backend data storage should be conllu; features not available in other formats (e.g. editing the surface segmentation) will just not be available in those windows.
3) If the features in conllu are feat=val pairs, then only the val should be shown in CG mode, maybe with a tooltip with the feat name.
4) If the features in conllu are just tags, then having them as "n|m|sg" in Conllu and "n m sg" in CG is fine.
5) If the CG is fully disambiguated (1 main reading per token) and it doesn't have @, then @x should be added automatically. Also the token indices should be added automatically.
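Points 1 and 5 could be combined along the lines of the following Python sketch. Everything here is an assumption made for illustration: the function name `add_default_tags`, the choice to raise `ValueError` on ambiguous input (an interface would grey out the button instead), and in particular the placeholder head `0` in the generated `#i->0` index, since the thread does not say what the default head should be.

```python
import re


def add_default_tags(cg_text):
    """Add '@x' and '#i->0' to readings that lack them; refuse ambiguous input."""
    out, index = [], 0
    base_indent, main_readings = None, 0
    for line in cg_text.splitlines():
        body = line.lstrip()
        indent = len(line) - len(body)
        if line.startswith('"<'):
            base_indent, main_readings = None, 0   # a new cohort starts here
        if indent == 0 or not body.startswith('"'):
            out.append(line)      # cohort headers, comments, blanks, ';' readings
            continue
        if base_indent is None:
            base_indent = indent  # indent of the first reading in this cohort
        if indent <= base_indent:
            main_readings += 1
            if main_readings > 1:
                raise ValueError("ambiguous cohort, cannot convert: " + body)
        index += 1                # subreadings get their own index, as in the example
        if not re.search(r"\s@\S", line):
            line += " @x"
        if not re.search(r"\s#\d+->\d+", line):
            line += " #%d->0" % index   # head 0 is a placeholder assumption
        out.append(line)
    return "\n".join(out)
```

Readings that already carry a function label or an index are left untouched, so running the function twice gives the same result as running it once.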
Here is a video of how I do editing in CG mode; ideally the interface shouldn't be slower to use than this: https://youtu.be/HveQZh178T4
Thank you @ftyers !
if the CG is fully disambiguated (1 main reading per token) and it doesn't have @ then @x should be added automatically. Also the token indices should be added automatically
If I get it right, this is probably not about conversion, but about modifying the data by default. Do you want the interface to add @x automatically as soon as CG data are uploaded?
Here is a video how I do editting in CG mode ideally the interface shouldn't be slower to use than this: https://youtu.be/HveQZh178T4
This is relevant for #10.
the box/button for Conllu should be greyed out.
Or / in addition, it would be good to have a message pop up saying why it won't work, or the convert button can be red instead of grey, or the format detection message can turn red and say something like "CG3 with subreadings - not displayable/convertible!"
The main thing is that it should be made obvious to the user why it won't convert the CG or display it as a graph.
If I get it right, this is probably not about conversion, but about modifying the data by default. Do you want the interface to add @x automatically as soon as CG data are uploaded?
Only if the data are unambiguous. Perhaps, instead of doing it automatically, it could be a keyboard shortcut, like select all with CTRL+A and then CTRL+@?
@jonorthwash yes, any of those options are good too.
@maryszmary yes, it's also relevant to #10, but I don't think we should implement the keyboard shortcuts for disambiguation right now: there's not enough time and it's not high priority.
Now the converter supports subreadings:
It also reacts to ambiguity:
The previous version was converting CG to conllx, which had bugs and is not supported in the current version anyway. Necessary for #40.