Open nschneid opened 2 days ago
I wonder how hard it would be to flag this as an invalid tokenization, because a gap should never be followed by punctuation (except maybe opening punctuation like "(").
I can reproduce this when going into edit mode. When the user clicks edit, a string representation of the CGELTree (the result of calling the __str__ method) is passed as a URL parameter to populate the edit window's text box. The CGELTree object will have included a postpunct attribute on the gap terminal, but the __str__ method won't write metadata tags for nodes lacking a text attribute (incl. GAP nodes): https://github.com/nert-nlp/cgel/blob/main/cgel.py#L192
The reason the punctuation gets lost right away is that when the user opens the edit window, we immediately draw the gtree from the contents of the text box. (More precisely, we initialize an ActivedopTree object from the text window string and write the gtree by calling the gtree() method on that object.)
I like the idea of flagging invalid tokenization -- we could also try to automatically correct this type of tokenization by moving prepunct/postpunct attributes of gap terminals to surrounding non-gap terminals. But I think it might be better to raise the issue to the user, ideally before the user enters edit mode (and before the dopparser attempts initial parses).
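For reference, here is a rough sketch of what the automatic correction could look like. Everything in it is hypothetical: the Terminal class is a stand-in and not the real CGEL node class, a gap is assumed to be a terminal whose text is None, and relocate_gap_punct is a made-up name. The policy is also just one possible choice: punctuation on a gap goes to the postpunct of the nearest preceding word, or to the prepunct of the first following word if no word precedes the gap.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Terminal:
    """Hypothetical stand-in for a terminal node; not the real CGEL class."""
    text: Optional[str]  # assumption: None marks a gap
    prepunct: List[str] = field(default_factory=list)
    postpunct: List[str] = field(default_factory=list)


def relocate_gap_punct(terminals: List[Terminal]) -> List[Terminal]:
    """Move punctuation off gap terminals onto neighbouring non-gap terminals.

    Mutates the terminals in place and returns the same list.
    """
    last_word = None  # most recent non-gap terminal seen so far
    pending = []      # punctuation from gaps that precede every word
    for term in terminals:
        if term.text is None:
            # Collect the gap's punctuation in surface order, then clear it.
            moved = term.prepunct + term.postpunct
            term.prepunct, term.postpunct = [], []
            if last_word is not None:
                last_word.postpunct.extend(moved)
            else:
                pending.extend(moved)
        else:
            if pending:
                # Punctuation from leading gaps goes in front of this word.
                term.prepunct[:0] = pending
                pending = []
            last_word = term
    if pending:
        raise ValueError("no non-gap terminal available to carry punctuation")
    return terminals
```

One known weakness of this policy: opening punctuation such as "(" after a gap would end up on the preceding word rather than the following one, which is part of why raising the problem to the user may be the safer option.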
Yeah, ideally it would block the retokenize button from being pressed, with an error message explaining why the tokenization is invalid. A kind of form field validation.
Other issues may be higher priority though.
e.g.
word _. .
gets translated to CGEL without the final period
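A minimal sketch of the validation idea, run on the example above. It assumes the input is a whitespace-separated token string and that _. is the gap marker (inferred from the example, not confirmed against the tool). The function name gap_punct_errors is hypothetical, and "opening punctuation" is approximated by the Unicode categories Ps and Pi.

```python
import unicodedata

GAP = "_."              # assumed gap marker, inferred from the example
OPENING = {"Ps", "Pi"}  # Unicode categories: open brackets, initial quotes


def is_punct(token):
    """True if every character of the token is Unicode punctuation."""
    return bool(token) and all(
        unicodedata.category(ch).startswith("P") for ch in token
    )


def is_opening_punct(token):
    """True if every character is an opening bracket or initial quote."""
    return bool(token) and all(
        unicodedata.category(ch) in OPENING for ch in token
    )


def gap_punct_errors(tokens, gap=GAP):
    """Return (gap_index, offending_token) for each gap followed by
    non-opening punctuation."""
    errors = []
    for i in range(len(tokens) - 1):
        nxt = tokens[i + 1]
        # The gap marker itself consists of punctuation characters,
        # so consecutive gaps are excluded explicitly.
        if (tokens[i] == gap and nxt != gap
                and is_punct(nxt) and not is_opening_punct(nxt)):
            errors.append((i, nxt))
    return errors


print(gap_punct_errors("word _. .".split()))  # [(1, '.')]
```

A non-empty result could be used to disable the retokenize button and show a message naming the offending token. Note that a straight double quote is not in an opening category, so it would be flagged even when it opens a quotation.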