nschneid / activedop

A treebank annotation tool based on a statistical parser that is re-trained during annotation
GNU General Public License v2.0

in input sentence, punctuation immediately after gap gets lost #83

Open nschneid opened 2 days ago

nschneid commented 2 days ago

E.g., the input "word _. ." gets translated to CGEL without the final period.

nschneid commented 2 days ago

I wonder how hard it would be to flag this as an invalid tokenization, because a gap should never be followed by punctuation (except maybe opening punctuation like "(").
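A minimal sketch of such a check over a whitespace-tokenized input, to give a sense of the effort involved. It assumes that "_." is the gap placeholder (as in the example above) and that punctuation can be recognized by Unicode category; `GAP_TOKEN`, `OPENING_PUNCT`, `is_punct`, and `find_gap_punct_errors` are hypothetical names, not part of activedop.

```python
import unicodedata

GAP_TOKEN = "_."  # assumption: gap placeholder as written in the example input
OPENING_PUNCT = {"(", "[", "{", "\u201c", "\u2018"}  # assumption: allowed after a gap


def is_punct(token):
    """True if every character of the token is in a Unicode punctuation category."""
    return bool(token) and all(
        unicodedata.category(ch).startswith("P") for ch in token
    )


def find_gap_punct_errors(tokens):
    """Return (gap_index, following_token) pairs where a gap precedes punctuation."""
    errors = []
    for i in range(len(tokens) - 1):
        nxt = tokens[i + 1]
        if (
            tokens[i] == GAP_TOKEN
            and nxt != GAP_TOKEN  # the gap token itself consists of punctuation chars
            and is_punct(nxt)
            and nxt not in OPENING_PUNCT
        ):
            errors.append((i, nxt))
    return errors


print(find_gap_punct_errors("word _. .".split()))  # [(1, '.')]
```

An empty result would mean the tokenization passes this particular check; a non-empty one could be turned into a message for the user.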

bwaldon commented 2 days ago

I can reproduce this when going into edit mode. When the user clicks edit, a string representation of the CGELTree (the result of calling the __str__ method) is passed as a URL parameter to populate the edit window's text box. The CGELTree object will have included a postpunct attribute on the gap terminal, but the __str__ method won't write metadata tags for nodes lacking a text attribute (incl. GAP nodes): https://github.com/nert-nlp/cgel/blob/main/cgel.py#L192
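To illustrate the mechanism, here is a toy model of the behavior, not the actual cgel.py code. `ToyNode` is a hypothetical stand-in, and the serialized format is only illustrative; the point is that punctuation is written inside the branch that requires a text attribute, so a gap node's postpunct never reaches the string.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ToyNode:
    """Hypothetical stand-in for a tree node; not the real CGEL node class."""
    label: str
    text: Optional[str] = None          # None for gap nodes
    postpunct: List[str] = field(default_factory=list)

    def __str__(self):
        parts = [self.label]
        # Metadata (including punctuation) is only written when text is present,
        # so anything stored on a text-less node is silently dropped.
        if self.text is not None:
            parts.append(f':t "{self.text}"')
            for p in self.postpunct:
                parts.append(f':p "{p}"')
        return "(" + " ".join(parts) + ")"


word = ToyNode("N", text="word", postpunct=["."])
gap = ToyNode("GAP", postpunct=["."])

print(word)  # (N :t "word" :p ".")
print(gap)   # (GAP)
```

Round-tripping the gap node through its string form therefore loses the period, even though the in-memory object still has it.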

The punctuation is lost right away because, when the user opens the edit window, we immediately draw the gtree from the contents of the text box. (More precisely, we initialize an ActivedopTree object from the text box string and write the gtree by calling the gtree() method on that object.)

I like the idea of flagging invalid tokenization -- we could also try to automatically correct this type of tokenization by moving prepunct/postpunct attributes of gap terminals to surrounding non-gap terminals. But I think it might be better to raise the issue to the user, ideally before the user enters edit mode (and before the dopparser attempts initial parses).
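For reference, a rough sketch of what the automatic correction could look like, assuming terminals are available as a flat left-to-right list and that a gap is a terminal without text. `Terminal` and `reattach_gap_punct` are hypothetical names, not existing activedop or cgel code.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Terminal:
    """Hypothetical flat terminal; text is None for a gap."""
    text: Optional[str]
    prepunct: List[str] = field(default_factory=list)
    postpunct: List[str] = field(default_factory=list)


def reattach_gap_punct(terminals):
    """Move punctuation off gap terminals onto the nearest non-gap neighbor."""
    # Forward pass: a gap's postpunct goes to the closest preceding non-gap terminal.
    for i, term in enumerate(terminals):
        if term.text is None and term.postpunct:
            prev = next((t for t in reversed(terminals[:i]) if t.text is not None), None)
            if prev is not None:
                prev.postpunct.extend(term.postpunct)
                term.postpunct = []
    # Reverse pass: a gap's prepunct goes to the closest following non-gap terminal.
    # Going right to left keeps surface order when several gaps are adjacent.
    for i in range(len(terminals) - 1, -1, -1):
        term = terminals[i]
        if term.text is None and term.prepunct:
            nxt = next((t for t in terminals[i + 1:] if t.text is not None), None)
            if nxt is not None:
                nxt.prepunct = term.prepunct + nxt.prepunct
                term.prepunct = []
    return terminals


terms = reattach_gap_punct([Terminal("word"), Terminal(None, postpunct=["."])])
print(terms[0].postpunct)  # ['.']
```

Punctuation stays on the gap when there is no non-gap neighbor on the relevant side, so that case would still need to be reported to the user.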

nschneid commented 2 days ago

Yeah, ideally it would block the retokenize button from being pressed, with an error message explaining why the tokenization is invalid. A kind of form field validation.

Other issues may be higher priority though.