token → subtokens and merge subtokens

jonorthwash / ud-annotatrix

GNU General Public License v3.0


token → subtokens and merge subtokens #53

Open jonorthwash opened 7 years ago

jonorthwash commented 7 years ago

Add functionality to the existing tokenisation routines (#2) so that tokens can be split into subtokens and adjacent subtokens can be merged.

maryszmary commented 7 years ago

Hm, not sure I understand. Can you give an example?

jonorthwash commented 7 years ago

I guess one example is "wanna". After splitting this into two tokens, there should be a way to make them subtokens of a single token.

Another example is here:

1   Бұлардың    бұл _   prn dem|pl|gen  6   nmod:poss   _   _
2   бір бір _   num _   3   nummod  _   _
3   ауыз    ауыз    _   n   nom 6   acl:relcl   _   _
4   бола    бол _   v   iv|prc_impf 3   cop _   _
5   алмаған ал  _   vaux    neg|gpr_past    3   aux _   _
6   себебі  себеп   _   n   px3sp|nom   7   nsubj   _   _
7-8 не  _   _   _   _   _   _   _   _
7   не  не  _   prn itg|nom 0   root    _   _
8   _   е   _   cop aor|p3|sg   7   cop _   _
9   ?   ?   _   sent    _   7   punct   _   _

Let's say you want to make 7 and 8 separate tokens instead of subtokens. There should be some way to do that in the interface.
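For illustration only, here is a minimal Python sketch of the two operations at the format level, assuming a sentence is held as a list of 10-field CoNLL-U rows. The function names (`make_row`, `merge_as_subtokens`, `unmerge_subtokens`) and the sample "wanna" rows are hypothetical and are not taken from the ud-annotatrix code base; the interface-level behaviour discussed in this issue would sit on top of something like this.

```python
def make_row(word_id, form):
    """Build a bare 10-field CoNLL-U row (hypothetical helper for the demo)."""
    return [str(word_id), form, form] + ["_"] * 7


def merge_as_subtokens(rows, first_id, last_id, surface):
    """Insert a range row (e.g. '2-3') before word `first_id`, so that words
    first_id..last_id become subtokens of one surface token.
    Word IDs and heads are left unchanged."""
    range_row = [f"{first_id}-{last_id}", surface] + ["_"] * 8
    out = []
    for row in rows:
        if row[0] == str(first_id):
            out.append(range_row)
        out.append(row)
    return out


def unmerge_subtokens(rows, range_id):
    """Drop the range row (e.g. '7-8') so that its words become ordinary,
    separate tokens. Word IDs and heads are left unchanged."""
    return [row for row in rows if row[0] != range_id]


# Illustrative data only: "wanna" already split into "want" + "to".
sentence = [make_row(1, "I"), make_row(2, "want"),
            make_row(3, "to"), make_row(4, "go")]

merged = merge_as_subtokens(sentence, 2, 3, "wanna")
for row in merged:
    print("\t".join(row))

# Undoing the merge gives back the original rows.
restored = unmerge_subtokens(merged, "2-3")
```

In the Kazakh example above, the second operation would correspond to dropping the `7-8` row. Word 8 would still have `_` as its form afterwards, so an editing interface would presumably need to ask for a surface form at that point.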

maryszmary commented 7 years ago

Ah, I see, this is closely related to #8 and #36.

jonorthwash commented 7 years ago

Yes, those are both prerequisites to working on this, probably. Those are the display- and format-level implementations; this is the editing-interface-level implementation, I guess.

maryszmary commented 7 years ago

If I got it right, this works now: left-click a token, press s, then use an arrow key to select which neighbor to merge it with: