jonorthwash / ud-annotatrix

GNU General Public License v3.0

token → subtokens and merge subtokens #53

Open jonorthwash opened 7 years ago

jonorthwash commented 7 years ago

Add functionality to the existing tokenisation routines (#2) so that tokens can be split into subtokens and adjacent subtokens can be merged.

maryszmary commented 7 years ago

Hm, not sure I understand. Can you give an example?

jonorthwash commented 7 years ago

I guess one example is "wanna". After splitting this into two tokens, there should be a way to make them subtokens of a single token.

Another example is here:

1   Бұлардың    бұл _   prn dem|pl|gen  6   nmod:poss   _   _
2   бір бір _   num _   3   nummod  _   _
3   ауыз    ауыз    _   n   nom 6   acl:relcl   _   _
4   бола    бол _   v   iv|prc_impf 3   cop _   _
5   алмаған ал  _   vaux    neg|gpr_past    3   aux _   _
6   себебі  себеп   _   n   px3sp|nom   7   nsubj   _   _
7-8 не  _   _   _   _   _   _   _   _
7   не  не  _   prn itg|nom 0   root    _   _
8   _   е   _   cop aor|p3|sg   7   cop _   _
9   ?   ?   _   sent    _   7   punct   _   _

Let's say you want to make 7 and 8 separate tokens instead of subtokens. There should be some way to do that in the interface.
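To make the request concrete, here is a minimal Python sketch of what "making 7 and 8 separate tokens" could amount to at the format level: dropping the `7-8` range line while leaving the word lines alone. This is not ud-annotatrix's actual code; the function name `remove_supertoken` is hypothetical, and the sketch assumes tab-separated CoNLL-U lines with no comment lines.

```python
def remove_supertoken(conllu_lines, span_id):
    """Drop the multiword-token range line (e.g. "7-8") from a sentence.

    The words it covered keep their own lines, so no IDs or heads change.
    Hypothetical helper, for illustration only.
    """
    return [ln for ln in conllu_lines if ln.split("\t")[0] != span_id]


# Tail of the example sentence above, with tabs restored between columns.
sentence = [
    "6\tсебебі\tсебеп\t_\tn\tpx3sp|nom\t7\tnsubj\t_\t_",
    "7-8\tне\t_\t_\t_\t_\t_\t_\t_\t_",
    "7\tне\tне\t_\tprn\titg|nom\t0\troot\t_\t_",
    "8\t_\tе\t_\tcop\taor|p3|sg\t7\tcop\t_\t_",
    "9\t?\t?\t_\tsent\t_\t7\tpunct\t_\t_",
]
result = remove_supertoken(sentence, "7-8")
```

Because only the range line goes away, nothing needs renumbering in this direction.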

maryszmary commented 7 years ago

Ah, I see, this is closely related to #8 and #36.

jonorthwash commented 7 years ago

Yes, those are both prerequisites to working on this, probably. Those are the display- and format-level implementations; this is the editing-interface-level implementation, I guess.

maryszmary commented 7 years ago

If I understood correctly, this now works: left-click a token, press s, then use an arrow key to choose which neighbor to merge it with:

1. image

2. image
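At the format level, a merge like this could be sketched as inserting a range line over two adjacent words. The Python below is only an illustration under assumptions, not the tool's implementation: `add_supertoken` is a hypothetical name, the input is assumed to be tab-separated CoNLL-U lines, and the "wanna" annotation is a toy example with most fields left blank.

```python
def add_supertoken(conllu_lines, first_id, last_id, surface_form):
    """Insert a range line "first-last" directly before word `first_id`.

    Hypothetical helper: the range line carries only the surface form,
    with the remaining eight columns left as "_".
    """
    span = "\t".join([f"{first_id}-{last_id}", surface_form] + ["_"] * 8)
    out = []
    for ln in conllu_lines:
        if ln.split("\t")[0] == str(first_id):
            out.append(span)  # the range line precedes the words it covers
        out.append(ln)
    return out


# Toy sentence (annotation mostly blank, for illustration only).
words = [
    "1\tI\t_\t_\t_\t_\t_\t_\t_\t_",
    "2\twan\t_\t_\t_\t_\t_\t_\t_\t_",
    "3\tna\t_\t_\t_\t_\t_\t_\t_\t_",
    "4\tgo\t_\t_\t_\t_\t_\t_\t_\t_",
]
merged = add_supertoken(words, 2, 3, "wanna")
```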

jonorthwash commented 7 years ago

Cool, what about splitting?

maryszmary commented 7 years ago

You mean, removing the supertoken?

jonorthwash commented 7 years ago

> You mean, removing the supertoken?

Yes, that's a good way to think about it.

maryszmary commented 7 years ago

Done.

  1. Select the token to delete: image

  2. Press delete: image

jonorthwash commented 7 years ago

Looking good!

jonorthwash commented 7 years ago

I'm having trouble splitting. It says the feature isn't supported yet?

maryszmary commented 7 years ago

It is unsupported only for sentences with spans, because of the index shift issue. As I wrote in #63:

> The only thing affected by the index shift issue now is merging and splitting tokens, which turned out to be a bit trickier.

I'm working on it.
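To illustrate why spans make this trickier, here is a rough Python sketch of the renumbering a split implies: every ID and HEAD after the split word moves up by one, and range lines have to move (or grow) with them. It is a sketch under assumptions, not the project's code: `split_word` and `row` are hypothetical names, input is tab-separated CoNLL-U without comment lines or empty nodes, and a range that covers the split word is assumed to grow to include the new word.

```python
def split_word(conllu_lines, word_id, first_form, second_form):
    """Split word `word_id` in two and renumber everything after it.

    The new word gets ID word_id + 1 and blank ("_") annotation.
    """
    out = []
    for ln in conllu_lines:
        cols = ln.split("\t")
        if "-" in cols[0]:
            # Range line such as "7-8": shift or extend its bounds.
            start, end = (int(x) for x in cols[0].split("-"))
            if start > word_id:
                start += 1
            if end >= word_id:
                end += 1
            cols[0] = f"{start}-{end}"
            out.append("\t".join(cols))
            continue
        idx = int(cols[0])
        # Heads pointing past the split word shift by one.
        if cols[6].isdigit() and int(cols[6]) > word_id:
            cols[6] = str(int(cols[6]) + 1)
        if idx == word_id:
            cols[1] = first_form
            out.append("\t".join(cols))
            out.append("\t".join([str(word_id + 1), second_form] + ["_"] * 8))
        else:
            if idx > word_id:
                cols[0] = str(idx + 1)
            out.append("\t".join(cols))
    return out


def row(tok_id, form, head):
    """Build a 10-column line with only ID, FORM and HEAD filled in."""
    return "\t".join([tok_id, form, "_", "_", "_", "_", head, "_", "_", "_"])


rows = [
    row("1", "a", "2"),
    row("2", "bc", "0"),
    row("3-4", "de", "_"),
    row("3", "d", "2"),
    row("4", "e", "3"),
]
shifted = split_word(rows, 2, "b", "c")
```

In this toy input, splitting word 2 pushes the `3-4` span to `4-5` and bumps the head of the last word from 3 to 4, which is the kind of bookkeeping the index shift issue refers to.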

jonorthwash commented 7 years ago

Ah, I think I misunderstood you at the time. Okay, cool.

ftyers commented 6 years ago

@maryszmary Is this fixed now? I see that #63 is fixed.