own-pt / glosstag

Semantically Tagged PWN glosses

tokenization issues #28

Closed arademaker closed 1 year ago

arademaker commented 1 year ago

after #9

  1. we still have cases where names like 'A.B.Fulano' are kept in a single token.
  2. we may have other tokens that need to be split. We can search for `.` or `-` inside token forms.
  3. we have some cases of WF tokens with an explicit space as `sep`. The space is the default sep, so we need to remove those cases and check whether the detokenization approach still works, matching the `text` field.
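
The three checks above could be sketched roughly as follows. This is only an illustration, not the repository's code: the token shape (a dict with a `form` key and an optional `sep` key), the assumption that `sep` is the string that follows a token, and the helper names `split_candidates`, `drop_default_sep` and `detokenize` are all hypothetical.

```python
import re

# Hypothetical token shape: {"form": str, "sep": str (optional)}.
# Assumption: `sep` is the string that follows the token; a space when absent.

# A '.' or '-' with a word character on both sides, e.g. "A.B" or "l-k".
INNER_PUNCT = re.compile(r"\w[.-]\w")


def split_candidates(tokens):
    """Return the forms that contain '.' or '-' between two word characters."""
    return [t["form"] for t in tokens if INNER_PUNCT.search(t["form"])]


def drop_default_sep(tokens, default=" "):
    """Return copies of the tokens without a `sep` that only repeats the default."""
    return [
        {k: v for k, v in t.items() if not (k == "sep" and v == default)}
        for t in tokens
    ]


def detokenize(tokens, default=" "):
    """Join each form with its `sep` (default when absent); trim trailing whitespace."""
    return "".join(t["form"] + t.get("sep", default) for t in tokens).rstrip()


# Made-up example data, not taken from the corpus.
tokens = [
    {"form": "A.B.Fulano"},
    {"form": "was", "sep": " "},        # explicit default sep, redundant
    {"form": "well-known", "sep": ""},  # no space before the final period
    {"form": "."},
]
text = "A.B.Fulano was well-known."

candidates = split_candidates(tokens)
cleaned = drop_default_sep(tokens)
roundtrip_ok = detokenize(cleaned) == text
```

Here `candidates` lists the forms to inspect by hand (not every hit needs splitting, e.g. hyphenated words), and `roundtrip_ok` tells whether removing the redundant `sep` values still reproduces the `text` field.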
arademaker commented 1 year ago

this issue was closed with the above commits.