tokenization issues

own-pt / glosstag

Semantically Tagged PWN glosses


Closed arademaker closed 2 years ago

arademaker commented 2 years ago

after #9

we still have cases where names like 'A.B.Fulano' are kept in a single token
we may have other tokens that need to be split. We can search for `.` or `-` inside token forms.
we have some cases of WF tokens with `sep` explicitly set to a space. The space is already the default sep, so we need to remove those cases and check that the detokenization approach still works, matching the `text` field.
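
The three checks above could be sketched roughly as follows. This is only an illustration: the token shape (a dict with a `form` and an optional `sep` that follows the token) and all function names are assumptions, not the actual glosstag data format or code.

```python
import re

# Assumed token shape (hypothetical): {"form": str, "sep": str (optional)},
# where "sep" is the separator that follows the token and defaults to a space.
DEFAULT_SEP = " "

# A '.' or '-' with a word character on both sides, e.g. "A.B.Fulano".
SUSPECT = re.compile(r"\w[.\-]\w")


def suspicious_forms(tokens):
    """Return token forms that contain '.' or '-' between word characters."""
    return [t["form"] for t in tokens if SUSPECT.search(t["form"])]


def redundant_seps(tokens):
    """Return forms of tokens whose explicit sep equals the default."""
    return [t["form"] for t in tokens if t.get("sep") == DEFAULT_SEP]


def detokenize(tokens):
    """Join forms, using each token's sep (or the default) between tokens."""
    parts = []
    for i, tok in enumerate(tokens):
        parts.append(tok["form"])
        if i < len(tokens) - 1:  # no separator after the last token
            parts.append(tok.get("sep", DEFAULT_SEP))
    return "".join(parts)


def matches_text(tokens, text):
    """Check that detokenization reproduces the gloss text exactly."""
    return detokenize(tokens) == text


# Made-up example gloss, not taken from the corpus.
tokens = [
    {"form": "A.B.Fulano"},
    {"form": "wrote", "sep": " "},
    {"form": "it", "sep": ""},
    {"form": "."},
]
```

Dropping the redundant `sep` entries should leave `detokenize` unchanged, so `matches_text` can be run before and after the cleanup to confirm nothing broke.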

arademaker commented 2 years ago

this issue was closed with the above commits.