Open arademaker opened 4 years ago
Losing the original text?
not really losing, but not showing it properly indeed.
the proper tokenization (using `sep` as separator when available and a space as default) could be implemented, but then I'm not sure if any other corpus will have `sep` attributes to make it worthwhile… how is tokenization described by other tokenizers? could `touch.py` produce something akin to `sep`? would it be useful to do so?
Losing the original text? Is it the right thing to do?
we do have the `sep` for producing the original text. Question is: Remember that the default `sep` is a space, so when a token doesn't have `sep`, `sep=" "` is assumed. See the confusing explanation in https://github.com/own-pt/glosstag/blob/princeton/dtd/glosstag.dtd#L158-L161 for the glosstag corpus!
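
For illustration, here is a rough sketch of how the original text could be rebuilt from tokens and their `sep` values. It is not code from this repository: the `detokenize` function and the `(form, sep)` pair representation are hypothetical, and it assumes that `sep` is the string that follows a token, with `None` standing for a missing attribute (so the default space applies).

```python
from typing import Iterable, Optional, Tuple

DEFAULT_SEP = " "  # assumption: a token without sep is followed by one space


def detokenize(tokens: Iterable[Tuple[str, Optional[str]]]) -> str:
    """Rebuild text from (form, sep) pairs.

    Each token's sep is placed after its form; a sep of None means the
    attribute was absent, so DEFAULT_SEP is used. The separator of the
    last token is dropped so the result has no trailing separator.
    """
    token_list = list(tokens)
    pieces = []
    for index, (form, sep) in enumerate(token_list):
        pieces.append(form)
        if index < len(token_list) - 1:
            pieces.append(DEFAULT_SEP if sep is None else sep)
    return "".join(pieces)


# Hypothetical tokens: "existing" carries sep="" so the ";" attaches to it.
example = [("not", None), ("existing", ""), (";", None), ("nonexistent", None)]
print(detokenize(example))
```

With something like this, showing the text properly would only need the token forms plus the `sep` attribute where it is present.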