Open: starkadur opened this issue 5 days ago
I think you're looking for the `original` property of the tokens, not `txt`. See: https://github.com/mideind/Tokenizer/blob/master/src/tokenizer/tokenizer.py#L95
Do all tokens have the `original` property? I always get an error when trying to access it: `txt = token.original` causes an error, while `txt = token.txt` does not.
They should all have `original`, although it can sometimes be `None`.
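For anyone hitting the same error, here is a minimal sketch of a defensive access pattern. It assumes only what is stated above: a token may expose `original`, that value may be `None`, and `txt` is always present. The `FakeToken` class and the `source_text` helper are hypothetical stand-ins for illustration and are not part of the tokenizer package.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeToken:
    """Hypothetical stand-in for a tokenizer token (not the real class)."""
    txt: str
    original: Optional[str] = None


def source_text(token) -> str:
    """Return token.original when it exists and is not None, else token.txt."""
    # getattr with a default avoids an AttributeError if the attribute is missing.
    original = getattr(token, "original", None)
    return original if original is not None else token.txt


# Illustrative values only; real tokens may carry different text.
print(source_text(FakeToken(txt="17.", original="17")))
print(source_text(FakeToken(txt="júní")))
```

This only works around the error; it does not explain why `original` is missing in the first place.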
Can you provide a complete example of the code you're trying to run, and the version of the `tokenizer` package?
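One way to report the installed version is sketched below, assuming the package was installed under the distribution name `tokenizer` (an assumption; adjust the name if your installation differs). The `installed_version` helper is a hypothetical name used only for this example.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name: str = "tokenizer") -> str:
    """Return the installed version of a distribution, or a placeholder if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        # The distribution is not installed in this environment.
        return "not installed"


print("tokenizer version:", installed_version())
```

Pasting this output together with the full script that raises the error should make the problem easier to reproduce.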
If I send in "17 júní", the tokenizer returns "17. júní". Even though I use `tokenize()` (and not `split_into_sentences()`) and read the `txt` property (which should contain the original source text for the token), I still get this extra dot.