mideind / Tokenizer

A tokenizer for Icelandic text

A dot added in dates #52

Open starkadur opened 5 days ago

starkadur commented 5 days ago

If I send in "17 júní", the tokenizer returns "17. júní". Even though I use tokenize() (and not split_into_sentences()) and read the txt property (which should contain the original source text for the token), I still get this extra dot.

peturorri commented 5 days ago

I think you're looking for the original property of the tokens, not txt. See: https://github.com/mideind/Tokenizer/blob/master/src/tokenizer/tokenizer.py#L95

starkadur commented 4 days ago

Do all tokens have the original property? I always get an error when trying to access it: txt = token.original raises an error, while txt = token.txt does not.

peturorri commented 3 days ago

They should all have original, although it can sometimes be None.

Can you provide a complete example of the code you're trying to run, and the version of the tokenizer package?
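
In the meantime, a defensive way to read the source text might look like the sketch below. It assumes only what has been said in this thread: tokens have a txt attribute, and original may be missing on some versions or set to None. The helper names source_text and installed_version are hypothetical, the stand-in token objects are made up for illustration (they are not produced by the tokenizer), and "tokenizer" as the installed distribution name is an assumption.

```python
from importlib import metadata
from types import SimpleNamespace


def source_text(token):
    """Return the token's source text, preferring `original` over `txt`."""
    # `original` may be absent (which would raise AttributeError on direct
    # access) or None, so fall back to `txt` in both cases.
    original = getattr(token, "original", None)
    return original if original is not None else token.txt


def installed_version(package="tokenizer"):
    """Return the installed version of a distribution, or None if not found."""
    # The distribution name "tokenizer" is an assumption; adjust if needed.
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Hypothetical stand-ins mirroring the report; real tokens come from the library.
with_original = SimpleNamespace(txt="17. júní", original="17 júní")
without_original = SimpleNamespace(txt="júní")
none_original = SimpleNamespace(txt="júní", original=None)
```

Calling source_text on each real token instead of reading token.original directly avoids the error, and the result of installed_version() is the version number to include in a reply here.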