mideind/Tokenizer
A tokenizer for Icelandic text
Domains
#8
Closed
sveinbjornt
closed
5 years ago
sveinbjornt
commented
5 years ago
New token type: Hashtag
New token type: Domain
Now identifies Icelandic phone numbers containing a space (e.g. "699 4224")
Numbers are only identified as phone numbers if they start with a valid digit (i.e. 4, 5, 6, 7, or 8)
Now identifies a currency abbreviation followed by a number as an amount (e.g. "kr. 9.900", "USD 200")
Tests & documentation for new token types
Added the SI unit Nm (newton-metre)
Additions to abbreviations
Moved most definitions from tokenizer.py to definitions.py
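
As a rough illustration of the rules listed above, here is a minimal sketch in Python. The regular expressions, the `classify` helper, and the label strings are all hypothetical and written only for this example; they are not taken from tokenizer.py or definitions.py, and the actual implementation may differ (for instance, it presumably supports far more currency abbreviations than the two shown here).

```python
import re

# Hypothetical patterns for illustration only; not the library's own definitions.

# Seven-digit Icelandic phone number, optional space after the third digit,
# first digit restricted to 4-8 as described in the list above.
PHONE_RE = re.compile(r"[4-8]\d{2} ?\d{4}")

# Currency abbreviation followed by a number. Only the two abbreviations
# from the examples are included; "." is treated as a thousands separator.
AMOUNT_RE = re.compile(r"(?:kr\.|USD) (?:\d{1,3}(?:\.\d{3})*|\d+)")

# Deliberately simplified hashtag and domain patterns (assumptions).
HASHTAG_RE = re.compile(r"#\w+")
DOMAIN_RE = re.compile(r"(?:[a-z0-9-]+\.)+[a-z]{2,}", re.IGNORECASE)


def classify(text):
    """Return a coarse, made-up label for a single candidate string."""
    if PHONE_RE.fullmatch(text):
        return "phone"
    if AMOUNT_RE.fullmatch(text):
        return "amount"
    if HASHTAG_RE.fullmatch(text):
        return "hashtag"
    if DOMAIN_RE.fullmatch(text):
        return "domain"
    return "other"


for sample in ["699 4224", "199 4224", "kr. 9.900", "USD 200", "#hashtag", "mideind.is"]:
    print(sample, "->", classify(sample))
```

Checking the first digit before accepting a seven-digit sequence keeps arbitrary numbers such as "199 4224" from being labelled as phone numbers, which appears to be the motivation for the restriction described above.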