mideind / Tokenizer

A tokenizer for Icelandic text

Twitter handles and @usernames can contain periods (@matur.a.mbl) but are broken into sentences #18

Closed (sveinbjornt closed 3 years ago)

sveinbjornt commented 4 years ago

The following text

Þetta var notandinn @matur.a.mbl á Twitter.

is split into two sentences at the first period in the username:

Tok(kind=11001, txt=None, val=(0, None))
Tok(kind=6, txt='Þetta', val=None)
Tok(kind=6, txt='var', val=None)
Tok(kind=6, txt='notandinn', val=None)
Tok(kind=28, txt='@matur', val='matur')
Tok(kind=1, txt='.', val=(3, '.'))
Tok(kind=11002, txt=None, val=None)
Tok(kind=11001, txt=None, val=(0, None))
Tok(kind=6, txt='a.mbl', val=None)
Tok(kind=6, txt='á', val=None)
Tok(kind=6, txt='Twitter', val=None)
Tok(kind=1, txt='.', val=(3, '.'))
Tok(kind=11002, txt=None, val=None)
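
The numeric kind values above are easier to follow with names attached. Lining these tokens up with the named kinds in the later output in this thread suggests the mapping below. This is a minimal sketch: the KIND_NAMES dictionary and the describe and count_sentences helpers are illustrative names, not part of the library's API, and the mapping covers only the codes seen in this issue.

```python
# Kind codes as they appear in the output above. The names are inferred by
# comparing against the named kinds shown later in this thread (assumption).
KIND_NAMES = {
    1: "PUNCTUATION",
    6: "WORD",
    28: "USERNAME",
    11001: "BEGIN SENT",
    11002: "END SENT",
}

# (kind, txt) pairs copied from the reported output
reported = [
    (11001, None), (6, "Þetta"), (6, "var"), (6, "notandinn"),
    (28, "@matur"), (1, "."), (11002, None),
    (11001, None), (6, "a.mbl"), (6, "á"), (6, "Twitter"),
    (1, "."), (11002, None),
]


def describe(tokens):
    """Return (kind name, text) pairs for a list of (kind, txt) tuples."""
    return [(KIND_NAMES.get(kind, str(kind)), txt) for kind, txt in tokens]


def count_sentences(tokens):
    """Count sentences by counting sentence-begin markers."""
    return sum(1 for kind, _ in tokens if kind == 11001)


for name, txt in describe(reported):
    print(f"{name:12} {txt!r}")
print("sentences:", count_sentences(reported))
```

Read this way, the problem is visible at a glance: the username token holds only "@matur", the period after it ends the first sentence, and "a.mbl" starts a second sentence as an ordinary word.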

Holado commented 3 years ago

Support has been added.
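
For readers wondering what such support might involve, here is a hypothetical sketch, not the library's actual implementation: a pattern that allows periods inside a username only when more word characters follow, so a sentence-final period is left alone. The names USERNAME_RE and find_usernames are made up for illustration.

```python
import re

# Hypothetical pattern (an assumption, not taken from the library):
# "@" not preceded by a word character (to skip e-mail addresses),
# then word characters, with periods allowed only between word characters.
USERNAME_RE = re.compile(r"(?<!\w)@\w+(?:\.\w+)*")


def find_usernames(text):
    """Return all @username matches found in text."""
    return USERNAME_RE.findall(text)


print(find_usernames("Þetta var notandinn @matur.a.mbl á Twitter."))
print(find_usernames("Þetta var notandinn @matur."))
print(find_usernames("Sendu póst á jon@example.is"))
```

The first call finds the full handle with its internal periods, the second leaves the closing period out of the match, and the third finds nothing because the "@" sits inside an e-mail address.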

Þetta var notandinn @matur.a.mbl á Twitter.

now becomes

{"k":"BEGIN SENT","t":""}
{"k":"WORD","t":"Þetta","o":"Þetta","s":[0,1,2,3,4]}
{"k":"WORD","t":"var","o":" var","s":[1,2,3]}
{"k":"WORD","t":"notandinn","o":" notandinn","s":[1,2,3,4,5,6,7,8,9]}
{"k":"USERNAME","t":"@matur.a.mbl","v":"matur.a.mbl","o":" @matur.a.mbl","s":[1,2,3,4,5,6,7,8,9,10,11,12]}
{"k":"WORD","t":"á","o":" á","s":[1]}
{"k":"WORD","t":"Twitter","o":" Twitter","s":[1,2,3,4,5,6,7]}
{"k":"PUNCTUATION","t":".","v":".","o":".","s":[0]}
{"k":"END SENT","t":""}
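
The JSON output above can be checked mechanically. The sketch below uses only the standard library and the records copied from this comment. It assumes, based on the values shown, that "o" holds each token's original text including leading whitespace and that "s" lists the positions in "o" of the characters in "t". The helper name span_matches is illustrative.

```python
import json

# Token records copied from the output above, one JSON object per string
records = [
    '{"k":"BEGIN SENT","t":""}',
    '{"k":"WORD","t":"Þetta","o":"Þetta","s":[0,1,2,3,4]}',
    '{"k":"WORD","t":"var","o":" var","s":[1,2,3]}',
    '{"k":"WORD","t":"notandinn","o":" notandinn","s":[1,2,3,4,5,6,7,8,9]}',
    '{"k":"USERNAME","t":"@matur.a.mbl","v":"matur.a.mbl",'
    '"o":" @matur.a.mbl","s":[1,2,3,4,5,6,7,8,9,10,11,12]}',
    '{"k":"WORD","t":"á","o":" á","s":[1]}',
    '{"k":"WORD","t":"Twitter","o":" Twitter","s":[1,2,3,4,5,6,7]}',
    '{"k":"PUNCTUATION","t":".","v":".","o":".","s":[0]}',
    '{"k":"END SENT","t":""}',
]
tokens = [json.loads(record) for record in records]

# Assumption: concatenating the "o" fields reproduces the input text
original = "".join(tok.get("o", "") for tok in tokens)

# The username should now be a single token inside a single sentence
usernames = [tok["v"] for tok in tokens if tok["k"] == "USERNAME"]
sentences = sum(1 for tok in tokens if tok["k"] == "BEGIN SENT")


def span_matches(tok):
    """Check that the characters of "o" at the positions in "s" spell "t"."""
    return "".join(tok["o"][i] for i in tok["s"]) == tok["t"]


spans_ok = all(span_matches(tok) for tok in tokens if "s" in tok)

print(original)
print(usernames, sentences, spans_ok)
```

With these records the "o" fields join back into the original sentence, there is one username token with the value matur.a.mbl, and only one sentence-begin marker.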