latin-language-toolkit / llt-tokenizer

Tokenizes Latin (and Greek) texts
MIT License

weird tokenization of nunc #34

Open balmas opened 9 years ago

balmas commented 9 years ago

Tokenize this text: "+Noster+poēta+,+nisi+cīvis+Rōmānus+esset+,+ā+populō+nunc+cīvitāte+dōnārētur+.+" (e.g. via http://services.perseids.org/llt/segtok?xml=false&shifting=false&newline_boundary=1&inline=true&text=%22+Noster+po%C4%93ta+,+nisi+c%C4%ABvis+R%C5%8Dm%C4%81nus+esset+,+%C4%81+popul%C5%8D+nunc+c%C4%ABvit%C4%81te+d%C5%8Dn%C4%81r%C4%93tur+.+%22&splitting=true)

See that nunc gets tokenized as:

<w s_n="1" n="12">nun</w><pc s_n="1" n="13">-</pc><pc s_n="1" n="14">-</pc><pc s_n="1" n="15">-</pc><w s_n="1" n="16">-c</w>
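For anyone who wants to reproduce this from a script rather than a browser, here is a minimal Python sketch that rebuilds the request above. The endpoint and query parameters are copied from the URL in this report; the helper names `build_segtok_url` and `fetch_tokens` are made up for illustration and are not part of llt-tokenizer, and the service may no longer be reachable at this address.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the URL in this report; it may have moved since.
BASE_URL = "http://services.perseids.org/llt/segtok"

# The sentence from the report, including the surrounding quotes and spaces.
TEXT = '" Noster poēta , nisi cīvis Rōmānus esset , ā populō nunc cīvitāte dōnārētur . "'


def build_segtok_url(text: str) -> str:
    """Build the segtok request URL with the same parameters as the report."""
    params = {
        "xml": "false",
        "shifting": "false",
        "newline_boundary": "1",
        "inline": "true",
        "text": text,
        "splitting": "true",
    }
    # safe="," keeps commas literal, as in the original URL; spaces become "+".
    return f"{BASE_URL}?{urlencode(params, safe=',')}"


def fetch_tokens(text: str, timeout: float = 10.0) -> str:
    """Send the request and return the raw response body (hypothetical helper)."""
    with urlopen(build_segtok_url(text), timeout=timeout) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    print(build_segtok_url(TEXT))
```

Calling `fetch_tokens(TEXT)` and searching the response for `nun</w>` should show whether the split is still happening, assuming the service responds as it did when this was reported.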

lichtr commented 9 years ago

I had a quick look at this. I could not solve the problem, but I think it has something to do with the nisi in front, which is not split. I guess the problem lies somewhere in the Worker class...

balmas commented 9 years ago

Another example, even weirder:

Nisi pācem sine morā ab hostibus petīverimus, neque urbs neque domus ūlla stāre poterit.

This crashes the server, but if I take the long i out of petīverimus it doesn't:

Nisi pācem sine morā ab hostibus petīverimus, neque urbs neque domus ūlla stāre poterit.
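To narrow down which macron triggers the crash, it may help to generate test inputs mechanically instead of editing the sentence by hand. The following Python sketch is only a debugging aid and says nothing about how llt-tokenizer handles long vowels internally; both function names are hypothetical. It assumes macrons are the only marks of interest and removes them via Unicode decomposition (U+0304 COMBINING MACRON).

```python
import unicodedata
from typing import Iterator

COMBINING_MACRON = "\u0304"


def strip_macrons(text: str) -> str:
    """Return text with every macron removed, e.g. long i becomes plain i."""
    decomposed = unicodedata.normalize("NFD", text)
    without = "".join(ch for ch in decomposed if ch != COMBINING_MACRON)
    return unicodedata.normalize("NFC", without)


def variants_removing_one_macron(text: str) -> Iterator[str]:
    """Yield copies of text, each with exactly one macron removed."""
    composed = unicodedata.normalize("NFC", text)
    for index, ch in enumerate(composed):
        plain = strip_macrons(ch)
        if plain != ch:
            yield composed[:index] + plain + composed[index + 1:]


if __name__ == "__main__":
    sentence = (
        "Nisi pācem sine morā ab hostibus petīverimus, "
        "neque urbs neque domus ūlla stāre poterit."
    )
    for variant in variants_removing_one_macron(sentence):
        print(variant)
```

Sending each variant to the tokenizer in turn would show whether only the ī in petīverimus matters or whether other long vowels have the same effect.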