mideind / Tokenizer

A tokenizer for Icelandic text

The tokenizer is missing some abbreviations #25

Closed: helga-lvl closed this issue 3 years ago

helga-lvl commented 3 years ago

Hello,

We're building a speech synthesizer and use the shallow tokenizer. It splits the text into sentences for normalization, and the sentence structure also tells us where to place pauses in the synthesized speech. However, the tokenizer does not handle some abbreviations (some common, others less so but still allowed 😊) and splits the sentence at them. In the most serious cases this can prevent normalization from happening, and it obviously makes the phrasing sound odd. It happens when an abbreviation ends with a period and the tokenizer reads that period as end-of-sentence rather than as part of the abbreviation. Could you add them?

The list is:

Thanks!
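For readers who want to see the failure mode concretely, here is a minimal sketch in plain Python. It is not the Tokenizer's actual implementation: `split_sentences` and the `ABBREVIATIONS` set are hypothetical names made up for this illustration, and the abbreviations in the set are examples only, not the list from this issue.

```python
# Illustrative abbreviations only; NOT the list requested in this issue.
ABBREVIATIONS = frozenset({"t.d.", "o.s.frv.", "þ.e."})


def split_sentences(text, abbreviations=frozenset()):
    """Naive splitter: end a sentence after any token that ends in
    '.', '!' or '?', unless that token is a known abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token[-1] in ".!?" and token.lower() not in abbreviations:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing words without final punctuation
        sentences.append(" ".join(current))
    return sentences


text = "Hringdu í mig, t.d. á morgun. Ég verð heima."

# Without the abbreviation list, the period in "t.d." ends the sentence early.
naive = split_sentences(text)

# With the list, "t.d." is kept inside its sentence.
aware = split_sentences(text, ABBREVIATIONS)

print(naive)
print(aware)
```

The first call yields three fragments, the second the two intended sentences. A normalizer or prosody model fed the first result would see "á morgun." as a sentence of its own.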

Holado commented 3 years ago

Hi! Just wanted to let you know that I'm looking into this. I'll be back with news, hopefully after the weekend!

Holado commented 3 years ago

Hi,

We have added these abbreviations and implemented tests. They should now work correctly in almost all cases. The only exception I've run into is when an abbreviation precedes a name, as the tokenizer doesn't have the context to tell whether the name starts a new sentence or continues the previous one. These exceptions are probably in the minority. You will have to pull the latest changes to get this new functionality. Please let us know if you run into any problems with this, or find some brand new ones!
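The remaining ambiguity can be sketched as follows. This is an assumption-laden illustration, not how the Tokenizer decides: `classify_boundary` and `KNOWN_ABBREVIATIONS` are hypothetical names, and capitalization is used here only to show why the case before a name cannot be settled from the two neighbouring tokens alone.

```python
# Illustrative abbreviations only; NOT the list added to the Tokenizer.
KNOWN_ABBREVIATIONS = frozenset({"t.d.", "þ.e."})


def classify_boundary(token, next_token, abbreviations=KNOWN_ABBREVIATIONS):
    """Classify the gap after `token` as 'split', 'continue' or 'ambiguous'.

    Looks only at the token and the one that follows it, which is
    exactly why a capitalized name after an abbreviation is undecidable.
    """
    if not token.endswith("."):
        return "continue"
    if token.lower() not in abbreviations:
        return "split"  # ordinary sentence-final period
    if next_token is None:
        return "split"  # abbreviation at the very end of the text
    if next_token[0].isupper():
        # Could be a new sentence, or a name inside the same sentence.
        return "ambiguous"
    return "continue"


print(classify_boundary("t.d.", "á"))      # lowercase word follows
print(classify_boundary("t.d.", "Jón"))    # a name follows
print(classify_boundary("morgun.", "Ég"))  # not an abbreviation
```

Resolving the "ambiguous" case needs more context than the two tokens provide, for example knowledge of whether the following word is a proper noun.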

helga-lvl commented 3 years ago

Thank you very much! It was mainly "sími" (phone) that I couldn't normalize correctly with this splitting. The rest are only changes in intonation, so the exceptional cases are fine. 😊