The tokenizer is missing some abbreviations

helga-lvl commented 3 years ago

Hello,

We're building a speech synthesizer and use the shallow tokenizer. It splits text into sentences for normalization, and the sentence structure tells us where the pauses belong in the synthesized speech. However, there are some abbreviations (some common, others less so but still allowed 😊) that the tokenizer does not handle: when an abbreviation ends with a period, the tokenizer reads the period as end-of-sentence rather than as part of the abbreviation and splits the sentence there. In the most serious cases this can prevent normalization from happening, and it obviously makes the phrasing odd as well. Could you add them?

The list is:

s. (sími) and rn. (reikningsnúmer) – these are the reason I started collecting cases. The normalizer expands the s in "s. 550-1234" to sími ONLY if it's followed by seven digits. However, the tokenizer splits this into two sentences, putting a break between s. and the number. The same applies to rn. Would it be possible to add this rule? I, at least, feel like I have written these abbreviations veeery often. 🤪
frák. (fráköst) – normally written without a period, but the period is more correct, and fráköst feels like the most discussed topic in the whole RMH. (I manually annotated 40,000 random sentences and I think most of them were describing basketball matches.)
ath. (athugið) – very commonly written both with and without a period, but the tokenizer splits the sentence when the period is there.
ps. – not normally written with a period, but someone might do so, in which case it would help to handle it (at least it's not ambiguous with anything else, right? :))
M.Sc. – B.Sc. is handled correctly and not split into separate sentences, but M.Sc. (1375 mentions in RMH) is.
m.v. (miðað við) – occurs 3867 times in RMH but is split into separate sentences.
vs. (versus) – not so common with the period but occurs 194 times in RMH.
km. (mm, dm, hm, sm, cm, etc.) – I wouldn't write these with a following period but according to RMH a LOT of people (2696 just for km.) do.
kcal. – another case where the period is not the most common form (I wouldn't write it) but is more correct.
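
To make the "s." case concrete, here is a minimal sketch of why the split breaks normalization. It is not the actual normalizer: the regex and the `expand_simi` helper are assumptions made up for illustration, built only from the rule described above (expand "s." to sími when seven digits follow).

```python
import re

# Hypothetical normalizer rule (an assumption, not the real implementation):
# "s." followed by a seven-digit number, optionally written as 3 digits,
# a hyphen, and 4 digits, is expanded to "sími".
PHONE_RULE = re.compile(r"\bs\.\s*(\d{3}-?\d{4})\b")


def expand_simi(sentence):
    """Expand "s." to "sími" only when a seven-digit number follows it."""
    return PHONE_RULE.sub(r"sími \1", sentence)


# Kept as one sentence, the rule sees both the abbreviation and the number.
print(expand_simi("s. 550-1234"))

# Split after the period, neither piece matches the rule on its own,
# so nothing gets expanded.
print([expand_simi(piece) for piece in ["s.", "550-1234"]])
```

Because the rule is applied sentence by sentence, the abbreviation and the digits have to end up in the same sentence for the expansion to fire.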

Thanks!

Holado commented 3 years ago

Hi! Just wanted to let you know I'm looking into this. I'll be back with news, hopefully after the weekend!

Holado commented 3 years ago

Hi,

We have added these abbreviations and implemented tests. They should now work correctly in almost all cases. The only exception I've run into is when the abbreviation precedes a name, as the tokenizer doesn't have the context to see whether the name starts a new sentence or continues the previous one. These exceptions are probably in the minority. You will have to pull the latest changes to get this new functionality. Please let us know if you run into any problems with this, or any brand-new ones!
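
As a rough illustration of the name ambiguity, here is a toy sentence splitter. It is a sketch only and not the Tokenizer's code: the `ABBREVIATIONS` subset, the `split_sentences` function, the capital-letter heuristic, and the example strings are all assumptions invented for this example.

```python
# Illustrative subset of the abbreviations from this issue (an assumption,
# not the Tokenizer's real abbreviation list).
ABBREVIATIONS = {"s.", "rn.", "ath.", "m.v.", "vs.", "km.", "M.Sc."}


def split_sentences(text):
    """Naive whitespace-based splitter that is aware of a few abbreviations."""
    sentences, current = [], []
    tokens = text.split()
    for i, tok in enumerate(tokens):
        current.append(tok)
        if not tok.endswith((".", "!", "?")):
            continue
        if i + 1 == len(tokens):
            continue  # last token: flushed after the loop
        nxt = tokens[i + 1]
        # A known abbreviation followed by a lowercase word or a digit is
        # treated as sentence-internal. If a capitalized word follows, the
        # splitter cannot tell a name from a sentence start; this sketch
        # arbitrarily chooses to split.
        if tok in ABBREVIATIONS and not nxt[0].isupper():
            continue
        sentences.append(" ".join(current))
        current = []
    if current:
        sentences.append(" ".join(current))
    return sentences


print(split_sentences("ath. s. 550-1234"))  # stays one sentence
print(split_sentences("Jón vs. Anna."))     # split before the name
```

With only the next token to go on, a capitalized word after an abbreviation can be either a name inside the sentence or the start of a new one, so any fixed choice is wrong some of the time.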

helga-lvl commented 3 years ago

Thanks a lot! It was mainly sími that I couldn't normalize correctly with this splitting. The rest are just changes in intonation, so the exceptional cases are fine. 😊

mideind / Tokenizer, issue #25