mideind / Tokenizer

A tokenizer for Icelandic text

Bigger ordinal numbers in the tokenizer #28

Closed by helga-lvl 3 years ago

helga-lvl commented 3 years ago

Hi! I'm making a normalizer and have written rules that recognize both cardinal and ordinal numbers up to 999 billion (999.999.999.999. is the highest ordinal; I don't expect anyone to ever write it, but whatever). I use the tokenizer to split text into sentences, and I was wondering about the reasoning behind when the tokenizer recognizes an ordinal number and when it reads a cardinal number followed by the end of a sentence. I did an experiment:

from tokenizer import split_into_sentences

try_ordinal = "Hæ, þetta er 5. dagurinn, þetta er 51. dagurinn, þetta er 512. dagurinn, þetta er 5123. dagurinn, " + \
              "þetta er 5.234. dagurinn, þetta er 51234. dagurinn, þetta er 52.345. dagurinn, þetta er 512345. " + \
              "dagurinn, þetta er 523.456. dagurinn, þetta er 5123456. dagurinn, þetta er 5.234.567. dagurinn."

>>> list(split_into_sentences(try_ordinal))
['Hæ , þetta er 5. dagurinn , þetta er 51. dagurinn , þetta er 512. dagurinn , þetta er 5123. dagurinn , þetta er 5.234 .',
 'dagurinn , þetta er 51234. dagurinn , þetta er 52.345 .',
 'dagurinn , þetta er 512345. dagurinn , þetta er 523.456 .',
 'dagurinn , þetta er 5123456 .',
 'dagurinn , þetta er 5.234.567 .',
 'dagurinn .']

I've generally tried to require a period after every third digit. If someone writes 8923402, it is recognized as a sequence of digits (a phone number: átta níu tveir þrír fjórir núll tveir, "eight nine two three four zero two", not átta milljónir níu hundruð tuttugu og þrjú þúsund fjögur hundruð og tvö, "eight million nine hundred twenty-three thousand four hundred and two", which would be ridiculous). However, if someone actually writes 8.923.402, they get the millions, because they were clear with the periods. 🙂

So the normalizer recognizes ordinals above 9999. ONLY when they are written with period separators, but the tokenizer wants nothing to do with those. Is there a reason behind this? Of course, my reasoning is only my personal opinion, so I'm very open to discussion. 😊 Have you concluded that no one will ever write such big ordinals? At the very least, I think numbers written with periods for clarity (like 52.345.) should work as well as the other numbers!
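To make the rule above concrete, here is a minimal sketch of how such a classification could look. It is not the normalizer's actual code: the function name `classify`, the labels, and the choice to apply the same four-digit limit to cardinals as to ordinals are all assumptions made for illustration.

```python
import re

# Digits grouped in threes with period separators, e.g. "52.345" or "5.234.567".
GROUPED = re.compile(r"\d{1,3}(?:\.\d{3})+")
# Plain digits with no separators, e.g. "5123" or "8923402".
PLAIN = re.compile(r"\d+")

# ASSUMPTION: ungrouped numbers are read as numbers only up to 4 digits (9999).
MAX_UNGROUPED_DIGITS = 4


def classify(token: str) -> str:
    """Hypothetical helper: label a token as 'ordinal', 'cardinal', 'digits' or 'other'."""
    has_trailing_period = token.endswith(".")
    body = token[:-1] if has_trailing_period else token

    if GROUPED.fullmatch(body):
        # Explicit thousands separators: always read as a number.
        return "ordinal" if has_trailing_period else "cardinal"

    if PLAIN.fullmatch(body):
        if len(body) <= MAX_UNGROUPED_DIGITS:
            return "ordinal" if has_trailing_period else "cardinal"
        # Long ungrouped runs are read digit by digit (e.g. a phone number);
        # a trailing period is then left to be treated as punctuation.
        return "digits"

    return "other"
```

With this sketch, `classify("5123.")` and `classify("52.345.")` both give `"ordinal"`, while `classify("51234.")` and `classify("8923402")` give `"digits"` and `classify("8.923.402")` gives `"cardinal"`.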

Thank you 😁

Holado commented 3 years ago

I've added support for higher ordinals. The main logic behind the former behaviour was that such high ordinals were very unlikely. About the only remaining exceptions are strings such as '5123456', which are interpreted as phone numbers, since that reading is much more likely. A string such as '3123456' is not a valid phone number, so it is interpreted as an ordinal/cardinal number.
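The phone number exception could be sketched roughly as follows. This is only an illustration, not the tokenizer's implementation: the helper name `looks_like_phone_number` is hypothetical, and the set of leading digits is an assumption. The thread itself only establishes that '5123456' counts as a phone number and '3123456' does not.

```python
# ASSUMPTION: a phone-like string has exactly 7 digits and starts with one of
# these digits. Only "5" (valid) and "3" (invalid) are confirmed by the thread.
PHONE_LEADING_DIGITS = "45678"


def looks_like_phone_number(digits: str) -> bool:
    """Hypothetical check: True if an ungrouped digit string reads more naturally as a phone number."""
    return (
        len(digits) == 7
        and digits.isascii()
        and digits.isdigit()
        and digits[0] in PHONE_LEADING_DIGITS
    )
```

Under these assumptions, `looks_like_phone_number("5123456")` is true, so the string would be kept as a phone number, while `looks_like_phone_number("3123456")` is false, so it would fall through to the ordinal/cardinal reading.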

I hope this addresses the problem adequately!