savoirfairelinux / num2words

Modules to convert numbers to words. 42 --> forty-two
GNU Lesser General Public License v2.1

Improve Turkish implementation #534

Open gorkemgoknar opened 11 months ago

gorkemgoknar commented 11 months ago

Expected Behaviour

To avoid changing the default behaviour, there should be an override that adds spaces between words (the current joined output fails in TTS).

Actual Behaviour

num2words(12455544, lang="tr") 'onikimilyondörtyüzellibeşbinbeşyüzkırkdört'

num2words(1.5, lang="tr") 'birvirgülelli'

num2words(1.2455544, lang="tr") 'birvirgülyirmidört' (the digits after the second decimal place are missing)

In English: num2words(1.2455544, lang="en") 'one point two four five five five four four'

Steps to reproduce

num2words(12455544, lang="tr") 'onikimilyondörtyüzellibeşbinbeşyüzkırkdört'

num2words(1.5, lang="tr") 'birvirgülelli'

num2words(1.2455544, lang="tr") 'birvirgülyirmidört'
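A minimal sketch of the requested override, assuming a separator parameter that defaults to the current joined output. This is not the num2words implementation: `tr_int_to_words` and `sep` are hypothetical names used only for illustration, and the sketch covers non-negative integers only.

```python
# Illustrative sketch only, not num2words code.
# `tr_int_to_words` and `sep` are hypothetical names.

ONES = ["", "bir", "iki", "üç", "dört", "beş", "altı", "yedi", "sekiz", "dokuz"]
TENS = ["", "on", "yirmi", "otuz", "kırk", "elli", "altmış", "yetmiş", "seksen", "doksan"]
SCALES = ["", "bin", "milyon", "milyar", "trilyon"]


def _group_words(n):
    """Return the words for 0 <= n < 1000 (an empty list for 0)."""
    words = []
    hundreds, rest = divmod(n, 100)
    tens, ones = divmod(rest, 10)
    if hundreds:
        if hundreds > 1:  # 100 is "yüz", not "bir yüz"
            words.append(ONES[hundreds])
        words.append("yüz")
    if tens:
        words.append(TENS[tens])
    if ones:
        words.append(ONES[ones])
    return words


def tr_int_to_words(n, sep=""):
    """Spell a non-negative integer in Turkish, joining the words with `sep`."""
    if n < 0 or n >= 1000 ** len(SCALES):
        raise ValueError("out of range for this sketch")
    if n == 0:
        return "sıfır"
    groups = []  # three-digit groups, least significant first
    while n:
        n, group = divmod(n, 1000)
        groups.append(group)
    words = []
    for index in range(len(groups) - 1, -1, -1):
        group = groups[index]
        if group == 0:
            continue
        if not (index == 1 and group == 1):  # 1000 is "bin", not "bir bin"
            words.extend(_group_words(group))
        if SCALES[index]:
            words.append(SCALES[index])
    return sep.join(words)


print(tr_int_to_words(12455544))           # joined, as num2words does today
print(tr_int_to_words(12455544, sep=" "))  # spaced, e.g. for TTS
```

Defaulting `sep` to the empty string keeps the joined output shown above, so existing callers would see no change unless they opt in.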

gorkemgoknar commented 11 months ago

VNLP handles this better; perhaps num2words could adopt that implementation, which offers an option for spelling the comma or dot separator:

https://github.com/vngrs-ai/vnlp/blob/b5011692c997b9d110827421d491bb3492d3b5dd/vnlp/normalizer/normalizer.py#L200

from vnlp import Normalizer

normalizer = Normalizer()

# Treat "." as the decimal separator.
normalizer.convert_numbers_to_words(["1.523233351"], decimal_seperator=".")
# ['bir', 'virgül', 'beş', 'yüz', 'yirmi', 'üç', 'bin', 'iki', 'yüz', 'otuz', 'üç']

# With the default separator the result is incorrect: the input is read as one very large number.
# It would be better to spell out the separator instead, so num2words could adopt this
# approach, with an option to join the words with or without spaces.
normalizer.convert_numbers_to_words(["1.523233351"])
# ['bir', 'katrilyon', 'beş', 'yüz', 'yirmi', 'üç', 'milyon', 'iki', 'yüz', 'otuz', 'üç', 'bin', 'üç', 'yüz', 'elli', 'bir']

gorkemgoknar commented 11 months ago

From TDK (the Turkish Language Association): https://tdk.gov.tr/icerik/yazim-kurallari/sayilarin-yazilisi/#:~:text=Birden%20fazla%20kelimeden%20olu%C5%9Fan%20say%C4%B1lar,35%20(alt%C4%B1y%C3%BCzelliTL%2Cotuzbe%C5%9Fkr.)

  1. Numbers consisting of more than one word are written separately: iki yüz, üç yüz altmış beş, bin iki yüz elli bir, etc.

  2. Numbers in monetary transactions and in commercial documents such as promissory notes and cheques are written joined together: 650,35 (altıyüzelliTL,otuzbeşkr.)

So for Turkish, num2words currently follows only the "monetary/currency" convention (clause 3, joined writing), not the convention for spelling out numbers in ordinary text.
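The two TDK conventions differ only in how the individual number words are joined, which can be sketched as a small helper. `join_number_words` and its `style` values are hypothetical names for illustration, not part of num2words or VNLP.

```python
# Illustrative sketch only; the function name and the `style` values are assumptions.

def join_number_words(words, style="text"):
    """Join Turkish number words according to one of the two TDK conventions."""
    if style == "text":
        # Ordinary text: multi-word numbers are written separately.
        return " ".join(words)
    if style == "monetary":
        # Monetary and commercial documents: written joined together.
        return "".join(words)
    raise ValueError("style must be 'text' or 'monetary'")


words = ["üç", "yüz", "altmış", "beş"]
print(join_number_words(words))                    # üç yüz altmış beş
print(join_number_words(words, style="monetary"))  # üçyüzaltmışbeş
```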

Edit: This needs a fix too: a zero right after the decimal point is not spelled out, e.g. num2words(84003.01, lang='tr').
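One way to keep leading zeros and every decimal digit is to spell the fractional part digit by digit, as the English output quoted earlier does. The sketch below assumes that reading is acceptable for Turkish and takes the number as text so that no digits are lost to float formatting. `tr_spell_fraction` is a hypothetical helper, not a num2words function.

```python
# Illustrative sketch only; `tr_spell_fraction` is a hypothetical name.
# Assumption: digit-by-digit reading of the fraction, mirroring the English output.

DIGITS = ["sıfır", "bir", "iki", "üç", "dört", "beş", "altı", "yedi", "sekiz", "dokuz"]


def tr_spell_fraction(number_text, sep=" "):
    """Spell the part after the decimal point, one digit at a time."""
    _, point, fraction = number_text.partition(".")
    if not point or not fraction or any(ch not in "0123456789" for ch in fraction):
        raise ValueError("expected text such as '84003.01'")
    # "virgül" (comma) is the word num2words already uses for the separator.
    return sep.join(["virgül"] + [DIGITS[int(ch)] for ch in fraction])


print(tr_spell_fraction("84003.01"))   # virgül sıfır bir
print(tr_spell_fraction("1.2455544"))  # virgül iki dört beş beş beş dört dört
```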

mrodriguezg1991 commented 11 months ago

Hello, I am currently maintaining the project, but I don't speak the language. If you have the time, you could submit an MR and we can include it in the next release. Thanks.

gorkemgoknar commented 11 months ago

Sure, I will improve it when I have time. One question though: I think it would be wise to make these changes overridable/optional, since I assume some people rely on the current behaviour, though I cannot confirm that.