loodos / zemberek-python

Python implementation of Zemberek

Runtime for normalizer increases exponentially with input length and letter `ı` #2

Closed: yakzan closed this issue 2 years ago

yakzan commented 3 years ago

The sentence normalizer works perfectly on short sentences, but it seems to choke once the input grows past a threshold of roughly 30 words, as shown below:

```python
from zemberek import (
    TurkishSentenceNormalizer,
    TurkishMorphology
)

morphology = TurkishMorphology.create_with_defaults()
normalizer = TurkishSentenceNormalizer(morphology)

str_1 = "ABD'nin Louisiana eyaleti ve çevresinde yüz binlerce ev ve iş yerini elektriksiz bırakan Delta Kasırgası'nın bu kez Güney ve Kuzey Carolina eyaletlerini tehdit ettiği açıklandı. "

str_2 = " Louisiana eyaleti ve çevresinde dün 500 bin civarında ev ve iş yerinde elektrik kesintisine sebep olan kasırganın bölgede etkisini azaltarak ülkenin güneydoğu eyaletlerini tehdit etmeye başladığı bildirildi."

normalizer.normalize(str_1)          # 188 ms
normalizer.normalize(str_2)          # 176 ms
normalizer.normalize(str_1 + str_2)  # 37.6 s
```

Furthermore, adding the two words `Bunun ıcın` to `str_2` increases the runtime to 3 minutes 16 seconds.
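To make the growth visible, a timing harness along these lines can be run over prefixes of the concatenated text (a sketch only; the prefix slicing and step size are illustrative choices, not part of the report above, and `str_1`/`str_2` are the strings defined in the snippet):

```python
import time

from zemberek import TurkishMorphology, TurkishSentenceNormalizer

morphology = TurkishMorphology.create_with_defaults()
normalizer = TurkishSentenceNormalizer(morphology)

# str_1 and str_2 as defined in the snippet above.
words = (str_1 + str_2).split()

# Time normalization on progressively longer prefixes of the input.
for n in range(10, len(words) + 1, 10):
    prefix = " ".join(words[:n])
    start = time.perf_counter()
    normalizer.normalize(prefix)
    elapsed = time.perf_counter() - start
    print(f"{n:3d} words -> {elapsed:.2f} s")
```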

yakzan commented 3 years ago

Here are the results from the `%prun` magic:

```
         35027736 function calls (35022153 primitive calls) in 47.196 seconds

   Ordered by: cumulative time
   List reduced from 246 to 10 due to restriction <10>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   47.196   47.196 {built-in method builtins.exec}
        1    0.000    0.000   47.196   47.196 :1()
        1    0.351    0.351   47.196   47.196 turkish_sentence_normalizer.py:114(normalize)
        1    5.612    5.612   46.720   46.720 turkish_sentence_normalizer.py:166(decode)
  1570984    0.985    0.000   37.300    0.000 smooth_lm.py:130(get_probability)
  1570984    1.181    0.000   36.126    0.000 smooth_lm.py:149(get_bigram_probability)
  1570984    3.152    0.000   34.944    0.000 smooth_lm.py:159(get_bigram_probability_value)
  1570984   12.793    0.000   14.242    0.000 gram_data_array.py:76(check_finger_print)
  1570984    6.602    0.000   10.330    0.000 large_ngrammphf.py:38(get)
   845115    3.247    0.000    3.732    0.000 gram_data_array.py:48(get_probability_rank)
```

```
         164042 function calls (161538 primitive calls) in 0.175 seconds

   Ordered by: cumulative time
   List reduced from 244 to 10 due to restriction <10>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.175    0.175 {built-in method builtins.exec}
        1    0.000    0.000    0.175    0.175 :1()
        1    0.001    0.001    0.175    0.175 turkish_sentence_normalizer.py:114(normalize)
        1    0.014    0.014    0.112    0.112 turkish_sentence_normalizer.py:166(decode)
     3240    0.003    0.000    0.092    0.000 smooth_lm.py:130(get_probability)
     3240    0.003    0.000    0.089    0.000 smooth_lm.py:149(get_bigram_probability)
     3240    0.008    0.000    0.086    0.000 smooth_lm.py:159(get_bigram_probability_value)
     3240    0.033    0.000    0.036    0.000 gram_data_array.py:76(check_finger_print)
    98/49    0.001    0.000    0.032    0.001 word_generator.py:25(generate)
     3240    0.017    0.000    0.026    0.000 large_ngrammphf.py:38(get)
```
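For anyone reproducing these numbers outside of IPython, a rough standard-library equivalent of `%prun` (assuming `normalizer`, `str_1`, and `str_2` from the snippet in the first comment) looks like this:

```python
import cProfile
import pstats

# Profile the slow call; %prun does essentially the same thing.
profiler = cProfile.Profile()
profiler.enable()
normalizer.normalize(str_1 + str_2)
profiler.disable()

# Print the 10 most expensive calls by cumulative time, as in the output above.
stats = pstats.Stats(profiler).sort_stats("cumulative")
stats.print_stats(10)
```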

harun-loodos commented 3 years ago

I will first compare it with the original Java code, then check the normalizer's methods. Thanks for the issue.

yakzan commented 3 years ago

I ran the Java version on a 40K-entry input and it completed in less than a minute. The Python version ran for hours before I interrupted the kernel to investigate, which led me to the examples above. Hope it helps, thanks :)

Winvoker commented 2 years ago

Still having the issue. Any updates? @harun-loodos

harun-loodos commented 2 years ago

I think I found and fixed the problem in #11. It was due to my misinterpretation of a Scorable List implementation in Java. Sorry for the very late response :) I think it is good to go. I also released a new version, v0.2.1, with the fix and more. Closing this issue for now; feel free to reopen if you still have the problem.
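The actual patch is in #11, but as an illustration of why the decode step can blow up: if candidate hypotheses are never pruned, the number of bigram lookups grows multiplicatively with sentence length, which matches the `get_bigram_probability` call counts in the profile above. The sketch below is purely illustrative (it is not the code from #11, and the class and method names are made up); it only shows what a score-bounded hypothesis list does:

```python
import heapq


class ScoredHypothesisList:
    """Keep only the best `capacity` hypotheses by score.

    Illustrative sketch only: bounding the list keeps the number of
    surviving hypotheses (and therefore language-model lookups) from
    growing without limit as the sentence gets longer.
    """

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._heap = []  # min-heap, so the worst kept hypothesis sits at index 0

    def add(self, score: float, hypothesis: str) -> None:
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, (score, hypothesis))
        elif score > self._heap[0][0]:
            # Replace the currently worst kept hypothesis with the better one.
            heapq.heapreplace(self._heap, (score, hypothesis))

    def best_first(self):
        # Hypotheses ordered from best to worst score.
        return sorted(self._heap, reverse=True)
```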