**Closed** · yakzan closed this issue 2 years ago
Here are the results from the `%prun` magic:
```
35027736 function calls (35022153 primitive calls) in 47.196 seconds

   Ordered by: cumulative time
   List reduced from 246 to 10 due to restriction <10>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   47.196   47.196 {built-in method builtins.exec}
        1    0.000    0.000   47.196   47.196 <string>:1(<module>)
        1    0.351    0.351   47.196   47.196 turkish_sentence_normalizer.py:114(normalize)
        1    5.612    5.612   46.720   46.720 turkish_sentence_normalizer.py:166(decode)
  1570984    0.985    0.000   37.300    0.000 smooth_lm.py:130(get_probability)
  1570984    1.181    0.000   36.126    0.000 smooth_lm.py:149(get_bigram_probability)
  1570984    3.152    0.000   34.944    0.000 smooth_lm.py:159(get_bigram_probability_value)
  1570984   12.793    0.000   14.242    0.000 gram_data_array.py:76(check_finger_print)
  1570984    6.602    0.000   10.330    0.000 large_ngrammphf.py:38(get)
   845115    3.247    0.000    3.732    0.000 gram_data_array.py:48(get_probability_rank)


164042 function calls (161538 primitive calls) in 0.175 seconds

   Ordered by: cumulative time
   List reduced from 244 to 10 due to restriction <10>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.175    0.175 {built-in method builtins.exec}
        1    0.000    0.000    0.175    0.175 <string>:1(<module>)
        1    0.001    0.001    0.175    0.175 turkish_sentence_normalizer.py:114(normalize)
        1    0.014    0.014    0.112    0.112 turkish_sentence_normalizer.py:166(decode)
     3240    0.003    0.000    0.092    0.000 smooth_lm.py:130(get_probability)
     3240    0.003    0.000    0.089    0.000 smooth_lm.py:149(get_bigram_probability)
     3240    0.008    0.000    0.086    0.000 smooth_lm.py:159(get_bigram_probability_value)
     3240    0.033    0.000    0.036    0.000 gram_data_array.py:76(check_finger_print)
    98/49    0.001    0.000    0.032    0.001 word_generator.py:25(generate)
     3240    0.017    0.000    0.026    0.000 large_ngrammphf.py:38(get)
```
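As a side note, reports like these can be reproduced outside IPython with the standard-library profiler. A minimal sketch, using a stand-in hot function (not zemberek code) just to show the mechanics of producing a "top 10 by cumulative time" view:

```python
import cProfile
import io
import pstats

def hot_loop(n):
    # Stand-in for the normalizer: cost dominated by many cheap calls,
    # analogous to the ~1.57M get_bigram_probability() calls above.
    total = 0.0
    for i in range(n):
        total += (i % 97) / 97.0
    return total

profiler = cProfile.Profile()
profiler.enable()
hot_loop(100_000)
profiler.disable()

stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("cumulative").print_stats(10)  # same "top 10 by cumtime" view as %prun
report = stream.getvalue()
print(report)
```

To profile the real thing, replace `hot_loop(100_000)` with `normalizer.normalize(...)`.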
I am going to have to compare it to the original Java code first, then check the normalizer's methods. Thanks for the issue.
I've run the Java version on a 40K-entry input and it completed in less than a minute; then I switched to the Python version, and it ran for hours before I interrupted the kernel to investigate further, which is when I found the examples above. Hope it helps you, thanks :)
Still having the issue. Any updates? @harun-loodos
I think I found and fixed the problem in #11. It was due to my misinterpretation of a scorable-list implementation in the original Java code. Sorry for the very late response :) I think it is good to go now. I also released a new version, v0.2.1, with the fix and more. Closing this issue for now; feel free to reopen if you still have problems.
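For readers hitting similar slowdowns elsewhere: a common way a misread scored-list implementation produces this kind of blowup is keeping every hypothesis (or re-sorting the full list on each insert) instead of maintaining a bounded best-N structure, which makes decoding degrade superlinearly with sentence length. A hypothetical sketch of the bounded variant (this is illustrative only, not the actual zemberek or PR code), using a min-heap so each insert is O(log capacity):

```python
import heapq

class ScoredList:
    """Keep only the best `capacity` (score, item) hypotheses.

    A min-heap holds the worst surviving hypothesis at index 0, so an
    incoming candidate either replaces it in O(log capacity) or is dropped.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self._heap = []  # min-heap of (score, item)

    def add(self, score, item):
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, (score, item))
        elif score > self._heap[0][0]:
            # Candidate beats the worst survivor: swap it in.
            heapq.heapreplace(self._heap, (score, item))

    def best_first(self):
        # Highest-scoring hypotheses first.
        return [item for score, item in sorted(self._heap, reverse=True)]

beam = ScoredList(capacity=3)
for score, word in [(0.1, "a"), (0.9, "b"), (0.5, "c"), (0.7, "d")]:
    beam.add(score, word)
print(beam.best_first())  # ['b', 'd', 'c']
```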
The sentence normalizer works perfectly with short sentences, but it seems to choke once the input grows beyond a threshold of roughly 30 words, as shown below:
```python
from zemberek import (
    TurkishSentenceNormalizer,
    TurkishMorphology,
)

morphology = TurkishMorphology.create_with_defaults()
normalizer = TurkishSentenceNormalizer(morphology)

str_1 = ("ABD'nin Louisiana eyaleti ve çevresinde yüz binlerce ev ve iş yerini "
         "elektriksiz bırakan Delta Kasırgası'nın bu kez Güney ve Kuzey Carolina "
         "eyaletlerini tehdit ettiği açıklandı. ")
str_2 = (" Louisiana eyaleti ve çevresinde dün 500 bin civarında ev ve iş yerinde "
         "elektrik kesintisine sebep olan kasırganın bölgede etkisini azaltarak "
         "ülkenin güneydoğu eyaletlerini tehdit etmeye başladığı bildirildi.")

normalizer.normalize(str_1)          # 188 ms
normalizer.normalize(str_2)          # 176 ms
normalizer.normalize(str_1 + str_2)  # 37.6 s
```
Furthermore, adding the two words `Bunun ıcın` to `str_2` increases the runtime to 3 minutes 16 seconds.