Closed dakinggg closed 5 years ago
LGTM! Yes, I concur there are a lot of inefficient loops and calls happening in abbreviation_replacer.py
. Since I was porting it from ruby, I thought of keeping the same loops and calls in python to get tests to pass. In a few cases, I tried to be idiomatic wherever possible, but still, there are a lot of places where we can surely improve pysbd.
It would be nice if you could list those speed up calls in the comment here so we can tackle them one by one.
I was focusing on the parts that were the majority of the time spent. Figured no need to optimize parts that weren't taking up much of the time. Here is some of the output from cProfile
:
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 1.484 1.484 <string>:1(<module>)
366 0.003 0.000 0.258 0.001 abbreviation_replacer.py:10(replace_pre_number_abbr)
1492 0.004 0.000 1.145 0.001 abbreviation_replacer.py:103(scan_for_replacements)
222 0.002 0.000 0.157 0.001 abbreviation_replacer.py:19(replace_prepositive_abbr)
904 0.007 0.000 0.726 0.001 abbreviation_replacer.py:28(replace_period_of_abbr)
1 0.000 0.000 0.013 0.013 abbreviation_replacer.py:43(replace_abbreviation_as_sentence_boundary)
1 0.000 0.000 0.000 0.000 abbreviation_replacer.py:54(__init__)
1 0.000 0.000 1.400 1.400 abbreviation_replacer.py:58(replace)
1 0.000 0.000 0.001 0.001 abbreviation_replacer.py:70(replace_multi_period_abbreviations)
4 0.000 0.000 0.000 0.000 abbreviation_replacer.py:71(mpa_replace)
1 0.005 0.005 1.380 1.380 abbreviation_replacer.py:83(search_for_abbreviations_in_string)
Didn't want to cause code churn/potentially introduce bugs by over optimizing other pieces of code. I also don't actually have clear ideas at the moment for how to make those slow pieces even faster.
Any hesitation on releasing this as is?
Although, I wanted to make more optimizations in abbreviation_replacer.py
but didn't find much time to do it thoroughly so releasing it for now and we can do further optimizations in the future.
Just to segregate scripts as per purpose, moving your test_timing_script.py
to examples
for further reference.
So calls to
re.compile
are not a problem. The main thing slowing it down is lots of calls tore.sub
inabbreviation_replacer.py
. I reduced some of these calls which speeds it up by a factor of ~3-3.5x on my machine, for the specific (longish) document that I tested with. I also included the script I used to test timing. Given that you are much more familiar with the codebase, see if my changes look reasonable, but all the tests do still pass. There are probably some more ways to speed up the calls in that file.