nipunsadvilkar / pySBD

🐍💯pySBD (Python Sentence Boundary Disambiguation) is a rule-based sentence boundary detection that works out-of-the-box.
MIT License
813 stars 84 forks source link

Reduce some calls to re.sub #50

Closed dakinggg closed 5 years ago

dakinggg commented 5 years ago

So calls to re.compile are not a problem. The main thing slowing it down is lots of calls to re.sub in abbreviation_replacer.py. I reduced some of these calls which speeds it up by a factor of ~3-3.5x on my machine, for the specific (longish) document that I tested with. I also included the script I used to test timing. Given that you are much more familiar with the codebase, see if my changes look reasonable, but all the tests do still pass. There are probably some more ways to speed up the calls in that file.

nipunsadvilkar commented 5 years ago

LGTM! Yes, I concur there are a lot of inefficient loops and calls happening in abbreviation_replacer.py. Since I was porting it from ruby, I thought of keeping the same loops and calls in python to get tests to pass. In a few cases, I tried to be idiomatic wherever possible, but still, there are a lot of places where we can surely improve pysbd.

It would be nice if you could list those speed up calls in the comment here so we can tackle them one by one.

dakinggg commented 5 years ago

I was focusing on the parts that were the majority of the time spent. Figured no need to optimize parts that weren't taking up much of the time. Here is some of the output from cProfile:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    1.484    1.484 <string>:1(<module>)
      366    0.003    0.000    0.258    0.001 abbreviation_replacer.py:10(replace_pre_number_abbr)
     1492    0.004    0.000    1.145    0.001 abbreviation_replacer.py:103(scan_for_replacements)
      222    0.002    0.000    0.157    0.001 abbreviation_replacer.py:19(replace_prepositive_abbr)
      904    0.007    0.000    0.726    0.001 abbreviation_replacer.py:28(replace_period_of_abbr)
        1    0.000    0.000    0.013    0.013 abbreviation_replacer.py:43(replace_abbreviation_as_sentence_boundary)
        1    0.000    0.000    0.000    0.000 abbreviation_replacer.py:54(__init__)
        1    0.000    0.000    1.400    1.400 abbreviation_replacer.py:58(replace)
        1    0.000    0.000    0.001    0.001 abbreviation_replacer.py:70(replace_multi_period_abbreviations)
        4    0.000    0.000    0.000    0.000 abbreviation_replacer.py:71(mpa_replace)
        1    0.005    0.005    1.380    1.380 abbreviation_replacer.py:83(search_for_abbreviations_in_string)

Didn't want to cause code churn/potentially introduce bugs by over optimizing other pieces of code. I also don't actually have clear ideas at the moment for how to make those slow pieces even faster.

dakinggg commented 5 years ago

Any hesitation on releasing this as is?

nipunsadvilkar commented 5 years ago

Although, I wanted to make more optimizations in abbreviation_replacer.py but didn't find much time to do it thoroughly so releasing it for now and we can do further optimizations in the future.

Just to segregate scripts as per purpose, moving your test_timing_script.py to examples for further reference.