nipunsadvilkar / pySBD

🐍💯pySBD (Python Sentence Boundary Disambiguation) is a rule-based sentence boundary detection that works out-of-the-box.
MIT License
802 stars 83 forks source link

Performance improvement? #41

Closed dakinggg closed 4 years ago

dakinggg commented 5 years ago

I am not certain of this, but I suspect there might be room for performance improvement by using re.compile to precompile all of the needed regexs. Otherwise they will have to be compiled regularly (once the re cache of 100 has been exceeded)

nipunsadvilkar commented 5 years ago

I don't think there will be any perceivable difference unless some of the regexes are in the loop.

Quote from the Python 3 docs for your perusal:

Should you use these module-level functions, or should you get the pattern and call its methods yourself? If you’re accessing a regex within a loop, pre-compiling it will save a few function calls. Outside of loops, there’s not much difference thanks to the internal cache.

dakinggg commented 5 years ago

I think that if there are more than 100 regexs (including any that are in loops like here: https://github.com/nipunsadvilkar/pySBD/blob/a2bb4510e65b1fd6c37e2162005076db0e984fb0/pysbd/abbreviation_replacer.py#L62) the cache will cycle, and regexs will have to be recompiled. Given that the number of ABBREVIATIONS alone is 188, I suspect that the cache is cycling.

dakinggg commented 5 years ago

Here is another regex in a loop: https://github.com/nipunsadvilkar/pySBD/blob/a2bb4510e65b1fd6c37e2162005076db0e984fb0/pysbd/lists_item_replacer.py#L115

nipunsadvilkar commented 5 years ago

Yes, I agree there will be such a regex within a loop. In that case, would you mind tweaking it with precompiled ones and assess the performance? I would love to assist with that.

dakinggg commented 5 years ago

Not sure when I will have time to do it, but I can try at some point

nipunsadvilkar commented 5 years ago

Cool, same here! will let you know if I happen to do this performance exercise.

nipunsadvilkar commented 4 years ago

Fixed #71