nipunsadvilkar / pySBD

πŸπŸ’―pySBD (Python Sentence Boundary Disambiguation) is a rule-based sentence boundary detection that works out-of-the-box.
MIT License
782 stars, 82 forks

Performance improvements #70

Closed: nipunsadvilkar closed this 4 years ago

nipunsadvilkar commented 4 years ago

As mentioned in #41, abbreviation_replacer.py takes too long and needs to be refactored to improve performance.
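For illustration only, here is one common way this kind of replacer gets faster: scan the text once with a single precompiled alternation instead of once per abbreviation. This is a minimal sketch, not the code in abbreviation_replacer.py and not necessarily what was eventually merged. The abbreviation list, the `<prd>` placeholder, and both function names are made up for the example.

```python
import re

# Hypothetical abbreviation list; real per-language lists are much longer.
ABBREVIATIONS = ["dr", "mr", "mrs", "e.g", "i.e", "vs"]

# Hypothetical placeholder that stands in for a period that must not end a sentence.
PLACEHOLDER = "<prd>"


def protect_naive(text):
    """One regex pass over the whole text per abbreviation."""
    for abbr in ABBREVIATIONS:
        pattern = r"\b(%s)\." % re.escape(abbr)
        text = re.sub(pattern, r"\1" + PLACEHOLDER, text, flags=re.IGNORECASE)
    return text


# Build a single alternation once, longest entries first so that a longer
# abbreviation is tried before any shorter one that shares its prefix.
_COMBINED = re.compile(
    r"\b("
    + "|".join(re.escape(a) for a in sorted(ABBREVIATIONS, key=len, reverse=True))
    + r")\.",
    re.IGNORECASE,
)


def protect_combined(text):
    """A single regex pass over the text, regardless of list size."""
    return _COMBINED.sub(r"\1" + PLACEHOLDER, text)


if __name__ == "__main__":
    sample = "Dr. Smith met Mr. Jones, i.e. the owner. They left."
    print(protect_naive(sample))
    print(protect_combined(sample))
```

Both functions mark the same periods on the sample. The difference is that the naive version's cost grows with the number of abbreviations, while the combined version makes one pass over the text.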

Speed Benchmark on bigger text file

| Tool | Time (ms) |
|---|---|
| blingfire_tokenize | 55.63 |
| nltk_tokenize | 198.17 |
| pysbd_tokenize | 12846.23 |
| spacy_tokenize | 741.54 |
| spacy_dep_tokenize | 17642.21 |
| stanza_tokenize | 35623.08 |
| syntok_tokenize | 1455.21 |

Text file used: http://www.gutenberg.org/files/1661/1661-0.txt

wget http://www.gutenberg.org/files/1661/1661-0.txt -P benchmarks/
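A timing harness along these lines can reproduce this kind of measurement. This is a rough sketch, not the benchmark script used for the numbers above. `naive_split` is a hypothetical stand-in tokenizer so the snippet runs without extra dependencies, and the file path assumes the `wget` command above was run from the repository root.

```python
import re
import statistics
import time
from pathlib import Path


def naive_split(text):
    """Hypothetical stand-in tokenizer: split on whitespace after ., ! or ?."""
    return re.split(r"(?<=[.!?])\s+", text.strip())


def benchmark(tokenize, text, repeats=3):
    """Return (median wall-clock time in ms, sentence count) over `repeats` runs."""
    timings = []
    sentences = []
    for _ in range(repeats):
        start = time.perf_counter()
        sentences = tokenize(text)
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings), len(sentences)


if __name__ == "__main__":
    # Assumed location of the downloaded file; fall back to a tiny sample if absent.
    path = Path("benchmarks/1661-0.txt")
    if path.exists():
        text = path.read_text(encoding="utf-8")
    else:
        text = "One sentence. Another one! A third?"

    # To time pySBD instead, pass its segmenter's method as the callable, e.g.
    #   import pysbd
    #   tokenize = pysbd.Segmenter(language="en", clean=False).segment
    ms, count = benchmark(naive_split, text)
    print(f"naive_split: {ms:.2f} ms, {count} sentences")
```

Taking the median over a few runs keeps a single slow run from skewing the result.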

codecov-commenter commented 4 years ago

Codecov Report

Merging #70 into master will decrease coverage by 0.17%. The diff coverage is 100.00%.


@@            Coverage Diff             @@
##           master      #70      +/-   ##
==========================================
- Coverage   98.30%   98.13%   -0.18%     
==========================================
  Files          37       37              
  Lines        1063     1071       +8     
==========================================
+ Hits         1045     1051       +6     
- Misses         18       20       +2     
| Flag | Coverage Δ | |
|---|---|---|
| #unittests | 98.13% <100.00%> (-0.18%) | :arrow_down: |

| Impacted Files | Coverage Δ | |
|---|---|---|
| pysbd/utils.py | 72.41% <ø> (-0.92%) | :arrow_down: |
| pysbd/abbreviation_replacer.py | 100.00% <100.00%> (ø) | |
| pysbd/lang/bulgarian.py | 100.00% <100.00%> (ø) | |
| pysbd/lang/common/standard.py | 100.00% <100.00%> (ø) | |
| pysbd/lang/deutsch.py | 100.00% <100.00%> (ø) | |
| pysbd/lang/italian.py | 100.00% <100.00%> (ø) | |
| pysbd/lang/russian.py | 100.00% <100.00%> (ø) | |
| pysbd/languages.py | 96.87% <100.00%> (ø) | |
| pysbd/segmenter.py | 100.00% <100.00%> (ø) | |
| pysbd/lang/arabic.py | 90.47% <0.00%> (-9.53%) | :arrow_down: |


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update e6c596f...97ff5c2.

nipunsadvilkar commented 4 years ago

Used the approach from #71 instead.