nlpodyssey / cybertron

Cybertron: the home planet of the Transformers in Go
BSD 2-Clause "Simplified" License

tokenizers/sentencepiece: Improve performance by removing allocations #42

Closed damz closed 2 months ago

damz commented 2 months ago

The sentencepiece tokenizer is pretty slow, mostly because it allocates heavily. This PR removes some of the low-hanging-fruit allocations.

goos: linux
goarch: amd64
pkg: github.com/nlpodyssey/cybertron/pkg/tokenizers/sentencepiece/internal/sentencepiece
cpu: Intel(R) Xeon(R) CPU E5-2697 v4 @ 2.30GHz
                                      │   old.txt    │               new.txt               │
                                      │    sec/op    │   sec/op     vs base                │
SentencePiece/compose_email_to_joh-36   2235.6µ ± 1%   147.1µ ± 1%  -93.42% (p=0.000 n=10)

                                      │    old.txt     │               new.txt                │
                                      │      B/op      │     B/op      vs base                │
SentencePiece/compose_email_to_joh-36   3289.19Ki ± 0%   27.02Ki ± 0%  -99.18% (p=0.000 n=10)

                                      │   old.txt    │              new.txt               │
                                      │  allocs/op   │ allocs/op   vs base                │
SentencePiece/compose_email_to_joh-36   1830.00 ± 0%   92.00 ± 0%  -94.97% (p=0.000 n=10)
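
For readers unfamiliar with this kind of optimization, here is a minimal sketch of one common way to cut allocations in Go: letting the caller pass in a reusable buffer instead of building a fresh slice on every call. It is not the code from this PR. The names `appendWords`, `splitFresh`, `splitInto`, and `measure` are hypothetical and do not exist in cybertron, and the whitespace splitter is only a stand-in for real tokenization. Allocation counts are taken with `testing.AllocsPerRun` from the standard library.

```go
package main

import (
	"fmt"
	"strings"
	"testing"
)

// sink keeps results reachable so the compiler cannot optimize the work away.
var sink []string

// appendWords appends the space-separated words of s to dst and returns it.
// Substrings share memory with s, so only growing dst can allocate.
// (Hypothetical helper for illustration, not part of cybertron.)
func appendWords(dst []string, s string) []string {
	for len(s) > 0 {
		i := strings.IndexByte(s, ' ')
		if i < 0 {
			return append(dst, s)
		}
		if i > 0 {
			dst = append(dst, s[:i])
		}
		s = s[i+1:]
	}
	return dst
}

// splitFresh builds a new result slice on every call.
func splitFresh(s string) []string {
	return appendWords(nil, s)
}

// splitInto truncates dst and reuses its backing array for the result.
func splitInto(dst []string, s string) []string {
	return appendWords(dst[:0], s)
}

// measure reports the average number of heap allocations per call
// for the fresh-slice variant and the reused-buffer variant.
func measure(input string) (fresh, reuse float64) {
	fresh = testing.AllocsPerRun(100, func() {
		sink = splitFresh(input)
	})
	buf := make([]string, 0, 64) // allocated once, outside the measured function
	reuse = testing.AllocsPerRun(100, func() {
		buf = splitInto(buf, input)
		sink = buf
	})
	return fresh, reuse
}

func main() {
	fresh, reuse := measure("the quick brown fox jumps over the lazy dog")
	fmt.Printf("fresh: %.0f allocs/op, reused buffer: %.0f allocs/op\n", fresh, reuse)
}
```

The trade-off of the reused-buffer variant is that the returned slice is only valid until the next call with the same buffer, so callers that need to keep a result must copy it.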

codecov-commenter commented 2 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 40.53%. Comparing base (d0c62f8) to head (88b93f4).

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main      #42      +/-   ##
==========================================
- Coverage   40.65%   40.53%   -0.13%
==========================================
  Files          16       16
  Lines        1429     1426       -3
==========================================
- Hits          581      578       -3
  Misses        826      826
  Partials       22       22
```
