mjpost / sacrebleu

Reference BLEU implementation that auto-downloads test sets and reports a version string to facilitate cross-lab comparisons
Apache License 2.0

Memory leak in spm tokenizers #264

Open nkrasner opened 3 months ago

nkrasner commented 3 months ago

Using the flores101 or flores200 tokenizers results in a memory leak. I am using version 2.4.2 on Windows 11, but the same behavior also occurred with version 2.4.0.

Running the following results in memory usage increasing linearly until the process crashes:

```python
import sacrebleu

while True:
    sacrebleu.sentence_bleu("Hello world.", ["Hello world."], tokenize="flores101")
```

The same thing happens with `corpus_bleu`. I do not think it is due to caching, since I am running the same sentence over and over.
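For reference, here is a minimal sketch for watching the growth without waiting for a crash. It assumes `psutil` is installed, which is not a sacrebleu dependency:

```python
# Sketch for observing the leak: print resident memory every 1000 calls.
# Assumes psutil is available (not part of sacrebleu).
import psutil
import sacrebleu

proc = psutil.Process()
for i in range(100_000):
    sacrebleu.sentence_bleu("Hello world.", ["Hello world."], tokenize="flores101")
    if i % 1000 == 0:
        rss_mib = proc.memory_info().rss / (1024 * 1024)
        print(f"iteration {i}: RSS = {rss_mib:.1f} MiB")
```

If the tokenizer were being reused across calls, I would expect RSS to plateau after the first few iterations; instead it grows steadily, which suggests objects are accumulating on every call.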