microsoft / onnxruntime-extensions

onnxruntime-extensions: A specialized pre- and post- processing library for ONNX Runtime
MIT License
295 stars 80 forks source link

Unify the spm/bpe tokenizers #666

Closed wenbingl closed 4 months ago

wenbingl commented 4 months ago

561 fixed

wenbingl commented 4 months ago

All looks good. My only comment is that KernelBpeTokenizer::SpmTokenize does not support fairseq or computing instance indices at the moment.

But I noticed you added the SPM implementation to bpe_kernels.cc but did not delete sentencepiece_tokenizer.cc, sentencepiece_tokenizer.hpp, or sentencepiece_decoder.hpp. I assume we will do this in the future as that is the main incentive of this unification, and when we do, we can also take care of the remaining SPM stuff (fairseq and instance indices) in the bpe source.

There will be a transition period. The old sentencepiece tokenizer code needs to be kept for Custom-Op backward compatibility. The faireseq indices will be added in the future.