Unify the spm/bpe tokenizers

All looks good. My only comment is that KernelBpeTokenizer::SpmTokenize does not support fairseq or computing instance indices at the moment.

But I noticed you added the SPM implementation to bpe_kernels.cc but did not delete sentencepiece_tokenizer.cc, sentencepiece_tokenizer.hpp, or sentencepiece_decoder.hpp. I assume we will do this in the future as that is the main incentive of this unification, and when we do, we can also take care of the remaining SPM stuff (fairseq and instance indices) in the bpe source.

There will be a transition period. The old sentencepiece tokenizer code needs to be kept for Custom-Op backward compatibility. The faireseq indices will be added in the future.

microsoft / onnxruntime-extensions

Unify the spm/bpe tokenizers #666

561 fixed