Closed wenbingl closed 4 months ago
All looks good. My only comment is that
KernelBpeTokenizer::SpmTokenize
does not support fairseq or computing instance indices at the moment.But I noticed you added the SPM implementation to bpe_kernels.cc but did not delete sentencepiece_tokenizer.cc, sentencepiece_tokenizer.hpp, or sentencepiece_decoder.hpp. I assume we will do this in the future as that is the main incentive of this unification, and when we do, we can also take care of the remaining SPM stuff (fairseq and instance indices) in the bpe source.
There will be a transition period. The old sentencepiece tokenizer code needs to be kept for Custom-Op backward compatibility. The faireseq indices will be added in the future.
561 fixed