I've exported some HF SentencePiece tokenizers that rely on the fairseq support, using the SentencepieceTokenizer op with the fairseq flag set as an input to the op. This works when I'm tokenizing a single element, but fails for batches (I have logic elsewhere in my ONNX export that appropriately batches the sentencepiece output and adds the right padding). The issue is that the fairseq transformation here runs from content.begin() to content.end(), but it sits inside the loop over input strings, so it is applied multiple times to the earlier strings. I think the fix is just to move the fairseq correction outside the for loop (guarded by add_rev as well), and I can send a PR for that if you want.
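Here's a minimal sketch of the failure mode. The function names are made up for illustration, and the real fairseq correction also remaps the special tokens; I've replaced it with a plain +1 offset just to show the structural bug of running the correction over the whole content buffer inside the per-string loop:

```cpp
#include <cassert>
#include <vector>

// BUGGY: the fairseq correction runs over all of `content` on every
// iteration, so the first sentence is corrected once per input string.
std::vector<int> tokenize_batch_buggy(const std::vector<std::vector<int>>& batch) {
  std::vector<int> content;
  for (const auto& sent : batch) {
    content.insert(content.end(), sent.begin(), sent.end());
    // Stand-in for the fairseq correction (really remaps special ids too).
    for (auto& id : content) id += 1;
  }
  return content;
}

// FIXED: accumulate all sentences first, then apply the correction once.
std::vector<int> tokenize_batch_fixed(const std::vector<std::vector<int>>& batch) {
  std::vector<int> content;
  for (const auto& sent : batch) {
    content.insert(content.end(), sent.begin(), sent.end());
  }
  for (auto& id : content) id += 1;
  return content;
}
```

With three single-token inputs, the buggy version shifts the first token three times and the second twice, while the fixed version shifts every token exactly once, which matches the per-sentence offsets I'm seeing.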
For example, on this batch of three sentences the first sentence has its BOS 4 higher and the rest of its tokens 2 higher, the second sentence has its BOS 3 higher and the rest of its tokens 1 higher, while the last sentence is correct. The example uses this tokenizer, though the issue is present in anything that hits the fairseq correction.