Fairseq support in SentencepieceTokenizer doesn't work with batches #639

Closed: Craigacp closed this issue 8 months ago

Craigacp commented 8 months ago

I've exported some HF SentencePiece tokenizers that rely on the fairseq vocabulary alignment, using the SentencepieceTokenizer op with the fairseq flag set as an input to the op. This works when I tokenize a single element, but fails for batches (I have logic elsewhere in my ONNX export which appropriately batches the SentencePiece output and adds the right padding). The issue is that the fairseq transformation here runs from content.begin() to content.end(), but it sits inside the loop over input strings, so it is applied multiple times: the ids appended by earlier iterations get remapped again on every later iteration. I think the fix is just to move the fairseq correction outside the for loop (still guarded by add_rev as well), and I can send a PR for that if you want. A sketch of what I mean follows.
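Roughly this (illustrative only; TokenizeBatch and input_strings are names I made up, and the guard and remap are my reading of the kernel, not a quote of it):

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Sketch of the proposed fix: run the fairseq id remap once over the
// accumulated batch output instead of once per input string.
void TokenizeBatch(const std::vector<std::string>& input_strings,
                   bool fairseq, bool add_rev,
                   std::vector<int>& content) {
  for (const std::string& s : input_strings) {
    (void)s;
    // ... run SentencePiece on s and append its ids to `content` ...
    // BUG (current behaviour): the fairseq remap lives here, so the ids
    // appended by earlier iterations are remapped again on every later
    // iteration.
  }
  // Fix: apply the spm -> fairseq alignment exactly once, keeping the
  // existing fairseq/add_rev guard (condition approximated here).
  if (fairseq && !add_rev) {
    std::for_each(content.begin(), content.end(), [](int& id) {
      // spm <unk>=0 -> 3, <s>=1 -> 0, </s>=2 -> 2, everything else +1
      id = (id == 0) ? 3 : (id == 1) ? 0 : (id == 2) ? 2 : id + 1;
    });
  }
}
```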

For example, on this batch of three sentences the first sentence's BOS is 4 higher and the rest of its tokens are 2 higher, the second sentence's BOS is 3 higher and the rest of its tokens are 1 higher, while the last sentence is correct. The example uses this tokenizer, though the issue is present in anything that hits the fairseq correction.

HF:
[0, 1514, 5713, 5655, 716, 4783, 13, 2148, 9407, 7, 2],
[0, 24340, 4197, 29833, 427, 6125, 29841, 24135, 29829, 77, 30084, 752, 3000, 2],
[0, 29834, 30009, 3579, 3375, 29832, 5713, 5655, 716, 29843, 2148, 1567, 742, 3122, 7, 2]

ONNX:
[4, 1516, 5715, 5657, 718, 4785, 15, 2150, 9409, 9, 2],
[3, 24341, 4198, 29834, 428, 6126, 29842, 24136, 29830, 78, 30085, 753, 3001, 2],
[0, 29834, 30009, 3579, 3375, 29832, 5713, 5655, 716, 29843, 2148, 1567, 742, 3122, 7, 2]
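These offsets are consistent with the remap running once per remaining loop iteration: in a batch of n, sentence i gets it n - i + 1 times, so each extra application adds 1 to every ordinary token id and walks the BOS through the special-token cases, while the EOS id 2 is a fixed point (which is why every sentence still ends in 2). A tiny simulation (the remap here is my reading of the spm-to-fairseq alignment, not the kernel source) reproduces the BOS ids above:

```cpp
#include <cstdio>

// My reading of the spm -> fairseq alignment: <unk>=0 -> 3, <s>=1 -> 0,
// </s>=2 -> 2, everything else shifts by +1.
int remap(int id) {
  return (id == 0) ? 3 : (id == 1) ? 0 : (id == 2) ? 2 : id + 1;
}

int main() {
  const int bos = 1;  // raw SentencePiece id of <s>
  // Sentence i in a batch of 3 gets the remap applied (3 - i + 1) times.
  for (int times = 1; times <= 3; ++times) {
    int id = bos;
    for (int k = 0; k < times; ++k) id = remap(id);
    std::printf("remap applied %d time(s): BOS id 1 -> %d\n", times, id);
  }
  // Prints 0, 3, 4 -- the BOS ids of the third, second, and first
  // sentences in the ONNX output above.
  return 0;
}
```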