microsoft / onnxruntime-extensions

onnxruntime-extensions: A specialized pre- and post-processing library for ONNX Runtime
MIT License

Fix batching in fairseq SentencepieceTokenizer #640

Closed Craigacp closed 8 months ago

Craigacp commented 8 months ago

This fixes #639 by moving the fairseq id patching out of the loop so it's not applied multiple times to sequences earlier in the batch.

I've added a check to ensure it's not applied when add_rev is true, mirroring its position in the for loop, but I'm unsure whether that's required (presumably there are no HF tokenizers which emit a reversed string and have this fairseq hack?).

I tested it against the example in #639 and both batches and single sequences work correctly now.
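
To illustrate the failure mode described above, here is a minimal Python sketch; it is not the library's actual C++ code. The helper names, the flat `content` buffer, and the `+1` offset used as the "fairseq patch" are all assumptions chosen for illustration. It only shows why patching inside the per-sequence loop re-patches ids from earlier sequences in the batch, and why patching once after the loop does not.

```python
# Hypothetical stand-in for the fairseq id adjustment; the real patch in
# onnxruntime-extensions may differ. Here it simply shifts every id by 1.
FAIRSEQ_OFFSET = 1


def patch_fairseq_ids(ids):
    """Apply the (assumed) fairseq adjustment in place to every id."""
    for i in range(len(ids)):
        ids[i] += FAIRSEQ_OFFSET


def encode_batch_buggy(batch):
    """Patch inside the loop: earlier sequences get patched again each pass."""
    content = []
    for seq in batch:
        content.extend(seq)
        patch_fairseq_ids(content)  # touches ids from all earlier sequences too
    return content


def encode_batch_fixed(batch, add_rev=False):
    """Patch once after the loop, skipped when add_rev is set (assumed flag)."""
    content = []
    for seq in batch:
        content.extend(seq)
    if not add_rev:
        patch_fairseq_ids(content)
    return content


batch = [[10, 11], [20]]
print(encode_batch_buggy(batch))  # [12, 13, 21]: first sequence patched twice
print(encode_batch_fixed(batch))  # [11, 12, 21]: every id patched exactly once
```

With a single sequence both versions agree, which is consistent with the bug only showing up for batches.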

wenbingl commented 8 months ago

/azp run onnxruntime-extensions.CI

azure-pipelines[bot] commented 8 months ago

No pipelines are associated with this pull request.

wenbingl commented 8 months ago

/azp run onnxruntime-extensions.CI

azure-pipelines[bot] commented 8 months ago

Azure Pipelines successfully started running 1 pipeline(s).

Craigacp commented 8 months ago

I ran the Python tests locally on macOS ARM64 and everything passed. The access violation in the Windows tests happens in either test_cliptok or test_cv2. Both pass on my machine, neither uses the code I changed, and both also passed on Linux and macOS in the CI. Is there anything more I can do to track the failure down?

wenbingl commented 8 months ago

@sayanshaw24, can you review this PR?