microsoft / onnxruntime-extensions

onnxruntime-extensions: A specialized pre- and post- processing library for ONNX Runtime
MIT License
295 stars 80 forks source link

Add initial tiktoken support #729

Open sayanshaw24 opened 1 month ago

sayanshaw24 commented 4 weeks ago

Where is the unit test?

the unit test you included for pp_api_tests in your PR here: https://github.com/microsoft/onnxruntime-extensions/pull/726/files#diff-f4495557b563fdd392dbf72a0969d541e8288335dc2cd2e48b44d093de843cbfR121 will use this implementation as I moved the test data files into test/data2 so it will run that test, recognize there is no tokenizer.json and use the TikTokenizer.

it works successfully as all pp_api_tests pass here: https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=1398194&view=logs&j=719f0e07-ada6-5167-1399-95b4f6fab546&t=8a37696c-a395-5292-3877-9451da366bed&l=1653

wenbingl commented 4 weeks ago

Where is the unit test?

the unit test you included for pp_api_tests in your PR here: https://github.com/microsoft/onnxruntime-extensions/pull/726/files#diff-f4495557b563fdd392dbf72a0969d541e8288335dc2cd2e48b44d093de843cbfR121 will use this implementation as I moved the test data files into test/data2 so it will run that test, recognize there is no tokenizer.json and use the TikTokenizer.

it works successfully as all pp_api_tests pass here: https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=1398194&view=logs&j=719f0e07-ada6-5167-1399-95b4f6fab546&t=8a37696c-a395-5292-3877-9451da366bed&l=1653

that was testing a converted HF json file, instead of tiktoken file. Please create your own test.

sayanshaw24 commented 4 weeks ago

Where is the unit test?

the unit test you included for pp_api_tests in your PR here: https://github.com/microsoft/onnxruntime-extensions/pull/726/files#diff-f4495557b563fdd392dbf72a0969d541e8288335dc2cd2e48b44d093de843cbfR121 will use this implementation as I moved the test data files into test/data2 so it will run that test, recognize there is no tokenizer.json and use the TikTokenizer. it works successfully as all pp_api_tests pass here: https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=1398194&view=logs&j=719f0e07-ada6-5167-1399-95b4f6fab546&t=8a37696c-a395-5292-3877-9451da366bed&l=1653

that was testing a converted HF json file, instead of tiktoken file. Please create your own test.

I thought this one was for the converted HF json file?: https://github.com/microsoft/onnxruntime-extensions/blob/main/test/pp_api_test/test_tokenizer.cc#L93 i remember when you added this and data/tiktoken even has the converted tokenizer.json file and when I asked you mentioned this was the converted HF json file.

This one is different: https://github.com/microsoft/onnxruntime-extensions/blob/main/test/pp_api_test/test_tokenizer.cc#L120 maybe you added it twice accidentally or even for local testing? I had basically the same test locally but did not stage it since you already added it. It's not being run as it stands right now because there is nothing in test/data2 folder, so it is skipped. With this PR, it is being run because I added the necessary .tiktoken and tokenizer_config.json files in test/data2/phi-3-small.

But regardless, I have no issues adding another UT - let me know if you would like me to. Just wanted to clarify about this benign test at the moment since Phi3_S_Tokenizer right now is not run (I thought you added it for this purpose).

Sorry for the confusion.