microsoft / onnxruntime-extensions

onnxruntime-extensions: A specialized pre- and post- processing library for ONNX Runtime
MIT License
295 stars, 80 forks

Add the tokenizer C ABI #693

Closed: wenbingl closed this 2 months ago

sayanshaw24 commented 2 months ago

I wonder if even for unit tests here we can extract test data JSON files from HF to avoid adding the large files to our repo? Since we'll only be running the tests when we make changes to the C API and end users won't need to run them, the time taken to download them at runtime should be fine.

wenbingl commented 2 months ago

> I wonder if even for unit tests here we can extract test data JSON files from HF to avoid adding the large files to our repo? Since we'll only be running the tests when we make changes to the C API and end users won't need to run them, the time taken to download them at runtime should be fine.

Downloading HF data from the C native tests would add an extra code dependency to the tests. Unless the tokenizer data becomes much larger than it is now, we don't need to worry about this yet.
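
For reference, here is a minimal sketch of what the download-at-test-time idea could look like if it were done in a small helper script that runs before the native tests, rather than inside the C test code itself. This is only an illustration under stated assumptions, not something in this PR: the function names `hf_file_url` and `fetch_test_data`, the cache layout, and the `org/model` repo id are hypothetical. The only external fact relied on is the Hugging Face Hub file URL pattern `https://huggingface.co/<repo_id>/resolve/<revision>/<filename>`.

```python
import os
import urllib.request
from pathlib import Path

HF_BASE = "https://huggingface.co"


def hf_file_url(repo_id, filename, revision="main"):
    """Build the Hub download URL for one file in a model repo."""
    return f"{HF_BASE}/{repo_id}/resolve/{revision}/{filename}"


def fetch_test_data(repo_id, filename, cache_dir, revision="main",
                    opener=urllib.request.urlopen):
    """Return a local path to `filename`, downloading it only if missing.

    `opener` is injectable so the function can be exercised without
    network access (hypothetical design choice for this sketch).
    """
    # Hypothetical cache layout: <cache_dir>/<org>--<model>/<filename>
    target = Path(cache_dir) / repo_id.replace("/", "--") / filename
    if target.exists():
        return target  # already cached, no download needed

    target.parent.mkdir(parents=True, exist_ok=True)
    tmp = target.with_name(target.name + ".part")
    with opener(hf_file_url(repo_id, filename, revision)) as resp, \
            open(tmp, "wb") as out:
        out.write(resp.read())
    # Rename only after the write completes, so a failed download
    # never leaves a truncated file at the final path.
    os.replace(tmp, target)
    return target
```

A test setup step could call `fetch_test_data("org/model", "tokenizer.json", "test/data/cache")` and pass the resulting path to the native test binary. That keeps any HTTP dependency out of the C test code, at the cost of requiring network access (or a warm cache) wherever the tests run.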