Open WA225 opened 1 month ago
Describe the bug Using the OGA tokenizer to encode wikitext-2-raw-v1 hangs and does not return, but it works fine for wikitext-2-v1.
To Reproduce Steps to reproduce the behavior:

```python
import onnxruntime_genai as og
from datasets import load_dataset

tokenizer = og.Tokenizer(model)
testdata = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
tokenizer.encode("\n\n".join(testdata["text"]))
```
Expected behavior Returns a list of token IDs for the encoded wikitext-2 dataset.
Desktop (please complete the following information):
- OS: Windows 11
- onnxruntime-genai-0.4.0
Considering this is a very large text operation, how long did you wait for the result?
@WA225, thanks for reporting this issue. The sentencepiece-converted tokenizer does not split long texts into smaller segments before applying BPE merges, which can lead to very long processing times on lengthy inputs, though the call will eventually complete. As a workaround, you can process the text sentence by sentence rather than in a single batch. We will incorporate text splitting into the tokenization process to address this issue soon.
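The suggested workaround can be sketched as a small helper that splits the text into segments and concatenates the per-segment token lists. This is a minimal sketch, not part of the library: `encode_in_chunks` is a hypothetical name, and `encode_fn` stands in for a real `tokenizer.encode` callable (e.g. from `og.Tokenizer(model)`). Note that splitting can produce slightly different token IDs at segment boundaries than encoding the whole text at once.

```python
def encode_in_chunks(encode_fn, text, sep="\n\n"):
    """Encode `text` one segment at a time to avoid BPE slowdowns on long inputs.

    encode_fn: a callable mapping a string to a list of token IDs
               (e.g. og.Tokenizer(model).encode).
    sep:       boundary to split on; paragraphs here, but sentences also work.
    """
    tokens = []
    for chunk in text.split(sep):
        if chunk:  # skip empty segments produced by consecutive separators
            tokens.extend(encode_fn(chunk))
    return tokens


# Usage with the repro above (tokenizer and testdata assumed defined):
#   ids = encode_in_chunks(tokenizer.encode, "\n\n".join(testdata["text"]))
```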
Thank you for your help @wenbingl
I have made a PR to avoid this slowness on long texts: https://github.com/microsoft/onnxruntime-extensions/pull/799. After it is integrated into genai, you can try again.