Open WA225 opened 1 month ago
Describe the bug Using the OGA tokenizer to encode wikitext-2-raw-v1 hangs and does not return, but it works fine for wikitext-2-v1.
To Reproduce Steps to reproduce the behavior:

```python
import onnxruntime_genai as og
from datasets import load_dataset

tokenizer = og.Tokenizer(model)
testdata = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
tokenizer.encode("\n\n".join(testdata["text"]))
```
Expected behavior Returns a list of token IDs for the encoded wikitext-2 dataset.
Desktop (please complete the following information):
- OS: Windows 11
- onnxruntime-genai-0.4.0
Considering this is a very large text operation, how long did you wait for the result?
@WA225, thanks for reporting this issue. The sentencepiece-converted tokenizer does not split long texts into smaller segments before applying BPE merges, which can lead to very long processing times on lengthy inputs, though the call will eventually complete. As a workaround, you can process the text sentence by sentence rather than in a single batch. We will incorporate text splitting into the tokenization process to address this issue soon.
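The suggested workaround can be sketched as a small helper that splits the text into segments and concatenates the per-segment token lists. This is a minimal sketch, not part of the library: `encode_in_chunks` is a hypothetical name, and `encode_fn` stands in for a real `tokenizer.encode` callable (e.g. from `og.Tokenizer(model)`). Note that splitting can produce slightly different token IDs at segment boundaries than encoding the whole text at once.

```python
def encode_in_chunks(encode_fn, text, sep="\n\n"):
    """Encode `text` one segment at a time to avoid BPE slowdowns on long inputs.

    encode_fn: a callable mapping a string to a list of token IDs
               (e.g. og.Tokenizer(model).encode).
    sep:       boundary to split on; paragraphs here, but sentences also work.
    """
    tokens = []
    for chunk in text.split(sep):
        if chunk:  # skip empty segments produced by consecutive separators
            tokens.extend(encode_fn(chunk))
    return tokens


# Usage with the repro above (tokenizer and testdata assumed defined):
#   ids = encode_in_chunks(tokenizer.encode, "\n\n".join(testdata["text"]))
```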
Thank you for your help @wenbingl
I have made a PR to avoid this slowness on long texts: https://github.com/microsoft/onnxruntime-extensions/pull/799. After it is integrated into genai, you can try again.