mlc-ai / xgrammar

Efficient, Flexible and Portable Structured Generation
https://xgrammar.mlc.ai/
Apache License 2.0
380 stars 18 forks source link

[Tokenizer] Let stop_token_ids be the eos_token_id of tokenizer by default #96

Closed Ubospica closed 6 days ago

Ubospica commented 6 days ago

This PR sets the default stop_token_ids of TokenizerInfo be the eos_token_id of the huggingface tokenizer.

Previously we auto-detect the stop token ids based on a set of builtin stop token strings. However, some downstream frameworks may not recognize the stop token ids detected by us.

After this PR, the auto-detection only happens when the tokenizer does not have a eos_token_id defined.