mlc-ai / tokenizers-cpp

Universal cross-platform tokenizers binding to HF and sentencepiece
Apache License 2.0

Add support for querying vocabulary from tokenizer #22

Closed. Ubospica closed this 6 months ago

Ubospica commented 6 months ago

This PR adds the following methods to the Tokenizer class so that callers can query the tokenizer's vocabulary. This supports downstream uses such as stop-string checking and grammar checking.

  /*!
   * \brief Returns the vocabulary size. Special tokens are considered.
   */
  virtual size_t GetVocabSize() = 0;
  /*!
   * \brief Convert the given id to its corresponding token if it exists. If not, return an
   * empty string.
   */
  virtual std::string IdToToken(int32_t token_id) = 0;
  /*!
   * \brief Convert the given token to its corresponding id if it exists. If not, return -1.
   */
  virtual int32_t TokenToId(const std::string& token) = 0;
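
To make the contract of these three methods concrete, here is a minimal sketch of a toy in-memory implementation together with a stop-string scan over the vocabulary. `VocabQuery`, `ToyTokenizer`, and `TokensContaining` are hypothetical names used only for illustration; they are not part of tokenizers-cpp, and the real `Tokenizer` class has other members that are not shown.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-in that declares only the three methods added by this PR.
class VocabQuery {
 public:
  virtual ~VocabQuery() = default;
  virtual size_t GetVocabSize() = 0;
  virtual std::string IdToToken(int32_t token_id) = 0;
  virtual int32_t TokenToId(const std::string& token) = 0;
};

// Toy in-memory implementation (an assumption for illustration, not library code).
class ToyTokenizer : public VocabQuery {
 public:
  explicit ToyTokenizer(std::vector<std::string> vocab) : vocab_(std::move(vocab)) {
    for (size_t i = 0; i < vocab_.size(); ++i) {
      token_to_id_[vocab_[i]] = static_cast<int32_t>(i);
    }
  }
  size_t GetVocabSize() override { return vocab_.size(); }
  // Returns an empty string for ids outside the vocabulary.
  std::string IdToToken(int32_t token_id) override {
    if (token_id < 0 || static_cast<size_t>(token_id) >= vocab_.size()) return "";
    return vocab_[token_id];
  }
  // Returns -1 for tokens that are not in the vocabulary.
  int32_t TokenToId(const std::string& token) override {
    auto it = token_to_id_.find(token);
    return it == token_to_id_.end() ? -1 : it->second;
  }

 private:
  std::vector<std::string> vocab_;
  std::unordered_map<std::string, int32_t> token_to_id_;
};

// Collects every token id whose raw token text contains the given stop string.
std::vector<int32_t> TokensContaining(VocabQuery& tok, const std::string& stop) {
  std::vector<int32_t> ids;
  for (size_t i = 0; i < tok.GetVocabSize(); ++i) {
    int32_t id = static_cast<int32_t>(i);
    if (tok.IdToToken(id).find(stop) != std::string::npos) ids.push_back(id);
  }
  return ids;
}
```

A downstream consumer could run a scan like `TokensContaining` once after loading the tokenizer and cache the resulting ids, rather than decoding every candidate token during generation.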

Tokenizer build time:

Tokenizer: SentencePiece
Load time: 5 ms

Tokenizer: Huggingface
Load time: 30 ms

Tokenizer: RWKVWorld
Load time: 113 ms

tqchen commented 6 months ago

Let us call id_to_token directly; see the related APIs.

This would avoid the post-processing done by the decode pipeline.
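
The difference between a raw token lookup and a decode call can be sketched with a toy vocabulary. This is only an illustration under stated assumptions: `ToyIdToToken` and `ToyDecode` are hypothetical helpers, and the post-processing shown (turning the SentencePiece-style U+2581 marker into a space and trimming one leading space) is a simplified stand-in for what a real decode pipeline may do.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// UTF-8 bytes of U+2581, the word-boundary marker used in SentencePiece-style vocabularies.
const std::string kMarker = "\xE2\x96\x81";

// Toy raw lookup: returns the vocabulary entry unchanged, or "" if the id is out of range.
std::string ToyIdToToken(const std::vector<std::string>& vocab, int id) {
  if (id < 0 || static_cast<size_t>(id) >= vocab.size()) return "";
  return vocab[id];
}

// Toy decode: concatenates the raw tokens, then applies simplified post-processing.
std::string ToyDecode(const std::vector<std::string>& vocab, const std::vector<int>& ids) {
  std::string text;
  for (int id : ids) text += ToyIdToToken(vocab, id);
  // Replace each marker with a single space.
  size_t pos = 0;
  while ((pos = text.find(kMarker, pos)) != std::string::npos) {
    text.replace(pos, kMarker.size(), " ");
    pos += 1;
  }
  // Drop one leading space, as a decoder typically would.
  if (!text.empty() && text[0] == ' ') text.erase(0, 1);
  return text;
}
```

With a vocabulary entry of the marker followed by `Hello`, `ToyIdToToken` returns the entry with the marker intact, while `ToyDecode` on the same single id returns `Hello`. A vocabulary query that goes through decode would therefore lose information that grammar or stop-string matching may need.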

tqchen commented 6 months ago

For the Rust binding, we can store the result string in the wrapper and reuse https://github.com/mlc-ai/tokenizers-cpp/blob/main/include/tokenizers_c.h#L31
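
The "store the result in the wrapper" pattern can be sketched as follows. Everything here is hypothetical: `ToyWrapper`, `toy_id_to_token`, and `toy_get_result` are illustrative names, not the functions declared in `tokenizers_c.h`, and the wrapper is modelled in C++ rather than Rust to keep the example self-contained.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical wrapper object; in the real project the wrapper lives on the Rust side.
struct ToyWrapper {
  std::vector<std::string> vocab;
  std::string result;  // buffer owned by the wrapper and overwritten by each call
};

// Hypothetical C-style entry point: looks up the token and stores it in the wrapper.
void toy_id_to_token(ToyWrapper* w, int32_t id) {
  if (id >= 0 && static_cast<size_t>(id) < w->vocab.size()) {
    w->result = w->vocab[id];
  } else {
    w->result.clear();
  }
}

// Hypothetical getter: exposes a borrowed view of the stored result.
// The pointer stays valid only until the next call that overwrites `result`.
void toy_get_result(const ToyWrapper* w, const char** data, size_t* len) {
  *data = w->result.data();
  *len = w->result.size();
}

// C++ side: copies the borrowed bytes into an owned std::string before returning.
std::string IdToToken(ToyWrapper* w, int32_t id) {
  toy_id_to_token(w, id);
  const char* data = nullptr;
  size_t len = 0;
  toy_get_result(w, &data, &len);
  return std::string(data, len);
}
```

Keeping the buffer inside the wrapper means no allocation has to cross the language boundary: the caller only borrows a pointer and length, and copies them immediately.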

tqchen commented 6 months ago

std::string IdToToken(int32_t token_id);

Ubospica commented 6 months ago

cc @tqchen
