Add support for querying vocabulary from tokenizer

Ubospica commented 6 months ago

This PR adds the following methods to the Tokenizer class so that callers can query the tokenizer's vocabulary. They support downstream uses such as stop-string checking and grammar checking.

  /*!
   * \brief Returns the vocabulary size. Special tokens are considered.
   */
  virtual size_t GetVocabSize() = 0;
  /*!
   * \brief Convert the given id to its corresponding token if it exists. If not, return an
   * empty string.
   */
  virtual std::string IdToToken(int32_t token_id) = 0;
  /*!
   * \brief Convert the given token to its corresponding id if it exists. If not, return -1.
   */
  virtual int32_t TokenToId(const std::string& token) = 0;
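
To make the contract of these three methods concrete, here is a minimal sketch of an in-memory backend and one downstream use (stop-string checking). `VocabQuery`, `ToyTokenizer`, and `TokensContaining` are hypothetical names used only for illustration; they are not part of tokenizers-cpp, and the real backends delegate to SentencePiece, Huggingface, or RWKV rather than to a plain vector.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-in for the part of the Tokenizer interface quoted above.
class VocabQuery {
 public:
  virtual ~VocabQuery() = default;
  virtual size_t GetVocabSize() = 0;
  virtual std::string IdToToken(int32_t token_id) = 0;
  virtual int32_t TokenToId(const std::string& token) = 0;
};

// Toy backend (assumption): the token id is the index into the vector.
class ToyTokenizer : public VocabQuery {
 public:
  explicit ToyTokenizer(std::vector<std::string> vocab) : id_to_token_(std::move(vocab)) {
    for (size_t i = 0; i < id_to_token_.size(); ++i) {
      // emplace keeps the first id if the same token text appears twice.
      token_to_id_.emplace(id_to_token_[i], static_cast<int32_t>(i));
    }
  }

  size_t GetVocabSize() override { return id_to_token_.size(); }

  // Returns an empty string for ids outside the vocabulary.
  std::string IdToToken(int32_t token_id) override {
    if (token_id < 0 || static_cast<size_t>(token_id) >= id_to_token_.size()) return "";
    return id_to_token_[static_cast<size_t>(token_id)];
  }

  // Returns -1 for tokens that are not in the vocabulary.
  int32_t TokenToId(const std::string& token) override {
    auto it = token_to_id_.find(token);
    return it == token_to_id_.end() ? -1 : it->second;
  }

 private:
  std::vector<std::string> id_to_token_;
  std::unordered_map<std::string, int32_t> token_to_id_;
};

// Downstream use: collect every id whose raw token text contains `stop`.
std::vector<int32_t> TokensContaining(VocabQuery& tok, const std::string& stop) {
  assert(!stop.empty());  // an empty stop string would match every token
  std::vector<int32_t> ids;
  const int32_t n = static_cast<int32_t>(tok.GetVocabSize());
  for (int32_t id = 0; id < n; ++id) {
    if (tok.IdToToken(id).find(stop) != std::string::npos) ids.push_back(id);
  }
  return ids;
}
```

A caller could precompute `TokensContaining(tok, "\n\n")` once and then test generated ids against that set, instead of decoding the output after every step.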

Tokenizer build time:

Tokenizer: SentencePiece
Load time: 5 ms

Tokenizer: Huggingface
Load time: 30 ms

Tokenizer: RWKVWorld
Load time: 113 ms

tqchen commented 6 months ago

Let us call id_to_token directly; see the related APIs:

https://docs.rs/tokenizers/latest/tokenizers/tokenizer/struct.Tokenizer.html (see id_to_token)
https://github.com/google/sentencepiece/blob/master/src/sentencepiece_processor.h#L629
RWKV should have its own internal vocab.

This would avoid the post-processing done by the decode pipeline.
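
The difference between a raw vocabulary lookup and a decode call can be illustrated with a toy example. The sketch below imitates one common kind of post-processing, the SentencePiece-style U+2581 marker being turned back into spaces. `RawToken` and `ToyDecode` are hypothetical helpers, not functions from any of the libraries mentioned; real decode pipelines differ per tokenizer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// UTF-8 encoding of U+2581, used here as a "preceded by a space" marker.
const std::string kSpaceMarker = "\xE2\x96\x81";

// Raw lookup: returns the vocabulary entry unchanged, or "" if out of range.
std::string RawToken(const std::vector<std::string>& vocab, int32_t id) {
  if (id < 0 || static_cast<size_t>(id) >= vocab.size()) return "";
  return vocab[static_cast<size_t>(id)];
}

// Toy decode: joins the pieces, maps each marker to ' ', drops one leading space.
std::string ToyDecode(const std::vector<std::string>& vocab, const std::vector<int32_t>& ids) {
  std::string joined;
  for (int32_t id : ids) joined += RawToken(vocab, id);
  std::string out;
  size_t pos = 0;
  while (pos < joined.size()) {
    if (joined.compare(pos, kSpaceMarker.size(), kSpaceMarker) == 0) {
      out += ' ';
      pos += kSpaceMarker.size();
    } else {
      out += joined[pos++];
    }
  }
  if (!out.empty() && out.front() == ' ') out.erase(0, 1);
  return out;
}

// Usage: the raw token keeps the marker, the decoded text does not.
void CompareRawAndDecoded() {
  const std::vector<std::string> vocab = {kSpaceMarker + "Hello", kSpaceMarker + "world"};
  assert(RawToken(vocab, 1) == kSpaceMarker + "world");
  assert(ToyDecode(vocab, {0, 1}) == "Hello world");
}
```

Because decoding a single id would give "world" here while the vocabulary entry is the marker followed by "world", a vocabulary query built on top of decode would not report the stored token text.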

tqchen commented 6 months ago

For the Rust binding, we can store the result string in the wrapper and reuse https://github.com/mlc-ai/tokenizers-cpp/blob/main/include/tokenizers_c.h#L31
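
One way to read this suggestion is as an ownership pattern: the wrapper keeps the looked-up string alive, and the C interface only hands out a pointer and a length. The sketch below shows that pattern in C++ under stated assumptions. `ToyWrapper`, `toy_id_to_token`, and `toy_get_result_str` are hypothetical names, the actual wrapper is written in Rust, and this is not the content of tokenizers_c.h.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical wrapper state behind an opaque handle.
struct ToyWrapper {
  std::vector<std::string> vocab;
  std::string result;  // owned buffer, overwritten by each lookup
};

extern "C" {
typedef void* ToyHandle;

// Looks up token_id and stores the token in the wrapper; "" if out of range.
void toy_id_to_token(ToyHandle handle, int32_t token_id) {
  assert(handle != nullptr);
  ToyWrapper* w = static_cast<ToyWrapper*>(handle);
  if (token_id < 0 || static_cast<size_t>(token_id) >= w->vocab.size()) {
    w->result.clear();
  } else {
    w->result = w->vocab[static_cast<size_t>(token_id)];
  }
}

// Exposes the stored string; the pointer is valid only until the next lookup.
void toy_get_result_str(ToyHandle handle, const char** data, size_t* len) {
  assert(handle != nullptr && data != nullptr && len != nullptr);
  ToyWrapper* w = static_cast<ToyWrapper*>(handle);
  *data = w->result.data();
  *len = w->result.size();
}
}  // extern "C"

// C++ side: copy the bytes out immediately so the caller owns the result.
std::string IdToToken(ToyHandle handle, int32_t token_id) {
  toy_id_to_token(handle, token_id);
  const char* data = nullptr;
  size_t len = 0;
  toy_get_result_str(handle, &data, &len);
  return std::string(data, len);
}
```

Copying in `IdToToken` matters in this sketch because the wrapper's buffer is reused: a pointer obtained from one lookup would be invalidated by the next.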

tqchen commented 6 months ago

std::string IdToToken(int32_t token_id);

Ubospica commented 6 months ago

cc @tqchen

Ubospica commented 6 months ago

cc @tqchen

mlc-ai / tokenizers-cpp

Add support for querying vocabulary from tokenizer #22