google / sentencepiece

Unsupervised text tokenizer for Neural Network-based text generation.
Apache License 2.0

Please specify which model file to use with PaLM2 models #910

Closed — opyate closed this issue 1 year ago

opyate commented 1 year ago

As PaLM2 is Google's flagship LLM offering, it makes sense to spend a few lines in the README talking about the SentencePiece models used with the likes of text-bison, code-bison, etc. (Just like OpenAI does with tiktoken.)

The Vertex docs specify token limits for training data, but don't offer any advice on how to count those tokens: https://cloud.google.com/vertex-ai/docs/generative-ai/models/tune-text-models#dataset-format

The paper says that PaLM uses SentencePiece: https://arxiv.org/pdf/2204.02311.pdf

But which model?

import sentencepiece as spm

# Which .model file corresponds to text-bison, code-bison, etc.?
s = spm.SentencePieceProcessor(model_file='???.model')
len(s.EncodeAsPieces(my_training_datum))
opyate commented 1 year ago

A workaround seems to be https://cloud.google.com/vertex-ai/docs/generative-ai/get-token-count, but it would be great to do this myself with SentencePiece rather than making 1K+ API calls to check each training datum.
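For what it's worth, the workaround boils down to POSTing each datum to the model's countTokens endpoint. The sketch below only builds the request URL and JSON body (no network call); the endpoint path and payload shape follow the pattern in the linked Vertex AI docs, and PROJECT_ID and LOCATION are placeholders, not verified values.

```python
import json

# Placeholders -- substitute your own GCP project and region.
PROJECT_ID = "my-project"
LOCATION = "us-central1"
MODEL = "text-bison@001"

def count_tokens_request(datum: str) -> tuple[str, str]:
    """Build the (url, body) pair for a Vertex AI countTokens call.

    The path pattern is an assumption based on the Vertex AI docs:
    .../publishers/google/models/<model>:countTokens
    """
    url = (
        f"https://{LOCATION}-aiplatform.googleapis.com/v1/"
        f"projects/{PROJECT_ID}/locations/{LOCATION}/"
        f"publishers/google/models/{MODEL}:countTokens"
    )
    body = json.dumps({"instances": [{"prompt": datum}]})
    return url, body

url, body = count_tokens_request("some training example")
```

Even so, that's still one round trip per datum (or per batch, if the API accepts multiple instances per request), which is exactly the overhead a local SentencePiece model would avoid.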

taku910 commented 1 year ago

Questions or requests for software or systems using SentencePiece are not accepted here. Please ask directly to these developers.

opyate commented 10 months ago

> Please ask directly to these developers.

I ask here because the developers are here, surely?

Please let me know the contact details so I can ask them.

All I need to know is what to fill in for '???' so I can count tokens for text-bison@001:

s = spm.SentencePieceProcessor(model_file='???.model')