[Closed] opyate closed this issue 1 year ago
The workaround seems to be https://cloud.google.com/vertex-ai/docs/generative-ai/get-token-count ...but it would be great if I could do this myself using SentencePiece, rather than having to make 1K+ API calls to check each training datum.
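For context, the linked workaround is the `countTokens` REST endpoint. A sketch of what that call looks like, based on the linked page (`PROJECT_ID` is a placeholder, and the region is an assumption):

```shell
# PROJECT_ID is a placeholder; region and model name are assumptions
# based on the linked get-token-count docs.
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  "https://us-central1-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/us-central1/publishers/google/models/text-bison@001:countTokens" \
  -d '{"instances": [{"prompt": "Hello world"}]}'
```

This is exactly the per-datum round trip I'd like to avoid for 1K+ training examples.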
Questions or requests for software or systems using SentencePiece are not accepted here. Please ask directly to these developers.
I'm asking here because the developers are here, surely?
Otherwise, please let me know their contact details so I can ask them directly.
All I need to know is what to fill in for '???' to be able to do token counts for text-bison@001:
import sentencepiece as spm
s = spm.SentencePieceProcessor(model_file='???.model')
As PaLM2 is Google's flagship LLM offering, it makes sense to spend a few lines in the README talking about the SentencePiece models used with the likes of text-bison, code-bison, etc. (Just like OpenAI does with tiktoken.)
The Vertex docs specify token limits for training data, but don't offer any advice on how to count those tokens: https://cloud.google.com/vertex-ai/docs/generative-ai/models/tune-text-models#dataset-format
The paper says that PaLM uses SentencePiece: https://arxiv.org/pdf/2204.02311.pdf
But which SentencePiece model?