mistralai / mistral-common

Apache License 2.0
633 stars 57 forks

Update sentence piece & Update tokenizer table #23

Closed: pandora-s-git closed this 3 months ago

pandora-s-git commented 3 months ago

This PR updates sentencepiece to 0.2.0, as requested in other PRs, and updates the README table with what are hopefully the right tokenizers for the most recent releases:

| Open Model | Tokenizer |
|------------|-----------|
| Mistral 7B Instruct v0.1 | v1 |
| Mistral 7B Instruct v0.2 | v1 |
| Mistral 7B Instruct v0.3 | v3 |
| Mixtral 8x7B Instruct v0.1 | v1 |
| Mixtral 8x22B Instruct v0.1 | v3 |
| Mixtral 8x22B Instruct v0.3 | v3 |
| Codestral 22B v0.1 | v3 |

| Endpoint Model | Tokenizer |
|---------------|-----------|
| mistral-embed | v1 |
| open-mistral-7b | v3 |
| open-mixtral-8x7b | v1 |
| open-mixtral-8x22b | v3 |
| mistral-small-latest | v2 |
| mistral-large-latest | v2 |
| codestral-22b | v3 |
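
For anyone who wants to use the mapping programmatically, here is a minimal sketch that encodes the two tables as plain dictionaries. The names `OPEN_MODEL_TOKENIZERS`, `ENDPOINT_TOKENIZERS`, and `tokenizer_version` are illustrative only and are not part of mistral-common; the values are copied from the tables as proposed in this PR and may not be final.

```python
from typing import Optional

# Tokenizer versions copied from the tables in this PR (hypothetical names,
# not part of the mistral-common API).
OPEN_MODEL_TOKENIZERS = {
    "Mistral 7B Instruct v0.1": "v1",
    "Mistral 7B Instruct v0.2": "v1",
    "Mistral 7B Instruct v0.3": "v3",
    "Mixtral 8x7B Instruct v0.1": "v1",
    "Mixtral 8x22B Instruct v0.1": "v3",
    "Mixtral 8x22B Instruct v0.3": "v3",
    "Codestral 22B v0.1": "v3",
}

ENDPOINT_TOKENIZERS = {
    "mistral-embed": "v1",
    "open-mistral-7b": "v3",
    "open-mixtral-8x7b": "v1",
    "open-mixtral-8x22b": "v3",
    "mistral-small-latest": "v2",
    "mistral-large-latest": "v2",
    "codestral-22b": "v3",
}


def tokenizer_version(name: str) -> Optional[str]:
    """Return the tokenizer version for an open model or endpoint name.

    Returns None when the name appears in neither table.
    """
    if name in OPEN_MODEL_TOKENIZERS:
        return OPEN_MODEL_TOKENIZERS[name]
    return ENDPOINT_TOKENIZERS.get(name)


if __name__ == "__main__":
    for model in ("Mistral 7B Instruct v0.3", "mistral-large-latest"):
        print(model, "->", tokenizer_version(model))
```

The returned version string could then be used to pick the matching constructor in mistral-common (for example `MistralTokenizer.v3()` for a `v3` entry), assuming the installed release exposes a constructor for that version.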
pandora-s-git commented 3 months ago

cc @patrickvonplaten, sorry to bother you, but could you review this? With sentencepiece 0.1.99, mistral-common breaks on recent Python versions and in a lot of environments. This PR also updates the tokenizers to what are hopefully the right versions for each model/endpoint.
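
To check whether an environment is still on the older sentencepiece release, something like the sketch below might help. It assumes plain numeric `X.Y.Z` version strings, and the helper names (`parse_version`, `meets_minimum`, `installed_version`) are made up for illustration; only `importlib.metadata` from the standard library is used.

```python
from importlib import metadata
from typing import Optional, Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn a purely numeric dotted version such as '0.1.99' into (0, 1, 99)."""
    return tuple(int(part) for part in version.split("."))


def meets_minimum(installed: str, minimum: str = "0.2.0") -> bool:
    """Compare versions numerically, so '0.1.99' counts as older than '0.2.0'."""
    return parse_version(installed) >= parse_version(minimum)


def installed_version(package: str = "sentencepiece") -> Optional[str]:
    """Return the installed version of the package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    if found is None:
        print("sentencepiece is not installed")
    elif meets_minimum(found):
        print(f"sentencepiece {found} is at least 0.2.0")
    else:
        print(f"sentencepiece {found} is older than 0.2.0")
```

Comparing integer tuples rather than raw strings avoids ordering mistakes with multi-digit components such as `99`.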