pytorch / torchchat

Run PyTorch LLMs locally on servers, desktop and mobile
BSD 3-Clause "New" or "Revised" License

[distributed] integrate chat tokenizers, and add llama3-8B model option #1110

Closed lessw2020 closed 2 weeks ago

lessw2020 commented 2 weeks ago

This PR: 1 - integrates the chat tokenizers by using the TokenizerArgs and _initialize_tokenizer functions from builder.py. With _build_chat_tokenizer() you can instantiate the same tokenizers that chat installs (rather than using the HF tokenizer).

example:

[rank0]:2024-09-05:15:57:26,835 INFO     [dist_run.py:83] using tokenizer = tokenizer.tiktoken.Tokenizer

and

[rank0]:2024-09-05:16:01:01,716 INFO     [dist_run.py:83] using tokenizer = sentencepiece.SentencePieceProcessor

2 - adds llama3-8B Instruct as a valid model option for distributed inference.

pytorch-bot[bot] commented 2 weeks ago

:link: Helpful Links

:test_tube: See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchchat/1110

Note: Links to docs will display an error until the docs builds have been completed.

:white_check_mark: No Failures

As of commit 12d5a56f2242f90ad99e6242373015a290048f3d with merge base d58923e85de3fa84b05239f23056de913cd76b76 (image): :green_heart: Looks good so far! There are no failures yet. :green_heart:

This comment was automatically generated by Dr. CI and updates every 15 minutes.