huggingface / nanotron

Minimalistic large language model 3D-parallelism training
Apache License 2.0

Enable masking when tp=1 #160

Closed YongjunHe closed 1 month ago

YongjunHe commented 6 months ago

In the config_tiny_llama.py example, the model's vocab_size is 256, while the GPT-2 tokenizer's vocabulary size is 50257. This mismatch is handled by masking out-of-vocabulary token IDs when tp>1, but it causes an indexing error when tp=1. This PR therefore ensures that masking is also enabled when tp=1.
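For context, a minimal sketch of the kind of masking involved, assuming a PyTorch-style embedding lookup. The function name masked_embedding and its signature are illustrative only, not nanotron's actual API:

```python
import torch
import torch.nn as nn

def masked_embedding(input_ids: torch.Tensor,
                     weight: torch.Tensor,
                     vocab_start: int,
                     vocab_end: int) -> torch.Tensor:
    # Token IDs outside the local vocab range [vocab_start, vocab_end)
    # are redirected to index 0 before the lookup, then zeroed after it.
    input_mask = (input_ids < vocab_start) | (input_ids >= vocab_end)
    masked_ids = input_ids.clone() - vocab_start
    masked_ids[input_mask] = 0                     # avoid out-of-range indexing
    out = nn.functional.embedding(masked_ids, weight)
    out[input_mask] = 0.0                          # drop contributions of masked IDs
    return out

# With tp=1 the "shard" covers [0, vocab_size). Without the mask, any token ID
# >= vocab_size (e.g. GPT-2 IDs up to 50256 against vocab_size=256) raises an
# IndexError inside the embedding lookup; with the mask, those IDs are zeroed out.
```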