huggingface / nanotron

Minimalistic large language model 3D-parallelism training
Apache License 2.0
1.23k stars 122 forks

Fix some bugs #100

Closed jordane95 closed 8 months ago

jordane95 commented 8 months ago

Now dp should be the third dimension of parallel ranks

Could you please elaborate on this?

One missing argument

Where is it 🤔

  1. This follows directly from the definition of the rank matrix: https://github.com/huggingface/nanotron/blob/ff3c7746577948743da08c4868aca46cbc0c110b/src/nanotron/parallel/context.py#L69-L76

  2. The missing is_expert_sharded variable would cause a "referenced before defined" error:

https://github.com/huggingface/nanotron/blob/ff3c7746577948743da08c4868aca46cbc0c110b/src/nanotron/serialize/weights.py#L72-L95
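To make both points concrete, here is a minimal sketch in plain Python. It is not nanotron's code. It assumes the rank matrix is laid out row-major with the axis order (expert, pipeline, data, tensor), which is my reading of the linked definition, so data parallelism sits on the third axis. The parallel sizes and the names rank_of, describe_shard and describe_shard_fixed are hypothetical and used only for illustration.

```python
from itertools import product

# Hypothetical parallel sizes; the axis order (expert, pipeline, data, tensor)
# is an assumption about the linked rank matrix definition.
ep, pp, dp, tp = 1, 2, 2, 2


def rank_of(e: int, p: int, d: int, t: int) -> int:
    """Row-major index into a matrix of shape (ep, pp, dp, tp)."""
    return ((e * pp + p) * dp + d) * tp + t


# A data-parallel group holds the ranks that differ only along the third axis.
dp_groups = [
    [rank_of(e, p, d, t) for d in range(dp)]
    for e, p, t in product(range(ep), range(pp), range(tp))
]
print(dp_groups)  # [[0, 2], [1, 3], [4, 6], [5, 7]]


def describe_shard(is_sharded: bool) -> str:
    """Hypothetical stand-in: the variable is only bound on one branch."""
    if is_sharded:
        is_expert_sharded = False
    # When is_sharded is False, is_expert_sharded was never assigned.
    return f"expert_sharded={is_expert_sharded}"


def describe_shard_fixed(is_sharded: bool) -> str:
    """Same logic with the variable bound on every path."""
    is_expert_sharded = False
    if is_sharded:
        is_expert_sharded = False
    return f"expert_sharded={is_expert_sharded}"


try:
    describe_shard(False)
    error_name = None
except UnboundLocalError as exc:
    error_name = type(exc).__name__

print(error_name)  # UnboundLocalError
print(describe_shard_fixed(False))  # expert_sharded=False
```

With this layout, a group built by varying any other axis would mix ranks from different parallel dimensions, and the second half shows why the variable has to be bound before every use.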