huggingface / nanotron

Minimalistic large language model 3D-parallelism training
Apache License 2.0
1.14k stars 107 forks source link

[Feature] Refactor ParallelContext.world_rank_matrix #94

Closed 0xkerem closed 6 months ago

0xkerem commented 6 months ago

Hi,

I've addressed Issue #77 by introducing the get_global_rank method instead of direct access to the world_rank_matrix attribute, as suggested. This change enhances code modularity and adheres to best practices.

I've thoroughly tested the modifications to ensure stability. Your feedback is appreciated.

- parallel_context.world_rank_matrix[
-     dist.get_rank(parallel_context.expert_pg),
-     get_pp_rank_of(target, module=mdl),
-     dist.get_rank(parallel_context.dp_pg),
-     dist.get_rank(parallel_context.tp_pg),
- ],
+ parallel_context.get_global_rank(
+     expert_parallel_rank=dist.get_rank(parallel_context.expert_pg),
+     pipeline_parallel_rank=get_pp_rank_of(target, module=mdl),
+     data_parallel_rank=dist.get_rank(parallel_context.dp_pg),
+     tensor_parallel_rank=dist.get_rank(parallel_context.tp_pg),
+ ),
xrsrke commented 6 months ago

@0xkerem very nice. Thanks for the PR. could you write a unit test for parallel_context.get_global_rank please?

NouamaneTazi commented 6 months ago

Hello I see some tests failing because of circular imports. It should be fixed on main now :)

0xkerem commented 6 months ago

Hello I see some tests failing because of circular imports. It should be fixed on main now :)

I saw the notification that the test failed on my phone, I came home a little late but now I'm on the computer. Thanks for reporting the reason for the failed test :)

0xkerem commented 6 months ago

https://github.com/huggingface/nanotron/blob/ff3c7746577948743da08c4868aca46cbc0c110b/src/nanotron/parallel/context.py#L128-L130

@NouamaneTazi Hi! Could there be a problem with these lines? The function is defined to return a 3-element tuple, but this code appears to return 4 elements. It also says without expert_parallel_rank in the comment line, but there is nothing to fulfill this condition. Or is there something I missed?

0xkerem commented 6 months ago

@xrsrke I think I finally passed the tests I added. There is currently an error left, which is from the main branch and not caused by me. Probably I solved it, but I removed the line I added in my last commit, thinking it might not be needed. I'll re-add that line and try one more time.

NouamaneTazi commented 6 months ago

Amazing job! Thanks for the great PR and your reactivity!! ❤️

0xkerem commented 6 months ago

Amazing job! Thanks for the great PR and your reactivity!! ❤️

You're welcome, and I thank you too! This process and your feedback helped me a lot. :)