Open zws98 opened 6 months ago
You could check whether drop-less MoE mode solves your issue; it is enabled by setting capacity_factor=0.
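To illustrate why the capacity factor can matter here, below is a minimal pure-Python sketch of top-1 routing with a per-expert capacity limit. It assumes the commonly used formulation capacity = ceil(capacity_factor * top_k * num_tokens / num_experts) and treats capacity_factor=0 as "grow capacity to the busiest expert's load". The names `expert_capacity` and `route_top1` are hypothetical and are not the library's API; the real implementation may differ in details.

```python
import math
from collections import Counter


def expert_capacity(num_tokens, num_experts, capacity_factor, top_k=1):
    """Slots per expert under a fixed capacity factor (assumed formulation)."""
    return math.ceil(capacity_factor * top_k * num_tokens / num_experts)


def route_top1(assignments, num_experts, capacity_factor):
    """Split token indices into (kept, dropped) for top-1 routing.

    assignments[i] is the expert chosen for token i.
    capacity_factor == 0 is treated as drop-less: capacity is raised to
    the load of the busiest expert, so every token keeps its slot.
    """
    counts = Counter(assignments)
    if capacity_factor == 0:
        capacity = max(counts.values(), default=0)
    else:
        capacity = expert_capacity(len(assignments), num_experts, capacity_factor)

    used = Counter()
    kept, dropped = [], []
    for token_idx, expert in enumerate(assignments):
        if used[expert] < capacity:
            used[expert] += 1
            kept.append(token_idx)
        else:
            # Expert is full: this token is skipped by the expert layer.
            dropped.append(token_idx)
    return kept, dropped


# 16 tokens, 8 experts, skewed routing: expert 0 is chosen by 6 tokens.
assignments = [0, 0, 0, 0, 0, 0, 1, 1, 2, 3, 4, 5, 6, 7, 2, 3]
kept_fixed, dropped_fixed = route_top1(assignments, 8, 1.25)
kept_dropless, dropped_dropless = route_top1(assignments, 8, 0)
print("capacity_factor=1.25 dropped:", dropped_fixed)
print("capacity_factor=0    dropped:", dropped_dropless)
```

With a fixed capacity, which tokens are dropped depends on the other tokens routed in the same batch, so the same input can be treated differently in different batches. Under the drop-less setting that particular source of variation disappears in this sketch.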
The results still differ from process to process, and they also differ from the results obtained with capacity_factor=1.25.
Could you share more information? I didn't understand what you meant.
Outputs from different GPUs:
STEP-10: loss = 21.11541, step_time = 3.628716 sec, perf = 0.08 tflops.
[Summary] Average synchronized step_time = 0.3628715753555298 sec.
STEP-10: loss = 21.11541, step_time = 3.670310 sec, perf = 0.07 tflops.
[Summary] Average synchronized step_time = 0.36703104972839357 sec.
STEP-10: loss = 21.11541, step_time = 3.689584 sec, perf = 0.07 tflops.
[Summary] Average synchronized step_time = 0.3689584493637085 sec.
STEP-10: loss = 21.11541, step_time = 3.675405 sec, perf = 0.07 tflops.
[Summary] Average synchronized step_time = 0.36754045486450193 sec.
STEP-10: loss = 21.11541, step_time = 3.681213 sec, perf = 0.07 tflops.
[Summary] Average synchronized step_time = 0.36812126636505127 sec.
STEP-10: loss = 21.11541, step_time = 3.629702 sec, perf = 0.08 tflops.
[Summary] Average synchronized step_time = 0.3629701852798462 sec.
STEP-10: loss = 21.11541, step_time = 3.700365 sec, perf = 0.07 tflops.
[Summary] Average synchronized step_time = 0.37003653049468993 sec.
STEP-10: loss = 21.11541, step_time = 3.658189 sec, perf = 0.08 tflops.
[Summary] Average synchronized step_time = 0.3658188819885254 sec.
I trained an MoE model on 8 GPUs with 8 experts. When I ran inference in parallel, I found that each process produced a similar but not identical result. What could be the cause of this?