shawntan / scattermoe

Triton-based implementation of Sparse Mixture of Experts.
Apache License 2.0

Model with balanced load runs slower than the imbalanced #10

Closed: CanyonWind closed this issue 5 months ago

CanyonWind commented 5 months ago

Hi, sorry for spamming the issues. Another thing I noticed when using scattermoe is that, as the title says, the model with load balancing enforced runs slightly slower than the model with an imbalanced load. Could you please share any insight into whether this is expected?

The speed difference isn't significant but it is fairly consistent: model A is always 2-3% slower. In large-scale training, even a minor speed difference matters a lot. I would appreciate any thoughts on this.

Two identical models:

shawntan commented 5 months ago

Unsure, but I would say this seems expected, especially if you can skip the computation of some experts when no tokens are routed to them.

Generally better to have load balancing on, otherwise you could be severely underutilising some experts, or not using them at all.
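For reference, a minimal sketch of the kind of auxiliary load-balancing loss that is commonly used (Switch-Transformer style). This is not part of scattermoe's API; the function name and shapes below are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits: torch.Tensor, top_k: int) -> torch.Tensor:
    """Switch-Transformer-style auxiliary loss: num_experts * sum_i(f_i * P_i)."""
    num_experts = router_logits.size(-1)
    probs = F.softmax(router_logits, dim=-1)                         # [tokens, experts]
    _, topk_idx = probs.topk(top_k, dim=-1)                          # [tokens, top_k]
    dispatch = F.one_hot(topk_idx, num_experts).float().sum(dim=1)   # [tokens, experts]
    f = dispatch.mean(dim=0) / top_k   # f_i: fraction of routing slots sent to expert i
    p = probs.mean(dim=0)              # P_i: mean router probability for expert i
    return num_experts * torch.sum(f * p)
```

During training this would typically be added to the task loss with a small coefficient, which pushes the router toward a uniform distribution over experts.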

CanyonWind commented 5 months ago

> especially if you can skip the computation of some experts when no tokens are routed to them.

Thanks, but since scattermoe doesn't set an expert capacity, shouldn't the computation cost theoretically be the same regardless of whether the load is balanced or not? As long as the total number of tokens and the expert dimensions are the same, scattermoe should perform the same total amount of computation in both scenarios.
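A quick back-of-the-envelope check of that argument, with made-up sizes, assuming each routed token goes through one up-projection and one down-projection:

```python
# Toy FLOP count: without an expert-capacity limit, total work depends only on
# the number of (token, expert) assignments, not on how they are spread across
# experts. All sizes below are illustrative.
d_model, d_ffn = 1024, 4096
flops_per_assignment = 2 * 2 * d_model * d_ffn   # up-proj + down-proj, 2 FLOPs per MAC

def total_flops(tokens_per_expert):
    return sum(tokens_per_expert) * flops_per_assignment

balanced   = [1024] * 8                                  # 8192 assignments, even
imbalanced = [4096, 2048, 1024, 512, 256, 128, 64, 64]   # 8192 assignments, skewed
assert total_flops(balanced) == total_flops(imbalanced)  # identical total compute
```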

I'm not questioning whether to turn load balancing on; it should be on for better quality. I'm wondering why the actual speed shows a difference.

shawntan commented 5 months ago

As far as my understanding of GPU programming goes, the difference is likely due to caching: with an imbalanced load some experts are reused far more often, and reusing an expert should be faster because its weight block is already in the cache.

Without more information that'd be my guess for where the 'speedup' might be coming from.
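One rough way to probe the caching hypothesis on its own, outside of scattermoe, is a microbenchmark like the following. This is purely hypothetical: it assumes a CUDA device, and the sizes and iteration counts are made up.

```python
# Illustrative microbenchmark: repeatedly multiplying against ONE weight matrix
# keeps that weight hot in cache, while cycling through MANY weight matrices
# keeps evicting it, mimicking imbalanced vs. balanced expert reuse.
import time
import torch

d_model, d_ffn, n_experts = 1024, 4096, 32
n_chunks, chunk = 256, 32
weights = [torch.randn(d_model, d_ffn, device="cuda") for _ in range(n_experts)]
xs = [torch.randn(chunk, d_model, device="cuda") for _ in range(n_chunks)]

def run(pick_weight):
    torch.cuda.synchronize()
    t0 = time.time()
    for i, x in enumerate(xs):
        _ = x @ pick_weight(i)
    torch.cuda.synchronize()
    return time.time() - t0

run(lambda i: weights[0])                        # warm-up
hot  = run(lambda i: weights[0])                 # "imbalanced": one expert reused
cold = run(lambda i: weights[i % n_experts])     # "balanced": weights keep changing
print(f"single weight: {hot*1e3:.2f} ms, cycling weights: {cold*1e3:.2f} ms")
```

A gap between the two timings would be consistent with the weight-caching explanation, though a real MoE kernel obviously interleaves work quite differently.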