awzhgw opened this issue 9 months ago (status: Open)
@tohtana can you help me ???
same here
same here
Found the potential cause: some experts don't see any tokens during training, so they have no gradients, and all the other processes get stuck waiting on them. After feeding a fake gradient to the experts that don't see any tokens, training runs smoothly.
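A rough way to confirm this on your own run: the helper below is a minimal sketch (assuming the standard Mixtral top-2 gate, shapes and names are illustrative) that counts how many tokens the router assigns to each expert in a batch. If any count is 0 on some rank, that expert runs no forward/backward there.

```python
import torch

# Hypothetical diagnostic (illustrative, not part of the HF code):
# `router_logits` is the (num_tokens, num_experts) tensor produced by the Mixtral gate.
def tokens_per_expert(router_logits: torch.Tensor, num_experts: int, top_k: int = 2) -> torch.Tensor:
    routing_weights = torch.softmax(router_logits, dim=-1)
    _, selected_experts = torch.topk(routing_weights, top_k, dim=-1)  # (num_tokens, top_k)
    # bincount over the flattened expert indices; experts with count 0 saw no tokens
    return torch.bincount(selected_experts.flatten(), minlength=num_experts)
```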
Could you please provide an example of how to feed a fake gradient to the experts? Much appreciated! @hanxiaotian
Something like the modification below to the expert loop in the HF Mixtral implementation (`MixtralSparseMoeBlock.forward`):
```python
for expert_idx in range(self.num_experts):
    expert_layer = self.experts[expert_idx]
    idx, top_x = torch.where(expert_mask[expert_idx])

    if top_x.shape[0] == 0:
        if self.training:
            # This expert received no tokens. Run it on a zeroed dummy token so it
            # still participates in forward/backward; otherwise it produces no
            # gradients and the other ranks block in their collective ops.
            top_x_ = torch.zeros(1).to(hidden_states.device).to(torch.int32)
            top_x_list = top_x_.tolist()
            current_state = hidden_states[None, top_x_list].reshape(-1, hidden_dim)
            fake_state = expert_layer(current_state * 0)
            final_hidden_states.index_add_(
                0, top_x_, fake_state.to(hidden_states.dtype)
            )
        else:
            continue
    else:
        # in torch it is faster to index using lists than torch tensors
        top_x_list = top_x.tolist()
        idx_list = idx.tolist()

        # Index the correct hidden states and compute the expert hidden state for
        # the current expert. We need to make sure to multiply the output hidden
        # states by `routing_weights` on the corresponding tokens (top-1 and top-2)
        current_state = hidden_states[None, top_x_list].reshape(-1, hidden_dim)
        current_hidden_states = (
            expert_layer(current_state)
            * routing_weights[top_x_list, idx_list, None]
        )

        # However `index_add_` only support torch tensors for indexing so we'll use
        # the `top_x` tensor here.
        final_hidden_states.index_add_(
            0, top_x, current_hidden_states.to(hidden_states.dtype)
        )
```
Hope this can help.
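A note on why multiplying the input by zero is safe here (my reading, not stated above): the Mixtral expert MLP is a bias-free SwiGLU, so a zero input yields a zero output, and the `index_add_` into row 0 adds nothing to the real result while still keeping the unused expert's parameters in the autograd graph. A quick standalone check with a stand-in expert (illustrative class, not the HF one):

```python
import torch
import torch.nn as nn

# Stand-in for a Mixtral expert: a SwiGLU MLP without biases (illustrative only).
class ExpertMLP(nn.Module):
    def __init__(self, hidden_dim=16, ffn_dim=32):
        super().__init__()
        self.w1 = nn.Linear(hidden_dim, ffn_dim, bias=False)  # gate projection
        self.w3 = nn.Linear(hidden_dim, ffn_dim, bias=False)  # up projection
        self.w2 = nn.Linear(ffn_dim, hidden_dim, bias=False)  # down projection
        self.act = nn.SiLU()

    def forward(self, x):
        return self.w2(self.act(self.w1(x)) * self.w3(x))

expert = ExpertMLP()
out = expert(torch.zeros(1, 16))
print(torch.allclose(out, torch.zeros_like(out)))  # True: the fake pass contributes nothing
```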
NCCL timed out while training with ZeRO-3. How can I solve this problem?
My model is based on Mixtral 8x7B, reusing the Llama architecture, and augmented with multi-modal capabilities for video and audio.
The architecture of my model is as follows:
After initializing the model, I have already called `deepspeed.utils.set_z3_leaf_modules(model, [MixtralSparseMoeBlock])` followed by `print('model z3_leaf_model is ', deepspeed.utils.get_z3_leaf_modules(model))`.
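For context, roughly how those calls fit together (a sketch assuming `model` is the already-constructed multi-modal Mixtral model; import paths are the standard DeepSpeed and HF Transformers ones):

```python
import deepspeed
from transformers.models.mixtral.modeling_mixtral import MixtralSparseMoeBlock

# `model` is assumed to be the multi-modal Mixtral model built earlier.
# Mark the sparse MoE block as a ZeRO-3 "leaf" module so its parameters are
# gathered as a unit instead of being fetched expert by expert.
deepspeed.utils.set_z3_leaf_modules(model, [MixtralSparseMoeBlock])
print('model z3_leaf_model is ', deepspeed.utils.get_z3_leaf_modules(model))
```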
The printed result is as follows:
The training process is as follows:
Scenario 1: When I train with DeepSpeed ZeRO-3 and the training data contains only images, there are no issues and training proceeds normally.
Scenario 2: When I train with DeepSpeed ZeRO-3 and the training data contains both images and videos, training gets stuck after 270 steps with an ongoing NCCL timeout.
The error message is as follows.
While NCCL was stuck, I captured the point at which the Python process was hanging:
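For anyone reproducing this, one low-overhead way to capture such a stack (a sketch; the report does not say which tool was actually used) is Python's built-in faulthandler, which dumps every thread's stack when the process receives a chosen signal, e.g. `kill -USR1 <pid>` on the hung rank:

```python
import faulthandler
import signal

# Register early in the training script (e.g. right after argument parsing).
# When a rank hangs in an NCCL collective, `kill -USR1 <pid>` on that rank
# writes the Python stack of all threads to stderr without stopping the process.
faulthandler.register(signal.SIGUSR1, all_threads=True)
```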