Hi @oroojlooy!
Thanks for the issue. I think you should rather run the script with `accelerate launch`. First run:
accelerate config
and make sure to select the multi-node setup!
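Once configured, launching the example would look something like this (the script path is just an assumption here; point it at wherever reward_summarization.py lives for you):
accelerate launch reward_summarization.py --bf16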
cc @lvwerra, who has some experience with multi-node training using trl
Hi @younesbelkada!
I am not using both GPUs, so I was not sure whether I need to use `accelerate launch`. I am getting the error with `--nproc_per_node=1`.
Also, I got the run command from the README of the corresponding example in the TRL package.
Do you think the issue is because I have two GPU nodes available on the machine? If so, would setting `CUDA_VISIBLE_DEVICES` help?
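For concreteness, by that I mean pinning the run to a single device, e.g. something like:
CUDA_VISIBLE_DEVICES=0 python -m torch.distributed.run --nproc_per_node=1 reward_summarization.py --bf16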
I don't think it is related to `accelerate launch`; I hit the same issue while using GPT-2 and GPT-2-medium models.
@bingjie3216 @lvwerra @younesbelkada Would you mind sharing the versions of the Python packages (e.g. torch, accelerate, deepspeed, transformers) with which the TRL examples work for you?
I am also running into this error with reward_summarization.py using the following command:
python -m torch.distributed.run --nproc_per_node=1 /dccstor/srosent2/trl/trl/examples/summarization/scripts/reward_summarization.py --bf16
python=3.10.0, torch=2.0.0, transformers=4.28.1, cuda 12
I enabled anomaly detection and it complained about this line in modeling_gpt2.py (line 201):
attn_weights = torch.where(causal_mask, attn_weights.to(attn_weights.dtype), mask_value)
Any suggestions?
@seirasto `torch.autograd.set_detect_anomaly(True)` points to the same line for me as well.
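For anyone who wants to reproduce this, a minimal sketch of how the flag is enabled (the training call below is just a placeholder for whatever the script runs):

```python
import torch

# Ask autograd to record forward-pass stack traces so the backward error
# points at the exact line that produced the offending tensor.
# Note: this slows training down noticeably, so use it only for debugging.
torch.autograd.set_detect_anomaly(True)

# ... then build the reward model / trainer as in reward_summarization.py
# and call trainer.train() as usual.
```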
It looks like we are facing the exact same issue. Are you all using the same package versions? It would be great if someone could share a set that works.
I am running it with Python 3.8.16 and CUDA 11.7. My package versions are:
I was able to get around the bug by modifying the problematic line in modeling_gpt2.py to use `clone()`, so no in-place operations occur there:
attn_weights = torch.where(causal_mask.clone(), attn_weights.to(attn_weights.dtype).clone(), mask_value)
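For background (a generic illustration, not the GPT-2 code itself): the error autograd raises here comes from modifying, in place, a tensor that was saved for the backward pass.

```python
import torch

x = torch.ones(3, requires_grad=True)
y = torch.sigmoid(x)   # sigmoid saves its output y for use in backward()
y += 1                 # in-place write bumps y's version counter

try:
    y.sum().backward()
except RuntimeError as e:
    # "one of the variables needed for gradient computation has been
    #  modified by an inplace operation ..."
    print(e)
```

Cloning gives the backward pass its own copy of the tensor, which is presumably why the `clone()` workaround dodges the version check.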
@seirasto Thanks for letting me know!
Do you have any intuition for why the `clone()` on `causal_mask` is required? It does not have any relationship with `attn_weights`, so it seems like it should not affect the gradient there.
No, but I tried with `clone()` just on `attn_weights` and it didn't work. I haven't tried doing the `clone()` for just `causal_mask`.
I tried it on everything except `causal_mask` and it did not work. That is why I asked for the intuition about that.
same error
I don't have a clear understanding of the cause of this issue per se, but the problem stems from the fact that we run two forward passes (one for `rewards_j` and one for `rewards_k`) to compute the loss function, and somehow the GPT models don't like that. Here's a minimal workaround that doesn't involve making changes to `transformers.models`:
First, replace the current `RewardDataCollatorWithPadding` with the following. We merge the two batches into one.
```python
@dataclass
class RewardDataCollatorWithPadding:
    tokenizer: AutoTokenizer
    padding: Union[bool, str, PaddingStrategy] = True
    max_length: Optional[int] = None
    pad_to_multiple_of: Optional[int] = None
    return_tensors: str = "pt"

    def __call__(self, features: List[Dict[str, Any]]) -> Dict[str, Any]:
        # Interleave the "j" and "k" examples into a single batch so the
        # model only needs one forward pass.
        merged_features = []
        for feature in features:
            merged_features.append(
                {
                    "input_ids": feature["input_ids_j"],
                    "attention_mask": feature["attention_mask_j"],
                }
            )
            merged_features.append(
                {
                    "input_ids": feature["input_ids_k"],
                    "attention_mask": feature["attention_mask_k"],
                }
            )
        batch = self.tokenizer.pad(
            merged_features,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors=self.return_tensors,
        )
        batch = {
            "input_ids": batch["input_ids"],
            "attention_mask": batch["attention_mask"],
            "return_loss": True,
        }
        return batch
```
Then, replace `compute_loss` with the following. We split the model predictions back into `rewards_j` and `rewards_k` after a single forward pass and compute the loss function.
```python
class RewardTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False):
        # One forward pass over the interleaved batch produced by the collator.
        rewards = model(
            input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
        )[0]
        bsz = rewards.size(0)
        # Even rows are the "j" examples, odd rows the "k" examples,
        # because the collator appends them in that order.
        jidx = torch.arange(0, bsz, 2)
        kidx = jidx + 1
        rewards_j = rewards[jidx]
        rewards_k = rewards[kidx]
        loss = -nn.functional.logsigmoid(rewards_j - rewards_k).mean()
        if return_outputs:
            return loss, {"rewards_j": rewards_j, "rewards_k": rewards_k}
        return loss
```
This should work for GPT-2 and GPT-NeoX models!
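As a quick sanity check of the interleaving assumption (toy numbers, not real model outputs): since the collator appends the j example and then the k example of each pair, the rewards come back as [j0, k0, j1, k1, ...], so the even/odd indexing in `compute_loss` recovers the pairs:

```python
import torch

# Toy rewards for 3 pairs, laid out as [j0, k0, j1, k1, j2, k2]
rewards = torch.tensor([[0.8], [0.1], [0.3], [0.2], [0.5], [0.9]])

bsz = rewards.size(0)
jidx = torch.arange(0, bsz, 2)   # tensor([0, 2, 4]) -> the j rewards
kidx = jidx + 1                  # tensor([1, 3, 5]) -> the k rewards

loss = -torch.nn.functional.logsigmoid(rewards[jidx] - rewards[kidx]).mean()
print(loss)  # scalar pairwise ranking loss over the 3 pairs
```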
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Just noting that I am also hitting the same in-place issues with the same models, and (very oddly) it only happens when using DDP. Single-GPU, single-node raises no error.
Planning to do a deep dive in the next few weeks into issues with distributed training; assigning this to myself.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Hi there, I believe this is now fixed on transformers, trl and peft main. Please have a look at this comment for how to fix the issue: https://github.com/huggingface/trl/issues/835#issuecomment-1787704901
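Until releases containing the fix are out, installing the three libraries from their main branches would look something like this:
pip install git+https://github.com/huggingface/transformers.git
pip install git+https://github.com/huggingface/trl.git
pip install git+https://github.com/huggingface/peft.git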
I am getting the following error traceback when I run
python -m torch.distributed.launch --nproc_per_node=1 reward_summarization.py --bf16
on a machine with two A10 (24GB) nodes. I have torch==2.0.0 installed. I appreciate any comment/idea on how to fix this.