cc @muellerzr @SunMarc!
Hey @qmin2, can you share your accelerate config? I've seen the same issue you're facing reported in other posts, so it might be relevant.
Sorry for the late reply.
This is my accelerate config:
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_config_file: /home/qmin2/3rd_semester_research/mixed_tokens/ds_config.json
  zero3_init_flag: false
distributed_type: DEEPSPEED
downcast_bf16: 'no'
enable_cpu_affinity: false
machine_rank: 0
main_training_function: main
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
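(For reference, the same setup expressed programmatically would look roughly like the sketch below; the config path is the one from my config above, and everything else is just illustrative, not my actual script.)

from accelerate import Accelerator
from accelerate.utils import DeepSpeedPlugin

# Rough programmatic equivalent of the accelerate config above: point the
# DeepSpeed plugin at the same JSON file and keep zero3_init_flag disabled.
ds_plugin = DeepSpeedPlugin(
    hf_ds_config="/home/qmin2/3rd_semester_research/mixed_tokens/ds_config.json",
    zero3_init_flag=False,
)
accelerator = Accelerator(deepspeed_plugin=ds_plugin)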
And this is my DeepSpeed config (ds_config.json):
{
  "bf16": {
    "enabled": true
  },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": 1e-5,
      "weight_decay": 1e-5,
      "torch_adam": true,
      "adam_w_mode": true
    }
  },
  "scheduler": {
    "type": "WarmupCosineLR",
    "params": {
      "total_num_steps": 7500,
      "warmup_min_ratio": 0.1
    }
  },
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "allgather_partitions": true,
    "allgather_bucket_size": 2e8,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": "auto",
    "contiguous_gradients": true
  },
  "gradient_accumulation_steps": 1,
  "gradient_clipping": "auto",
  "steps_per_print": 2000,
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "wall_clock_breakdown": false
}
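For completeness: because the config file defines "optimizer" and "scheduler", accelerate expects DummyOptim / DummyScheduler placeholders at prepare() time, roughly like this (illustrative sketch, not my exact script; model and train_dataloader stand in for my actual objects):

from accelerate.utils import DummyOptim, DummyScheduler

# Placeholders: the real AdamW / WarmupCosineLR are built by DeepSpeed from
# the "optimizer" and "scheduler" sections of ds_config.json at prepare() time.
optimizer = DummyOptim(model.parameters(), lr=1e-5, weight_decay=1e-5)
lr_scheduler = DummyScheduler(optimizer)

model, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
    model, optimizer, train_dataloader, lr_scheduler
)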
I've also run into another, similar issue. I am using a custom 4D attention mask with LlamaForCausalLM and passing it as a model input. The model (Llama 3.1) is configured with bfloat16, and I hit an issue with scaled_dot_product_attention at the following line:
attn_output = torch.nn.functional.scaled_dot_product_attention(
    query_states,
    key_states,
    value_states,
    attn_mask=causal_mask,
    dropout_p=self.attention_dropout if self.training else 0.0,
    is_causal=is_causal,
)
The error message I get is a dtype mismatch between query_states and the attention bias. To resolve this, I converted my custom attention_mask to bfloat16 to match the Llama 3.1 model's dtype. After that change the previous error disappears, but a new issue arises during the backward pass at accelerator.backward(loss):
RuntimeError: linalg.vector_norm: Expected a floating point or complex tensor as input. Got Long
I suspect this issue is related to the activation of the causal_mask in LlamaSdpaAttention: the same error occurs whenever padding is present in the input and the causal mask is therefore active (not None).
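One way to express the dtype fix I mentioned (a sketch of the idea, not necessarily my exact code; the helper name and mask layout are assumptions):

import torch

def to_additive_mask(mask_4d: torch.Tensor, dtype: torch.dtype) -> torch.Tensor:
    # mask_4d: (batch, 1, q_len, kv_len), nonzero where attention is allowed.
    # Build an additive bias in the model dtype: 0.0 for allowed positions and
    # the most negative representable value for masked positions, so SDPA gets
    # an attn_mask whose dtype matches query_states (bfloat16 here) instead of
    # a bool/int mask.
    additive = torch.zeros(mask_4d.shape, dtype=dtype, device=mask_4d.device)
    return additive.masked_fill(~mask_4d.bool(), torch.finfo(dtype).min)

That makes the dtype mismatch go away, but the Long-tensor error in backward remains.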
System Info
transformers == 4.45
torch == 2.4.1+cu118
accelerate == 1.0.1
Expected behavior
I'm using PyTorch 2.4.1+cu118 and transformers 4.45, training with a batch size of 2 on 2 NVIDIA A100-80GB GPUs. When padding appeared in a batch, the attention_mask in LlamaSdpaAttention was activated (i.e. not None at this step).
After the torch.nn.functional.scaled_dot_product_attention operation, I encountered the following error at this line:
accelerator.backward(loss)
RuntimeError: linalg.vector_norm: Expected a floating point or complex tensor as input. Got Long
For now, I've worked around this by skipping batches that include padding, but I would like to understand the root cause and potential solutions for this issue.
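The skip itself looks roughly like this in my loop (illustrative sketch; the variable names and the 2D attention_mask in the batch are assumptions about a standard setup, not my exact code):

for batch in train_dataloader:
    # Temporary workaround, not a fix: skip any batch that contains padding,
    # i.e. whose 2D attention_mask has at least one zero entry, since those
    # are the batches that trigger the Long-tensor error in backward.
    if (batch["attention_mask"] == 0).any():
        continue
    outputs = model(**batch)
    loss = outputs.loss
    accelerator.backward(loss)
    optimizer.step()
    lr_scheduler.step()
    optimizer.zero_grad()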