huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

RuntimeError: linalg.vector_norm: Expected a floating point or complex tensor as input. Got Long #34573

Open qmin2 opened 2 weeks ago

qmin2 commented 2 weeks ago

System Info

transformers == 4.45
torch == 2.4.1+cu118
accelerate == 1.0.1

Who can help?

No response

Information

Tasks

Reproduction

import torch
from datasets import load_dataset
from torch.utils.data import DataLoader
from accelerate import Accelerator
from transformers import AutoConfig, AutoTokenizer, LlamaForCausalLM
from tqdm import tqdm
from time import perf_counter

# `args` (batch_size, num_epochs, disable_tqdm), `optimizer`, and `lr_scheduler`
# are defined elsewhere in the training script.
dataset = load_dataset("pg19")
dataloader = {
    split: DataLoader(dataset[split], batch_size=args.batch_size, shuffle=(split == 'train'),
                      pin_memory=True) for split in ['train', 'validation', 'test']}

accelerator = Accelerator()
device = accelerator.device
model_name = "meta-llama/Meta-Llama-3.1-8B-Instruct"
config = AutoConfig.from_pretrained(model_name)  # config definition not shown in the original snippet
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})  # alternatively: tokenizer.pad_token = tokenizer.eos_token
model = LlamaForCausalLM.from_pretrained(model_name, config=config, torch_dtype=torch.bfloat16).to(device)
model.resize_token_embeddings(len(tokenizer))  # account for the newly added [PAD] token
train_dataloader, eval_dataloader, model, optimizer, lr_scheduler = accelerator.prepare(
    dataloader["train"], dataloader["validation"], model, optimizer, lr_scheduler
)

for epoch in range(1, args.num_epochs + 1):
    start_time = perf_counter()

    model.train()
    train_loss = 0

    for idx, batch in enumerate(tqdm(train_dataloader, disable=args.disable_tqdm)):
        # Tokenize on the fly; "longest" padding adds pad tokens whenever
        # sequence lengths within the batch differ.
        inputs = tokenizer(batch['text'], padding="longest", truncation=True, max_length=2200,
                           return_tensors='pt', return_token_type_ids=False).to(device)

        # Use the input ids as labels and set padded positions to -100 so the
        # loss ignores them.
        inputs['labels'] = inputs['input_ids'].clone()
        label_mask = inputs['attention_mask'].bool()
        inputs['labels'][~label_mask] = -100

        loss = model(**inputs).loss

        accelerator.backward(loss)

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
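
As the inline comment above hints, an alternative that avoids adding a new [PAD] token (and the resize_token_embeddings call) is to reuse the EOS token for padding; a minimal sketch:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # EOS is already in the vocab, so no embedding resize is needed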

Expected behavior

I'm using PyTorch 2.4.1+cu118 and transformers 4.45, training with a batch size of 2 on two NVIDIA A100 80GB GPUs. When padding appeared in a batch, the attention_mask in LlamaSdpaAttention was activated (i.e., not None at this step):

causal_mask = attention_mask
if attention_mask is not None:
    causal_mask = causal_mask[:, :, :, : key_states.shape[-2]]

After the torch.nn.functional.scaled_dot_product_attention operation, I encountered the following error at the line accelerator.backward(loss):

RuntimeError: linalg.vector_norm: Expected a floating point or complex tensor as input. Got Long

For now, I've worked around this by skipping batches that include padding, but I would like to understand the root cause and potential solutions for this issue.
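
For reference, the RuntimeError itself is just torch.linalg.vector_norm rejecting an integer tensor, and the current workaround amounts to skipping any batch whose attention mask contains padding; a simplified sketch (not the exact training-loop code):

import torch

# linalg.vector_norm only accepts floating point or complex inputs, which is
# exactly what the error message says:
t = torch.ones(4, dtype=torch.long)
# torch.linalg.vector_norm(t)               # raises: "... Got Long"
print(torch.linalg.vector_norm(t.float()))  # works once the tensor is floating point

# Workaround sketch: skip batches that contain padding, so the SDPA mask
# branch is never taken.
def batch_has_padding(attention_mask: torch.Tensor) -> bool:
    return bool((attention_mask == 0).any())

# inside the training loop:
# if batch_has_padding(inputs['attention_mask']):
#     continue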

Rocketknight1 commented 2 weeks ago

cc @muellerzr @SunMarc !

SunMarc commented 2 weeks ago

Hey @qmin2, can you share your accelerate config? I've seen other posts reporting the same issue you are facing, so it may be relevant.

qmin2 commented 1 week ago

Sorry for the late reply.

This is my accelerate config:

compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_config_file: /home/qmin2/3rd_semester_research/mixed_tokens/ds_config.json
  zero3_init_flag: false
distributed_type: DEEPSPEED
downcast_bf16: 'no'
enable_cpu_affinity: false
machine_rank: 0
main_training_function: main
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

This is my DeepSpeed config (ds_config.json):

{
    "bf16": {
        "enabled": true
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": 1e-5,
            "weight_decay": 1e-5,
            "torch_adam": true,
            "adam_w_mode": true
        }
    },
    "scheduler": {
        "type": "WarmupCosineLR",
        "params":{
            "total_num_steps" : 7500,
            "warmup_min_ratio" : 0.1
        }
    },
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "allgather_partitions": true,
        "allgather_bucket_size": 2e8,
        "overlap_comm": true,
        "reduce_scatter": true,
        "reduce_bucket_size": "auto",
        "contiguous_gradients": true
    },
    "gradient_accumulation_steps": 1,
    "gradient_clipping": "auto",
    "steps_per_print": 2000,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}
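
Since the optimizer and scheduler are declared in the DeepSpeed config file above, the accelerate side of the script only creates placeholder objects and lets prepare() build the real AdamW optimizer and WarmupCosineLR scheduler; a rough sketch of that part (an assumption about the setup, not code copied from the actual script):

from accelerate.utils import DummyOptim, DummyScheduler

# Placeholders that tell Accelerate/DeepSpeed to instantiate the optimizer and
# scheduler from the DeepSpeed config during accelerator.prepare().
optimizer = DummyOptim(model.parameters(), lr=1e-5, weight_decay=1e-5)
lr_scheduler = DummyScheduler(optimizer, total_num_steps=7500)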

I also run into another, similar issue. I am using a custom 4D attention mask in LlamaForCausalLM and passing it as input. The model (Llama 3.1) is configured with bfloat16, and I hit an issue with scaled_dot_product_attention at the following call:

attn_output = torch.nn.functional.scaled_dot_product_attention(
    query_states,
    key_states,
    value_states,
    attn_mask=causal_mask,
    dropout_p=self.attention_dropout if self.training else 0.0,
    is_causal=is_causal,
)

The error message I get is a dtype mismatch between query_states and attention_bias. To resolve this, I converted my custom attention_mask to bfloat16 to match the Llama 3.1 model's dtype. After making this change, the previous error disappears, but a new issue arises during the backward pass with accelerator.backward(loss):

RuntimeError: linalg.vector_norm: Expected a floating point or complex tensor as input. Got Long

I suspect this issue is related to the causal_mask being activated in LlamaSdpaAttention: the same error occurs whenever padding is present in the input and the causal mask is not None.
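
For what it's worth, here is a minimal sketch of turning a boolean 4D keep-mask into an additive bfloat16 bias that matches the model's dtype (the mask shape and the True-means-attend convention are illustrative assumptions, not necessarily the exact conversion used here):

import torch

# Hypothetical boolean 4D mask of shape (batch, 1, q_len, kv_len);
# True = attend, False = masked out (assumed convention).
causal = torch.ones(16, 16).tril().bool()
bool_mask = causal[None, None, :, :].expand(2, 1, 16, 16)

# Additive bias in the compute dtype: 0 where attention is allowed,
# a very large negative value where it is masked.
bias = torch.zeros(2, 1, 16, 16, dtype=torch.bfloat16)
bias = bias.masked_fill(~bool_mask, torch.finfo(torch.bfloat16).min)

# SDPA accepts either a boolean mask or a floating-point additive bias;
# a float mask must match the dtype of query/key/value.
q = k = v = torch.randn(2, 8, 16, 64, dtype=torch.bfloat16)
out = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=bias)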