MIkumikumi0116 opened this issue 1 year ago
Update:
I added a line:
```python
input_token = torch.randint_like(input_token, low=0, high=vocab_size, dtype=torch.int64, device=engine.device)
```
so that the training code becomes:
```python
for input_token, target_token, _ in train_dataloader:
    input_token = input_token.to(engine.device)
    target_token = target_token.to(engine.device)
    input_token = torch.randint_like(input_token, low=0, high=vocab_size, dtype=torch.int64, device=engine.device)
    pred_logits = engine(input_token)
    target_pos = (target_token != PAD_ID)  # only calculate loss on non-pad tokens
    pred_logits = pred_logits[target_pos].view(-1, model_arg.vocab_size)
    target_token = target_token[target_pos].view(-1)
    loss = torch.nn.functional.cross_entropy(pred_logits, target_token)
    engine.backward(loss)
    engine.step()
```
And the error `Trying to backward through the graph a second time` still exists. I can't think of a way to explain it.
Mistakenly clicked on close.
I am hitting the same issue; my huggingface version is 4.28.1 and my deepspeed version is 0.9.1.
In my case, the error reported above occurs when fine-tuning Llama-2 with deepspeed enabled.
Describe the bug: When I am training my model, everything goes well in the first batch, but an error occurs in the second batch:
RuntimeError: Trying to backward through the graph a second time (or directly access saved tensors after they have already been freed). Saved intermediate values of the graph are freed when you call .backward() or autograd.grad(). Specify retain_graph=True if you need to backward through the graph a second time or if you need to access saved tensors after calling backward.
To Reproduce: I am fine-tuning Llama-2 using my own implementation of LoRA. Each LoRA linear in my Llama-2 has a different LoRA rank, and each linear in my model is replaced by a LoRA linear along the lines sketched below.
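The issue does not show the actual replacement module, so the following is only a minimal sketch of a per-layer-rank LoRA wrapper; the name `LoraLinear` and its arguments are illustrative assumptions, not the original code.
```python
import torch
import torch.nn as nn

class LoraLinear(nn.Module):
    """Hypothetical LoRA wrapper: frozen base linear plus a low-rank update."""
    def __init__(self, base: nn.Linear, rank: int, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # freeze the pretrained weights
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)               # start as a no-op update
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))
```
Each `nn.Linear` in the model would then be swapped for a `LoraLinear` with its own `rank`, matching the "different LoRA rank per layer" description.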
My dataset:
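The dataset code itself is not shown, so the following is only a rough sketch consistent with the training loop above, which unpacks `input_token`, `target_token`, and one ignored field per sample; `TokenizedDataset` and its internals are assumptions.
```python
import torch
from torch.utils.data import Dataset

class TokenizedDataset(Dataset):
    """Hypothetical dataset yielding (input_token, target_token, extra) triples."""
    def __init__(self, token_id_lists):
        self.samples = token_id_lists  # each item: a list of token ids

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        ids = self.samples[idx]
        input_token = torch.tensor(ids[:-1], dtype=torch.int64)   # tokens fed to the model
        target_token = torch.tensor(ids[1:], dtype=torch.int64)   # next-token targets
        return input_token, target_token, idx                     # third value is ignored by the loop
```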
Collate function to pad tokens of a batch to same length
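A minimal version of such a collate function, assuming `PAD_ID` is the padding token id that the loss mask later filters out, could look like this (again an illustration, not the author's code):
```python
import torch
from torch.nn.utils.rnn import pad_sequence

PAD_ID = 0  # assumption; the training loop masks out positions equal to PAD_ID

def collate_fn(batch):
    input_tokens, target_tokens, extras = zip(*batch)
    input_tokens = pad_sequence(input_tokens, batch_first=True, padding_value=PAD_ID)
    # pad targets with PAD_ID too; the loss only looks at non-pad positions
    target_tokens = pad_sequence(target_tokens, batch_first=True, padding_value=PAD_ID)
    return input_tokens, target_tokens, extras
```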
Initialize model and deepspeed
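The initialization code is likewise omitted; a sketch using the public `deepspeed.initialize` API, with the dataset, collate function, and config names taken from the illustrative snippets in this issue rather than the author's code, might be:
```python
import deepspeed

# model: the LoRA-wrapped Llama-2 (construction not shown in the issue)
trainable_params = [p for p in model.parameters() if p.requires_grad]

engine, optimizer, train_dataloader, _ = deepspeed.initialize(
    model=model,
    model_parameters=trainable_params,
    training_data=train_dataset,   # e.g. the TokenizedDataset sketched above
    collate_fn=collate_fn,
    config=ds_config,              # the dict sketched under "Deepspeed config" below
)
```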
Deepspeed config
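The actual config is not reproduced in this report; an illustrative single-GPU configuration using standard DeepSpeed keys would look roughly like:
```python
# Illustrative values only, not the author's actual configuration.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 1,
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 2},
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
}
```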
Main code of training: essentially the loop shown in the update at the top of this issue, originally without the added `randint_like` line.
Expected behavior: Fix the error `Trying to backward through the graph a second time`.
ds_report output
Screenshots: No screenshots.
System info (please complete the following information):
OS: Ubuntu 20.04.4
GPU: 1x RTX 3090 Ti
Interconnects: No
Python version: 3.11.4
Launcher context
PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:64 deepspeed --include localhost:0 train.py
Docker context: No docker.
Additional context: Complete deepspeed output. In To Reproduce I simplified some of the code, so the error traceback differs slightly from what the simplified code would produce.
I would appreciate any help.