lm-sys / FastChat

An open platform for training, serving, and evaluating large language models. Release repo for Vicuna and Chatbot Arena.

FlanT5 training and zero tensors #1339

Open GenVr opened 1 year ago

GenVr commented 1 year ago

Hi, I'm training a FlanT5 model. Training completes successfully, but when I run a simple inference, the output is a tensor of zeros, so the prediction is empty.

Example:

import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

# path: fine-tuned checkpoint directory, query: prompt string (defined elsewhere)
device = torch.device("cuda")
tokenizer = AutoTokenizer.from_pretrained(path, use_fast=False)
model = T5ForConditionalGeneration.from_pretrained(path, low_cpu_mem_usage=True, torch_dtype=torch.float16).cuda()

tokenized_text = tokenizer(query, return_tensors="pt")
source_ids = tokenized_text["input_ids"].to(device, dtype=torch.long)
generated_ids = model.generate(input_ids=source_ids)

Output:

tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
       device='cuda:0')

I tried several training runs, with both flan-t5-xl and flan-t5-large, and on both my personal dataset and the dummy.json dataset.

This is my training configuration:

!python3 -m torch.distributed.run --nproc_per_node=6 fastchat/train/train_flant5.py \
    --model_name_or_path google/flan-t5-xl \
    --data_path playground/data/dummy.json \
    --fp16 True \
    --output_dir ./output \
    --num_train_epochs 5 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 2 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 99999 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --fsdp_transformer_layer_cls_to_wrap T5Block \
    --tf32 False \
    --fsdp "full_shard auto_wrap" \
    --model_max_length 256 \
    --gradient_checkpointing True \
    --preprocessed_path ./preprocessed_data/processed.json 

Any idea what's going on? Thank you.

merrymercy commented 1 year ago

cc @DachengLi1

DachengLi1 commented 1 year ago

@GenVr This is likely because PyTorch FSDP saves the T5 model incorrectly (if you print out the loaded model weights, the encoder embedding or decoder embedding is likely all zeros, which causes the final predictions to be all zeros). Can you try using our postprocessing function? There is another open issue addressing the same problem. Let me know if it works!
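For anyone hitting this, a quick way to verify the symptom described above is to print the loaded embedding weights and check whether they are all zeros. This is only a hedged diagnostic sketch ("./output" stands in for the saved checkpoint directory); the actual repair is the postprocessing function mentioned above.

import torch
from transformers import T5ForConditionalGeneration

# Load the saved checkpoint and report which embedding weights came back
# as all zeros; a zeroed embedding explains the all-zero generations.
model = T5ForConditionalGeneration.from_pretrained("./output")
for name, param in model.named_parameters():
    if "embed_tokens" in name or name == "shared.weight":
        print(name, "all zeros:", bool((param == 0).all()))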

GenVr commented 1 year ago

@DachengLi1 Thanks. I trained on GPUs with more memory and used the function after training, and I can now load the model correctly. However, I have another problem: during training, the loss and learning rate are both zero:

{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.04}                              
...                         
{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.32}                              
...                                                   
{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.67}                              
...                        
{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 1.02} 
...

It seems the network can't learn anything. My configuration is in the initial post. I trained on both the dummy.json dataset and a personal one, with the same results. Do you have any idea about it? Thanks.

DachengLi1 commented 1 year ago

@GenVr I ran into a similar issue where the learning rate was 0 on a small dataset. This is due to some integer flooring behavior in Hugging Face transformers. Can you try warmup_ratio=0 (or omit the argument) and let me know what happens?
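As a rough illustration of how a tiny dataset plus a warmup ratio can produce a logged learning rate of 0 (the numbers and the linear-warmup formula here are assumptions for illustration, not the exact transformers scheduler code):

import math

# Assumed numbers: a tiny dataset gives very few optimizer steps overall.
num_training_steps = 25
warmup_ratio = 0.03
num_warmup_steps = math.ceil(num_training_steps * warmup_ratio)  # -> 1

for step in range(3):
    # Linear warmup, then the base LR (ignoring the cosine decay).
    factor = min(1.0, step / max(1, num_warmup_steps))
    print(step, 2e-5 * factor)  # the first logged step shows 0.0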

GenVr commented 1 year ago

@DachengLi1 Thanks. I tried removing --warmup_ratio 0.03 and got this:

...                          
{'loss': 0.0, 'learning_rate': 2e-05, 'epoch': 0.07}
...                                                                            
{'loss': 0.0, 'learning_rate': 2e-05, 'epoch': 0.31}
...                                                   
{'loss': 0.0, 'learning_rate': 2e-05, 'epoch': 0.52}
...

Now the learning rate is non-zero, but the loss is always zero. With a batch size of 1, the loss is sometimes non-zero. I also tried changing the learning rate to 1e3, but after the first epoch the situation is the same.

DachengLi1 commented 1 year ago

@GenVr Nice to hear that! Let's keep bs=1 for now; I will look into whether bs>1 can cause other problems (I haven't really tested bs>1 because of GPU memory limits). Can you try bs=1 on your own dataset? The dummy dataset is composed of very simple questions (if you look into it, a lot of them are very similar), so you probably want to see whether this still happens on a more complex dataset.

DachengLi1 commented 1 year ago

BTW, remember to change the preprocessed path; otherwise it will read from the existing cache file instead of re-processing your new data.
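Equivalently, one could delete the stale cache so it gets regenerated on the next run; a one-line sketch, assuming the path used in the command above:

import os

# Remove the cached preprocessed data so the next run re-tokenizes the dataset.
cache_path = "./preprocessed_data/processed.json"
if os.path.exists(cache_path):
    os.remove(cache_path)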

GenVr commented 1 year ago

@DachengLi1 Thanks. I tried my personal dataset with both BS equal to 1 and greater than 1. The loss is always zero and the network seems to fail to train (it looks untrained). Maybe I should try a big public .json dataset and see what happens?

DachengLi1 commented 1 year ago

@GenVr Interesting... I haven't seen this before. Could you print an input/target tensor before it goes into the trainer to check the contents? Maybe the data is being processed the wrong way.
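A hypothetical diagnostic along those lines (the JSON layout and the input_ids/labels keys are assumptions about what train_flant5.py writes to --preprocessed_path; adjust to the file's actual structure):

import json
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-xl", use_fast=False)

# Decode one cached example to verify the prompt and target contain real text.
with open("./preprocessed_data/processed.json") as f:
    examples = json.load(f)

sample = examples[0]
print("input :", tokenizer.decode([t for t in sample["input_ids"] if t >= 0]))
print("target:", tokenizer.decode([t for t in sample["labels"] if t >= 0]))
print("masked label tokens:", sum(t == -100 for t in sample["labels"]))

If the decoded target comes back empty or almost entirely masked, that would point at the preprocessing step rather than the training loop.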

emnlpanon commented 1 year ago

Same problem (0 loss from start to finish), with both dummy.json and my own dataset.

richagadgil commented 1 year ago

Was this resolved? Same problem with a 0 loss.

leng-yue commented 1 year ago

same problem

jxmorris12 commented 1 year ago

same problem