mlvlab / Flipped-VQA

Large Language Models are Temporal and Causal Reasoners for Video Question Answering (EMNLP 2023)
https://ikodoh.github.io/flipped_vqa_demo.html
MIT License

Loss of `qav` and `vaq` becomes `nan` quickly. #23

Open chuanwise opened 1 month ago

chuanwise commented 1 month ago
Not using distributed mode
[18:17:54.955532] job dir: /home/23031212503/projects/Flipped-VQA
[18:17:54.955618] Namespace(batch_size=1,
epochs=5,
accum_iter=4,
llama_model_path='./pretrained/llama/',
model='7B',
adapter_layer=32,
adapter_len=10,
max_seq_len=650,
max_feats=10,
weight_decay=0.02,
lr=None,
blr=0.07,
min_lr=0.0,
warmup_epochs=2,
dataset='tvqa',
output_dir='./checkpoint/tvqa',
device='cuda',
seed=0,
resume='',
start_epoch=0,
num_workers=2,
pin_mem=True,
world_size=1,
local_rank=-1,
dist_on_itp=False,
dist_url='env://',
vaq=True,
qav=True,
bias=3.0,
tau=100.0,
sub=True,
distributed=False)
[18:18:16.740925] Num train data: 122039
[18:18:24.026051] Num val data: 15253
[18:18:24.039350] Using model: 7B
[18:18:24.041255] loading from pretrained/llama/7B/consolidated.00.pth
[18:19:13.553202] base lr: 7.00e-02
[18:19:13.553243] actual lr: 1.09e-03
[18:19:13.553254] accumulate grad iterations: 4
[18:19:13.553258] effective batch size: 4
[18:19:13.554187] AdamW (
Parameter Group 0
    amsgrad: False
    betas: (0.9, 0.95)
    capturable: False
    eps: 1e-08
    foreach: None
    lr: 0.00109375
    maximize: False
    weight_decay: 0.0

Parameter Group 1
    amsgrad: False
    betas: (0.9, 0.95)
    capturable: False
    eps: 1e-08
    foreach: None
    lr: 0.00109375
    maximize: False
    weight_decay: 0.02
)
[18:19:13.554305] Start training for 5 epochs
[18:19:17.576096] Epoch: [0]  [     0/122039]  eta: 5 days, 16:15:56  lr: 0.000000  loss: 5.6871 (5.6871)  vqa_loss: 1.4844 (1.4844)  vaq_loss: 1.8125 (1.8125)  qav_loss: 2.3903 (2.3903)  time: 4.0197  data: 0.7782  max mem: 37679
[18:19:23.617162] Loss is nan, stopping training

But according to the printed values, the loss is not nan.

The command is the training command from the README, with the distributed-training arguments removed:

python train.py --model 7B --max_seq_len 650 --batch_size 1 --epochs 5 --warmup_epochs 2 --bias 3 --tau 100. --max_feats 10 --dataset tvqa --blr 7e-2 --weight_decay 0.02 --output_dir ./checkpoint/tvqa --dataset tvqa --accum_iter 4 --sub --vaq --qav
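
Presumably the finiteness check in engine.py runs on every iteration while the metric logger only prints every few steps, so the step that actually produced the nan may simply never be printed. A minimal sketch of that pattern (an assumption about the loop structure, not the repository's exact code):

import math
import sys

def guard_step(loss_value: float, step: int, print_freq: int = 10) -> None:
    # Check every iteration, but only print every `print_freq` iterations,
    # so a nan can appear at a step that is never shown in the log.
    if not math.isfinite(loss_value):
        print("Loss is nan, stopping training")
        sys.exit(1)
    if step % print_freq == 0:
        print(f"step {step}: loss {loss_value:.4f}")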
chuanwise commented 1 month ago

I changed the code near engine.py:25 to:

vqa_loss, vaq_loss, qav_loss = model(data)
print(f"vqa_loss: {vqa_loss}, vaq_loss: {vaq_loss}, qav_loss: {qav_loss}")

And here is the log:

[19:52:42.854386] vqa_loss: 1.484375, vaq_loss: 1.8125, qav_loss: 2.3902640342712402
[19:52:43.300642] Epoch: [0]  [     0/122039]  eta: 4 days, 4:45:00  lr: 0.000000  loss: 5.6871 (5.6871)  vqa_loss: 1.4844 (1.4844)  vaq_loss: 1.8125 (1.8125)  qav_loss: 2.3903 (2.3903)  time: 2.9720  data: 0.8503  max mem: 37679
[19:52:43.623051] vqa_loss: 1.6328125, vaq_loss: 3.232421875, qav_loss: 2.3078742027282715
[19:52:44.408110] vqa_loss: 1.8759765625, vaq_loss: 2.876953125, qav_loss: 2.1542460918426514
[19:52:45.173799] vqa_loss: 1.55078125, vaq_loss: 1.603515625, qav_loss: 2.2987637519836426
[19:52:45.960194] vqa_loss: 1.5166015625, vaq_loss: 2.427734375, qav_loss: 2.203843355178833
[19:52:46.725876] vqa_loss: 1.6318359375, vaq_loss: 2.048828125, qav_loss: 2.2658228874206543
[19:52:47.497957] vqa_loss: 1.5791015625, vaq_loss: 1.6123046875, qav_loss: 2.2287609577178955
[19:52:48.268652] vqa_loss: 1.673828125, vaq_loss: 1.6904296875, qav_loss: 2.201247215270996
[19:52:49.053853] vqa_loss: 1.3828125, vaq_loss: 2.248046875, qav_loss: 2.5104265213012695
[19:52:49.822032] vqa_loss: 1.998046875, vaq_loss: 2.029296875, qav_loss: nan
[19:52:49.826290] Loss is nan, stopping training
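
This per-iteration log also explains the earlier output: qav_loss first turns nan around the tenth step, i.e. between the metric logger's prints. A slightly more targeted guard can report which term goes non-finite first and save the offending batch for offline inspection; a sketch around the same call site, assuming `model` and `data` are the objects already in the loop and `bad_batch.pt` is just an arbitrary dump path:

import torch

vqa_loss, vaq_loss, qav_loss = model(data)
for name, term in (("vqa", vqa_loss), ("vaq", vaq_loss), ("qav", qav_loss)):
    if not torch.isfinite(term).all():
        # Save the batch so the failing sample can be replayed in isolation.
        torch.save(data, "bad_batch.pt")
        raise RuntimeError(f"{name}_loss became non-finite at this step")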
ikodoh commented 1 month ago

If you are using one GPU rather than 8 GPUs, I recommend using --accum_iter 32, since the effective batch size is decreased by 8 times. Alternatively, you may use a lower blr. The current run seems to diverge because of the large blr combined with a small effective batch size.
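
For reference, the "actual lr" printed in the logs is consistent with the usual linear scaling rule, actual_lr = blr * effective_batch_size / 256, so either knob changes the step size proportionally. A small worked example (the 8-GPU line is an assumption about the intended multi-GPU setup):

# actual_lr = blr * effective_batch_size / 256
def actual_lr(blr, batch_size, accum_iter, world_size=1):
    return blr * batch_size * accum_iter * world_size / 256

print(actual_lr(7e-2, 1, 4))     # 0.00109375 -> matches "actual lr: 1.09e-03" above
print(actual_lr(1e-4, 1, 32))    # 1.25e-05   -> matches the second run below
print(actual_lr(7e-2, 1, 4, 8))  # 0.00875 with 8 GPUs (assumed README setting)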

chuanwise commented 1 month ago

I saw your comment last night and changed blr to 1e-4, but the problem still exists. Now I'm trying to use --accum_iter 32. 🤔

chuanwise commented 1 month ago

The problem still exists:

Not using distributed mode
[09:30:04.319382] job dir: /home/23031212503/projects/Flipped-VQA
[09:30:04.319483] Namespace(batch_size=1,
epochs=10,
accum_iter=32,
llama_model_path='./pretrained/llama/',
model='7B',
adapter_layer=32,
adapter_len=10,
max_seq_len=650,
max_feats=10,
weight_decay=0.02,
lr=None,
blr=0.0001,
min_lr=0.0,
warmup_epochs=2,
dataset='tvqa',
output_dir='./checkpoint/tvqa',
device='cuda',
seed=0,
resume='',
start_epoch=0,
num_workers=2,
pin_mem=True,
world_size=1,
local_rank=-1,
dist_on_itp=False,
dist_url='env://',
vaq=True,
qav=True,
bias=3.0,
tau=100.0,
sub=True,
distributed=False)
[09:30:29.629968] Num train data: 122039
[09:30:37.490100] Num val data: 15253
[09:30:37.506190] Using model: 7B
[09:30:37.514400] loading from pretrained/llama/7B/consolidated.00.pth
[09:31:27.752191] base lr: 1.00e-04
[09:31:27.752239] actual lr: 1.25e-05
[09:31:27.752249] accumulate grad iterations: 32
[09:31:27.752253] effective batch size: 32
[09:31:27.753421] AdamW (
Parameter Group 0
    amsgrad: False
    betas: (0.9, 0.95)
    capturable: False
    eps: 1e-08
    foreach: None
    lr: 1.25e-05
    maximize: False
    weight_decay: 0.0

Parameter Group 1
    amsgrad: False
    betas: (0.9, 0.95)
    capturable: False
    eps: 1e-08
    foreach: None
    lr: 1.25e-05
    maximize: False
    weight_decay: 0.02
)
[09:31:27.753595] Start training for 10 epochs
[09:31:31.378744] vqa_loss: 1.484375, vaq_loss: 1.8125, qav_loss: 2.3902640342712402
[09:31:31.814755] Epoch: [0]  [     0/122039]  eta: 5 days, 17:34:47  lr: 0.000000  loss: 5.6871 (5.6871)  vqa_loss: 1.4844 (1.4844)  vaq_loss: 1.8125 (1.8125)  qav_loss: 2.3903 (2.3903)  time: 4.0584  data: 0.8839  max mem: 37679
[09:31:32.117008] vqa_loss: 1.6328125, vaq_loss: 3.232421875, qav_loss: 2.3078742027282715
[09:31:32.972378] vqa_loss: 1.8759765625, vaq_loss: 2.876953125, qav_loss: 2.1542460918426514
[09:31:33.859110] vqa_loss: 1.55078125, vaq_loss: 1.603515625, qav_loss: 2.2987637519836426
[09:31:34.591741] vqa_loss: 1.5166015625, vaq_loss: 2.427734375, qav_loss: 2.203843355178833
[09:31:35.327070] vqa_loss: 1.6318359375, vaq_loss: 2.048828125, qav_loss: 2.2658228874206543
[09:31:36.051778] vqa_loss: 1.5791015625, vaq_loss: 1.6123046875, qav_loss: 2.2287609577178955
[09:31:36.778938] vqa_loss: 1.673828125, vaq_loss: 1.6904296875, qav_loss: 2.201247215270996
[09:31:37.510163] vqa_loss: 1.3828125, vaq_loss: 2.248046875, qav_loss: 2.5104265213012695
[09:31:38.236347] vqa_loss: 1.998046875, vaq_loss: 2.029296875, qav_loss: nan
[09:31:38.242391] Loss is nan, stopping training

Here is the command:

python train.py --model 7B \
    --max_seq_len 650 --batch_size 1 --epochs 10 --warmup_epochs 2 --bias 3 --tau 100. --max_feats 10 --dataset tvqa \
    --blr 1e-4 --weight_decay 0.02 --output_dir ./checkpoint/tvqa --dataset tvqa --accum_iter 32 --sub --vaq --qav
chuanwise commented 1 month ago

If I run the code with anomaly detection enabled:

with torch.autograd.detect_anomaly():
    vqa_loss, vaq_loss, qav_loss = model(data)

it can't detect where the nan comes from.

(video-question-answering) [23031212503@login01 Flipped-VQA]$ tail err.90258*
==> err.902582.log <==
/home/23031212503/projects/Flipped-VQA/engine.py:25: UserWarning: Anomaly Detection has been enabled. This mode will increase the runtime and should only be enabled for debugging.
  with torch.autograd.detect_anomaly():

==> err.902583.log <==
/home/23031212503/projects/Flipped-VQA/engine.py:25: UserWarning: Anomaly Detection has been enabled. This mode will increase the runtime and should only be enabled for debugging.
  with torch.autograd.detect_anomaly():

==> err.902584.log <==
/home/23031212503/projects/Flipped-VQA/engine.py:25: UserWarning: Anomaly Detection has been enabled. This mode will increase the runtime and should only be enabled for debugging.
  with torch.autograd.detect_anomaly():
(video-question-answering) [23031212503@login01 Flipped-VQA]$ tail out.90258*
==> out.902582.log <==
[09:43:20.634369] vqa_loss: 1.6328125, vaq_loss: 3.232421875, qav_loss: 2.3078742027282715
[09:43:21.808103] vqa_loss: 1.8759765625, vaq_loss: 2.876953125, qav_loss: 2.1542460918426514
[09:43:22.990304] vqa_loss: 1.55078125, vaq_loss: 1.603515625, qav_loss: 2.2987637519836426
[09:43:24.292305] vqa_loss: 1.5166015625, vaq_loss: 2.427734375, qav_loss: 2.203843355178833
[09:43:25.474479] vqa_loss: 1.6318359375, vaq_loss: 2.048828125, qav_loss: 2.2658228874206543
[09:43:26.666428] vqa_loss: 1.5791015625, vaq_loss: 1.6123046875, qav_loss: 2.2287609577178955
[09:43:27.858821] vqa_loss: 1.673828125, vaq_loss: 1.6904296875, qav_loss: 2.201247215270996
[09:43:29.167579] vqa_loss: 1.3828125, vaq_loss: 2.248046875, qav_loss: 2.5104265213012695
[09:43:30.356480] vqa_loss: 1.998046875, vaq_loss: 2.029296875, qav_loss: nan
[09:43:30.367149] Loss is nan, stopping training

==> out.902583.log <==
[09:48:09.904082] vqa_loss: 2.009765625, vaq_loss: 2.48828125, qav_loss: 2.3202672004699707
[09:48:11.400204] vqa_loss: 1.90234375, vaq_loss: 2.208984375, qav_loss: 2.3291122913360596
[09:48:12.899157] vqa_loss: 2.033203125, vaq_loss: 2.5859375, qav_loss: 2.3255414962768555
[09:48:14.365934] vqa_loss: 1.8984375, vaq_loss: 2.486328125, qav_loss: 2.328566789627075
[09:48:15.935122] vqa_loss: 1.8720703125, vaq_loss: 2.34375, qav_loss: 2.3454670906066895
[09:48:17.412503] vqa_loss: 1.9130859375, vaq_loss: 2.47265625, qav_loss: 2.3222508430480957
[09:48:18.888214] vqa_loss: 2.033203125, vaq_loss: 2.328125, qav_loss: 2.3213603496551514
[09:48:20.486417] vqa_loss: 1.91796875, vaq_loss: 2.65625, qav_loss: 2.34771728515625
[09:48:21.978990] vqa_loss: 1.939453125, vaq_loss: 2.6015625, qav_loss: nan
[09:48:21.999707] Loss is nan, stopping training

==> out.902584.log <==
[09:44:14.089460] Start training for 5 epochs
[09:44:16.338710] vqa_loss: 1.9150390625, vaq_loss: 1.8759765625, qav_loss: 2.407137393951416
[09:44:16.696111] Epoch: [0]  [   0/9233]  eta: 6:40:47  lr: 0.000000  loss: 6.1982 (6.1982)  vqa_loss: 1.9150 (1.9150)  vaq_loss: 1.8760 (1.8760)  qav_loss: 2.4071 (2.4071)  time: 2.6045  data: 0.1838  max mem: 32770
[09:44:17.516800] vqa_loss: 1.984375, vaq_loss: 1.0400390625, qav_loss: 2.2512660026550293
[09:44:18.640382] vqa_loss: 1.8154296875, vaq_loss: 1.4052734375, qav_loss: 2.313685655593872
[09:44:19.772095] vqa_loss: 1.72265625, vaq_loss: 1.5458984375, qav_loss: 2.22432017326355
[09:44:21.041153] vqa_loss: 1.9140625, vaq_loss: 1.4482421875, qav_loss: 2.2694244384765625
[09:44:22.171438] vqa_loss: 1.8427734375, vaq_loss: 2.1640625, qav_loss: 2.3169543743133545
[09:44:23.298976] vqa_loss: 1.81640625, vaq_loss: 1.9326171875, qav_loss: nan
[09:44:23.310836] Loss is nan, stopping training

🤔
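
One likely reason detect_anomaly stays silent here: torch.autograd.detect_anomaly only raises when a non-finite value shows up during the backward pass, and this loop exits on the math.isfinite check before backward() is ever called on the bad loss. Forward hooks catch the nan where it is first produced; a sketch to install once on the model before training (not code from the repository):

import torch

def install_nan_hooks(model: torch.nn.Module) -> None:
    # Raise as soon as any submodule's forward output contains a non-finite
    # value, naming the module -- something the backward-only anomaly mode
    # cannot report when training stops before backward().
    def make_hook(name):
        def hook(module, inputs, output):
            outs = output if isinstance(output, (tuple, list)) else (output,)
            for t in outs:
                if torch.is_tensor(t) and t.is_floating_point() and not torch.isfinite(t).all():
                    raise RuntimeError(f"non-finite output in module: {name}")
        return hook

    for name, module in model.named_modules():
        module.register_forward_hook(make_hook(name))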

ikodoh commented 1 month ago
[Screenshot: training log, 2024-08-01 at 11:13:03 PM]

In my environment, the model trains well with the command below:

python train.py --model 7B \
--max_seq_len 650 --batch_size 1 --epochs 5 --warmup_epochs 2 --bias 3 --tau 100. --max_feats 10 --dataset tvqa \
--blr 7e-2 --weight_decay 0.02 --output_dir ./checkpoint/tvqa --dataset tvqa --accum_iter 4 --sub --vaq --qav

Please make sure your machine's environment matches the one described in the README.md.
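
A quick way to compare environments is to print the parts most likely to affect numerics; the versions to match are whatever the repository's README/requirements specify (a small sketch, not project code):

import torch

print("torch:", torch.__version__)
print("cuda:", torch.version.cuda)
print("cudnn:", torch.backends.cudnn.version())
print("gpu:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "none")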