chuanwise opened this issue 1 month ago
I changed the code near engine.py:25 to:
vqa_loss, vaq_loss, qav_loss = model(data)
print(f"vqa_loss: {vqa_loss}, vaq_loss: {vaq_loss}, qav_loss: {qav_loss}")
And here is the log:
[19:52:42.854386] vqa_loss: 1.484375, vaq_loss: 1.8125, qav_loss: 2.3902640342712402
[19:52:43.300642] Epoch: [0] [ 0/122039] eta: 4 days, 4:45:00 lr: 0.000000 loss: 5.6871 (5.6871) vqa_loss: 1.4844 (1.4844) vaq_loss: 1.8125 (1.8125) qav_loss: 2.3903 (2.3903) time: 2.9720 data: 0.8503 max mem: 37679
[19:52:43.623051] vqa_loss: 1.6328125, vaq_loss: 3.232421875, qav_loss: 2.3078742027282715
[19:52:44.408110] vqa_loss: 1.8759765625, vaq_loss: 2.876953125, qav_loss: 2.1542460918426514
[19:52:45.173799] vqa_loss: 1.55078125, vaq_loss: 1.603515625, qav_loss: 2.2987637519836426
[19:52:45.960194] vqa_loss: 1.5166015625, vaq_loss: 2.427734375, qav_loss: 2.203843355178833
[19:52:46.725876] vqa_loss: 1.6318359375, vaq_loss: 2.048828125, qav_loss: 2.2658228874206543
[19:52:47.497957] vqa_loss: 1.5791015625, vaq_loss: 1.6123046875, qav_loss: 2.2287609577178955
[19:52:48.268652] vqa_loss: 1.673828125, vaq_loss: 1.6904296875, qav_loss: 2.201247215270996
[19:52:49.053853] vqa_loss: 1.3828125, vaq_loss: 2.248046875, qav_loss: 2.5104265213012695
[19:52:49.822032] vqa_loss: 1.998046875, vaq_loss: 2.029296875, qav_loss: nan
[19:52:49.826290] Loss is nan, stopping training
If you are using one GPU rather than 8 GPUs, then I recommend using `--accum_iter 32`, since the batch size is decreased by a factor of 8. Alternatively, you may use a lower `blr`. The current loss seems to diverge because the `blr` is large while the batch size is small.
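For reference, this is roughly the linear learning-rate scaling rule that MAE-style training scripts apply (a sketch, not the exact code in this repo; the function and variable names here are only illustrative):

```python
# Hypothetical sketch of the usual linear lr scaling rule: lr = blr * eff_batch_size / 256.
# Names (blr, batch_size, accum_iter, world_size) mirror the training arguments; the actual
# implementation in this repository may differ slightly.
def actual_lr(blr: float, batch_size: int, accum_iter: int, world_size: int) -> float:
    eff_batch_size = batch_size * accum_iter * world_size
    return blr * eff_batch_size / 256

# 8 GPUs, batch_size=1, accum_iter=4  -> effective batch size 32
# 1 GPU,  batch_size=1, accum_iter=32 -> same effective batch size 32, hence the same actual lr
print(actual_lr(1e-4, 1, 32, 1))  # 1.25e-05, matching the "actual lr" printed in the log below
```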
I saw your comment last night and changed `blr` to `1e-4`, but the problem still exists. Now I'm trying to use `--accum_iter 32`. 🤔
The problem still exists:
Not using distributed mode
[09:30:04.319382] job dir: /home/23031212503/projects/Flipped-VQA
[09:30:04.319483] Namespace(batch_size=1,
epochs=10,
accum_iter=32,
llama_model_path='./pretrained/llama/',
model='7B',
adapter_layer=32,
adapter_len=10,
max_seq_len=650,
max_feats=10,
weight_decay=0.02,
lr=None,
blr=0.0001,
min_lr=0.0,
warmup_epochs=2,
dataset='tvqa',
output_dir='./checkpoint/tvqa',
device='cuda',
seed=0,
resume='',
start_epoch=0,
num_workers=2,
pin_mem=True,
world_size=1,
local_rank=-1,
dist_on_itp=False,
dist_url='env://',
vaq=True,
qav=True,
bias=3.0,
tau=100.0,
sub=True,
distributed=False)
[09:30:29.629968] Num train data: 122039
[09:30:37.490100] Num val data: 15253
[09:30:37.506190] Using model: 7B
[09:30:37.514400] loading from pretrained/llama/7B/consolidated.00.pth
[09:31:27.752191] base lr: 1.00e-04
[09:31:27.752239] actual lr: 1.25e-05
[09:31:27.752249] accumulate grad iterations: 32
[09:31:27.752253] effective batch size: 32
[09:31:27.753421] AdamW (
Parameter Group 0
amsgrad: False
betas: (0.9, 0.95)
capturable: False
eps: 1e-08
foreach: None
lr: 1.25e-05
maximize: False
weight_decay: 0.0
Parameter Group 1
amsgrad: False
betas: (0.9, 0.95)
capturable: False
eps: 1e-08
foreach: None
lr: 1.25e-05
maximize: False
weight_decay: 0.02
)
[09:31:27.753595] Start training for 10 epochs
[09:31:31.378744] vqa_loss: 1.484375, vaq_loss: 1.8125, qav_loss: 2.3902640342712402
[09:31:31.814755] Epoch: [0] [ 0/122039] eta: 5 days, 17:34:47 lr: 0.000000 loss: 5.6871 (5.6871) vqa_loss: 1.4844 (1.4844) vaq_loss: 1.8125 (1.8125) qav_loss: 2.3903 (2.3903) time: 4.0584 data: 0.8839 max mem: 37679
[09:31:32.117008] vqa_loss: 1.6328125, vaq_loss: 3.232421875, qav_loss: 2.3078742027282715
[09:31:32.972378] vqa_loss: 1.8759765625, vaq_loss: 2.876953125, qav_loss: 2.1542460918426514
[09:31:33.859110] vqa_loss: 1.55078125, vaq_loss: 1.603515625, qav_loss: 2.2987637519836426
[09:31:34.591741] vqa_loss: 1.5166015625, vaq_loss: 2.427734375, qav_loss: 2.203843355178833
[09:31:35.327070] vqa_loss: 1.6318359375, vaq_loss: 2.048828125, qav_loss: 2.2658228874206543
[09:31:36.051778] vqa_loss: 1.5791015625, vaq_loss: 1.6123046875, qav_loss: 2.2287609577178955
[09:31:36.778938] vqa_loss: 1.673828125, vaq_loss: 1.6904296875, qav_loss: 2.201247215270996
[09:31:37.510163] vqa_loss: 1.3828125, vaq_loss: 2.248046875, qav_loss: 2.5104265213012695
[09:31:38.236347] vqa_loss: 1.998046875, vaq_loss: 2.029296875, qav_loss: nan
[09:31:38.242391] Loss is nan, stopping training
Here is the command:
python train.py --model 7B \
--max_seq_len 650 --batch_size 1 --epochs 10 --warmup_epochs 2 --bias 3 --tau 100. --max_feats 10 --dataset tvqa \
--blr 1e-4 --weight_decay 0.02 --output_dir ./checkpoint/tvqa --dataset tvqa --accum_iter 32 --sub --vaq --qav
If I run the code with anomaly detection enabled:
with torch.autograd.detect_anomaly():
    vqa_loss, vaq_loss, qav_loss = model(data)
it can't detect where the nan comes from.
(video-question-answering) [23031212503@login01 Flipped-VQA]$ tail err.90258*
==> err.902582.log <==
/home/23031212503/projects/Flipped-VQA/engine.py:25: UserWarning: Anomaly Detection has been enabled. This mode will increase the runtime and should only be enabled for debugging.
with torch.autograd.detect_anomaly():
==> err.902583.log <==
/home/23031212503/projects/Flipped-VQA/engine.py:25: UserWarning: Anomaly Detection has been enabled. This mode will increase the runtime and should only be enabled for debugging.
with torch.autograd.detect_anomaly():
==> err.902584.log <==
/home/23031212503/projects/Flipped-VQA/engine.py:25: UserWarning: Anomaly Detection has been enabled. This mode will increase the runtime and should only be enabled for debugging.
with torch.autograd.detect_anomaly():
(video-question-answering) [23031212503@login01 Flipped-VQA]$ tail out.90258*
==> out.902582.log <==
[09:43:20.634369] vqa_loss: 1.6328125, vaq_loss: 3.232421875, qav_loss: 2.3078742027282715
[09:43:21.808103] vqa_loss: 1.8759765625, vaq_loss: 2.876953125, qav_loss: 2.1542460918426514
[09:43:22.990304] vqa_loss: 1.55078125, vaq_loss: 1.603515625, qav_loss: 2.2987637519836426
[09:43:24.292305] vqa_loss: 1.5166015625, vaq_loss: 2.427734375, qav_loss: 2.203843355178833
[09:43:25.474479] vqa_loss: 1.6318359375, vaq_loss: 2.048828125, qav_loss: 2.2658228874206543
[09:43:26.666428] vqa_loss: 1.5791015625, vaq_loss: 1.6123046875, qav_loss: 2.2287609577178955
[09:43:27.858821] vqa_loss: 1.673828125, vaq_loss: 1.6904296875, qav_loss: 2.201247215270996
[09:43:29.167579] vqa_loss: 1.3828125, vaq_loss: 2.248046875, qav_loss: 2.5104265213012695
[09:43:30.356480] vqa_loss: 1.998046875, vaq_loss: 2.029296875, qav_loss: nan
[09:43:30.367149] Loss is nan, stopping training
==> out.902583.log <==
[09:48:09.904082] vqa_loss: 2.009765625, vaq_loss: 2.48828125, qav_loss: 2.3202672004699707
[09:48:11.400204] vqa_loss: 1.90234375, vaq_loss: 2.208984375, qav_loss: 2.3291122913360596
[09:48:12.899157] vqa_loss: 2.033203125, vaq_loss: 2.5859375, qav_loss: 2.3255414962768555
[09:48:14.365934] vqa_loss: 1.8984375, vaq_loss: 2.486328125, qav_loss: 2.328566789627075
[09:48:15.935122] vqa_loss: 1.8720703125, vaq_loss: 2.34375, qav_loss: 2.3454670906066895
[09:48:17.412503] vqa_loss: 1.9130859375, vaq_loss: 2.47265625, qav_loss: 2.3222508430480957
[09:48:18.888214] vqa_loss: 2.033203125, vaq_loss: 2.328125, qav_loss: 2.3213603496551514
[09:48:20.486417] vqa_loss: 1.91796875, vaq_loss: 2.65625, qav_loss: 2.34771728515625
[09:48:21.978990] vqa_loss: 1.939453125, vaq_loss: 2.6015625, qav_loss: nan
[09:48:21.999707] Loss is nan, stopping training
==> out.902584.log <==
[09:44:14.089460] Start training for 5 epochs
[09:44:16.338710] vqa_loss: 1.9150390625, vaq_loss: 1.8759765625, qav_loss: 2.407137393951416
[09:44:16.696111] Epoch: [0] [ 0/9233] eta: 6:40:47 lr: 0.000000 loss: 6.1982 (6.1982) vqa_loss: 1.9150 (1.9150) vaq_loss: 1.8760 (1.8760) qav_loss: 2.4071 (2.4071) time: 2.6045 data: 0.1838 max mem: 32770
[09:44:17.516800] vqa_loss: 1.984375, vaq_loss: 1.0400390625, qav_loss: 2.2512660026550293
[09:44:18.640382] vqa_loss: 1.8154296875, vaq_loss: 1.4052734375, qav_loss: 2.313685655593872
[09:44:19.772095] vqa_loss: 1.72265625, vaq_loss: 1.5458984375, qav_loss: 2.22432017326355
[09:44:21.041153] vqa_loss: 1.9140625, vaq_loss: 1.4482421875, qav_loss: 2.2694244384765625
[09:44:22.171438] vqa_loss: 1.8427734375, vaq_loss: 2.1640625, qav_loss: 2.3169543743133545
[09:44:23.298976] vqa_loss: 1.81640625, vaq_loss: 1.9326171875, qav_loss: nan
[09:44:23.310836] Loss is nan, stopping training
🤔
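One note that may help: `torch.autograd.detect_anomaly()` mainly flags non-finite values produced during the backward pass, so a nan that already appears in a forward-pass loss (like `qav_loss` here) will not be reported. Below is a minimal, hypothetical sketch of localizing it with forward hooks instead; `install_nan_hooks` is not part of this codebase, and the usage would need to be adapted to the actual training loop:

```python
import torch

# Hypothetical helper: register a forward hook on every submodule and report the first
# module whose output contains a non-finite value.
def install_nan_hooks(model):
    def make_hook(name):
        def hook(module, inputs, output):
            tensors = output if isinstance(output, (tuple, list)) else (output,)
            for t in tensors:
                if torch.is_tensor(t) and not torch.isfinite(t).all():
                    raise RuntimeError(f"non-finite output detected in module: {name}")
        return hook

    for name, module in model.named_modules():
        module.register_forward_hook(make_hook(name))

# Usage sketch (inside the training loop, before the failing step):
# install_nan_hooks(model)
# vqa_loss, vaq_loss, qav_loss = model(data)
```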
In my environment, the model trains well with the command below:
python train.py --model 7B \
--max_seq_len 650 --batch_size 1 --epochs 5 --warmup_epochs 2 --bias 3 --tau 100. --max_feats 10 --dataset tvqa \
--blr 7e-2 --weight_decay 0.02 --output_dir ./checkpoint/tvqa --dataset tvqa --accum_iter 4 --sub --vaq --qav
Please make sure the environment on your machine matches the one described in the README.md.
But according to the printed output, the loss is not nan.
The command is the training command from the README, with some arguments related to distributed training removed: