tig3rmast3r opened this issue 1 month ago
Hi!
Does this behavior occur across different random seeds too? E.g. if you restarted training using a different random seed, would you notice this same loss pattern again? Also, are the actual loss values (not the pattern) different every time you reshuffle the data or are they the same?
Also, what does the train/val loss curve look like across all iterations? Would you mind sharing some loss plots from TensorBoard?
About TensorBoard: unfortunately I always trash the logs folder for archived trainings, but I can provide the CSV below for the latest training I did on my PC. Every time there is a double value for the same iteration, it's because there was a resume (I restarted training iterations at some points, keeping only weights.pth). With larger batch sizes (16 or 24, when I use vast.ai and multi-GPU) I've sometimes noticed bigger patterns, like 3 bad and 1 good. I used rlrop (ReduceLROnPlateau) for the learning-rate decay on this training, but I've noticed this pattern in the past with Noam as well.
outputfix.csv
Lastly, just for testing, I resumed the above training changing the batch size from 4 to 2, and the issue is now much more evident. Here's the log (there's a resume at iteration 98922):
[03:44:14] Loading checkpoint from 3020sven-24-b2/latest decorators.py:220
[03:50:21] Saving to /home/tig3mast3r/vampnet11/vampnet decorators.py:220
Best model so far decorators.py:220
[03:50:27] ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ decorators.py:220
┃ Iteration 0 ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛
train
╷ ╷
key │ value │ mean
╶───────────────────────────────────────┼──────────────┼──────────────╴
accuracy-0-0.5/top1/masked │ 0.031342 │ 0.031342
accuracy-0-0.5/top1/unmasked │ 0.021346 │ 0.021346
accuracy-0-0.5/top25/masked │ 0.261384 │ 0.261384
accuracy-0-0.5/top25/unmasked │ 0.292282 │ 0.292282
accuracy-0.5-1.0/top1/masked │ 0.127371 │ 0.127371
accuracy-0.5-1.0/top1/unmasked │ 0.041262 │ 0.041262
accuracy-0.5-1.0/top25/masked │ 0.563686 │ 0.563686
accuracy-0.5-1.0/top25/unmasked │ 0.316748 │ 0.316748
loss │ 5.581545 │ 5.581545
other/batch_size │ 2.000000 │ 2.000000
other/grad_norm │ 1.059813 │ 1.059813
other/learning_rate │ 0.000090 │ 0.000090
time/train_loop │ 125.917442 │ 125.917442
╵ ╵
val
╷ ╷
key │ value │ mean
╶───────────────────────────────────────┼──────────────┼──────────────╴
loss │ 5.014778 │ 5.811567
accuracy-0-0.5/top1/unmasked │ nan │ 0.031041
accuracy-0-0.5/top1/masked │ nan │ 0.043669
accuracy-0-0.5/top25/unmasked │ nan │ 0.278061
accuracy-0-0.5/top25/masked │ nan │ 0.273002
accuracy-0.5-1.0/top1/unmasked │ 0.023385 │ 0.024148
accuracy-0.5-1.0/top1/masked │ 0.134094 │ 0.174240
accuracy-0.5-1.0/top25/unmasked │ 0.289532 │ 0.263240
accuracy-0.5-1.0/top25/masked │ 0.537803 │ 0.587320
time/val_loop │ 48.461155 │ 0.191444
╵ ╵
⠏ Iteration (train) 1/2473050 0:06:06 / -:--:--
⠏ Iteration (val) 0/583 0:00:00 / 0:00:00
[06:53:40] Saving to /home/tig3mast3r/vampnet11/vampnet decorators.py:220
Best model so far decorators.py:220
[06:53:51] ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ decorators.py:220
┃ Iteration 49461 ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛
train
╷ ╷
key │ value │ mean
╶───────────────────────────────────────┼──────────────┼──────────────╴
accuracy-0-0.5/top1/masked │ 0.056538 │ 0.055277
accuracy-0-0.5/top1/unmasked │ 0.027350 │ 0.043594
accuracy-0-0.5/top25/masked │ 0.359402 │ 0.304963
accuracy-0-0.5/top25/unmasked │ 0.305983 │ 0.344635
accuracy-0.5-1.0/top1/masked │ nan │ 0.169889
accuracy-0.5-1.0/top1/unmasked │ nan │ 0.031789
accuracy-0.5-1.0/top25/masked │ nan │ 0.578831
accuracy-0.5-1.0/top25/unmasked │ nan │ 0.295991
loss │ 5.743196 │ 5.374013
other/batch_size │ 2.000000 │ 2.000000
other/grad_norm │ 1.216359 │ 1.266699
other/learning_rate │ 0.000090 │ 0.000090
time/train_loop │ 0.179107 │ 0.178402
╵ ╵
val
╷ ╷
key │ value │ mean
╶───────────────────────────────────────┼──────────────┼──────────────╴
loss │ 4.348414 │ 5.526126
accuracy-0-0.5/top1/unmasked │ nan │ 0.029400
accuracy-0-0.5/top1/masked │ nan │ 0.041464
accuracy-0-0.5/top25/unmasked │ nan │ 0.275722
accuracy-0-0.5/top25/masked │ nan │ 0.264649
accuracy-0.5-1.0/top1/unmasked │ 0.020939 │ 0.022673
accuracy-0.5-1.0/top1/masked │ 0.262431 │ 0.147841
accuracy-0.5-1.0/top25/unmasked │ 0.230964 │ 0.251378
accuracy-0.5-1.0/top25/masked │ 0.664365 │ 0.545866
time/val_loop │ 0.056837 │ 0.108232
╵ ╵
⠏ Iteration (train) 49462/2473050 3:09:29 / 148:48:13
⠏ Iteration (val) 0/583 0:00:00 / 0:00:00
[09:57:01] Saving to /home/tig3mast3r/vampnet11/vampnet decorators.py:220
[09:57:06] ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ decorators.py:220
┃ Iteration 98922 ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛
train
╷ ╷
key │ value │ mean
╶───────────────────────────────────────┼──────────────┼──────────────╴
accuracy-0-0.5/top1/masked │ 0.117355 │ 0.054780
accuracy-0-0.5/top1/unmasked │ 0.035052 │ 0.043131
accuracy-0-0.5/top25/masked │ 0.510193 │ 0.303911
accuracy-0-0.5/top25/unmasked │ 0.385567 │ 0.334277
accuracy-0.5-1.0/top1/masked │ 0.125369 │ 0.198964
accuracy-0.5-1.0/top1/unmasked │ 0.024364 │ 0.032152
accuracy-0.5-1.0/top25/masked │ 0.505900 │ 0.618184
accuracy-0.5-1.0/top25/unmasked │ 0.309322 │ 0.299244
loss │ 5.131485 │ 5.686758
other/batch_size │ 2.000000 │ 2.000000
other/grad_norm │ 1.511617 │ 1.245741
other/learning_rate │ 0.000090 │ 0.000090
time/train_loop │ 0.179003 │ 0.178563
╵ ╵
val
╷ ╷
key │ value │ mean
╶───────────────────────────────────────┼──────────────┼──────────────╴
loss │ 6.730142 │ 5.826073
accuracy-0-0.5/top1/unmasked │ 0.000000 │ 0.029817
accuracy-0-0.5/top1/masked │ 0.002178 │ 0.042216
accuracy-0-0.5/top25/unmasked │ 0.000000 │ 0.281199
accuracy-0-0.5/top25/masked │ 0.063589 │ 0.266826
accuracy-0.5-1.0/top1/unmasked │ nan │ 0.023898
accuracy-0.5-1.0/top1/masked │ nan │ 0.178145
accuracy-0.5-1.0/top25/unmasked │ nan │ 0.262936
accuracy-0.5-1.0/top25/masked │ nan │ 0.589535
time/val_loop │ 0.057392 │ 0.109555
╵ ╵
⠙ Iteration (train) 98923/2473050 ╸ 6:12:45 / 144:34:17
⠙ Iteration (val) 0/583 0:00:00 / 0:00:00
[04:08:00] Loading checkpoint from 3020sven-24-b2/latest decorators.py:220
[04:14:59] Loading checkpoint from 3020sven-24-b2/latest decorators.py:220
[04:20:40] Saving to /home/tig3mast3r/vampnet11/vampnet decorators.py:220
[04:20:44] ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ decorators.py:220
┃ Iteration 98922 ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛
train
╷ ╷
key │ value │ mean
╶───────────────────────────────────────┼──────────────┼──────────────╴
accuracy-0-0.5/top1/masked │ 0.083383 │ 0.083383
accuracy-0-0.5/top1/unmasked │ 0.019704 │ 0.019704
accuracy-0-0.5/top25/masked │ 0.442342 │ 0.442342
accuracy-0-0.5/top25/unmasked │ 0.256158 │ 0.256158
accuracy-0.5-1.0/top1/masked │ 0.073171 │ 0.073171
accuracy-0.5-1.0/top1/unmasked │ 0.070388 │ 0.070388
accuracy-0.5-1.0/top25/masked │ 0.432927 │ 0.432927
accuracy-0.5-1.0/top25/unmasked │ 0.447816 │ 0.447816
loss │ 5.470975 │ 5.470975
other/batch_size │ 2.000000 │ 2.000000
other/grad_norm │ 1.042029 │ 1.042029
other/learning_rate │ 0.000090 │ 0.000090
time/train_loop │ 112.012772 │ 112.012772
╵ ╵
val
╷ ╷
key │ value │ mean
╶───────────────────────────────────────┼──────────────┼──────────────╴
loss │ 5.033357 │ 5.823171
accuracy-0-0.5/top1/unmasked │ nan │ 0.030750
accuracy-0-0.5/top1/masked │ nan │ 0.042924
accuracy-0-0.5/top25/unmasked │ nan │ 0.276792
accuracy-0-0.5/top25/masked │ nan │ 0.270614
accuracy-0.5-1.0/top1/unmasked │ 0.018931 │ 0.023358
accuracy-0.5-1.0/top1/masked │ 0.134807 │ 0.172273
accuracy-0.5-1.0/top25/unmasked │ 0.276169 │ 0.260654
accuracy-0.5-1.0/top25/masked │ 0.526391 │ 0.584483
time/val_loop │ 44.559895 │ 0.184471
╵ ╵
⠏ Iteration (train) 98923/2473050 ╸ 0:05:38 / -:--:--
⠏ Iteration (val) 0/583 0:00:00 / 0:00:00
[07:23:45] Saving to /home/tig3mast3r/vampnet11/vampnet decorators.py:220
[07:23:49] ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ decorators.py:220
┃ Iteration 148383 ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛
train
╷ ╷
key │ value │ mean
╶───────────────────────────────────────┼──────────────┼──────────────╴
accuracy-0-0.5/top1/masked │ 0.040847 │ 0.054950
accuracy-0-0.5/top1/unmasked │ 0.059829 │ 0.043299
accuracy-0-0.5/top25/masked │ 0.314819 │ 0.303498
accuracy-0-0.5/top25/unmasked │ 0.427350 │ 0.342117
accuracy-0.5-1.0/top1/masked │ nan │ 0.169778
accuracy-0.5-1.0/top1/unmasked │ nan │ 0.031349
accuracy-0.5-1.0/top25/masked │ nan │ 0.578397
accuracy-0.5-1.0/top25/unmasked │ nan │ 0.292735
loss │ 5.940801 │ 5.377327
other/batch_size │ 2.000000 │ 2.000000
other/grad_norm │ 1.548968 │ 1.267261
other/learning_rate │ 0.000090 │ 0.000090
time/train_loop │ 0.178283 │ 0.178510
╵ ╵
val
╷ ╷
key │ value │ mean
╶───────────────────────────────────────┼──────────────┼──────────────╴
loss │ 4.350002 │ 5.526492
accuracy-0-0.5/top1/unmasked │ nan │ 0.030643
accuracy-0-0.5/top1/masked │ nan │ 0.041276
accuracy-0-0.5/top25/unmasked │ nan │ 0.284138
accuracy-0-0.5/top25/masked │ nan │ 0.264920
accuracy-0.5-1.0/top1/unmasked │ 0.022208 │ 0.022672
accuracy-0.5-1.0/top1/masked │ 0.261050 │ 0.148421
accuracy-0.5-1.0/top25/unmasked │ 0.229695 │ 0.251670
accuracy-0.5-1.0/top25/masked │ 0.679558 │ 0.545708
time/val_loop │ 0.059183 │ 0.108220
╵ ╵
⠸ Iteration (train) 148384/2473050 ╸ 3:08:43 / 141:29:35
⠸ Iteration (val) 0/583 0:00:00 / 0:00:00
[10:26:51] Saving to /home/tig3mast3r/vampnet11/vampnet decorators.py:220
[10:26:55] ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ decorators.py:220
┃ Iteration 197844 ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛
train
╷ ╷
key │ value │ mean
╶───────────────────────────────────────┼──────────────┼──────────────╴
accuracy-0-0.5/top1/masked │ 0.107989 │ 0.054989
accuracy-0-0.5/top1/unmasked │ 0.016495 │ 0.044490
accuracy-0-0.5/top25/masked │ 0.479890 │ 0.304197
accuracy-0-0.5/top25/unmasked │ 0.274227 │ 0.336267
accuracy-0.5-1.0/top1/masked │ 0.146755 │ 0.199176
accuracy-0.5-1.0/top1/unmasked │ 0.031780 │ 0.032188
accuracy-0.5-1.0/top25/masked │ 0.573746 │ 0.618924
accuracy-0.5-1.0/top25/unmasked │ 0.309322 │ 0.299661
loss │ 5.113540 │ 5.685070
other/batch_size │ 2.000000 │ 2.000000
other/grad_norm │ 1.068045 │ 1.274413
other/learning_rate │ 0.000081 │ 0.000081
time/train_loop │ 0.178242 │ 0.178444
╵ ╵
val
╷ ╷
key │ value │ mean
╶───────────────────────────────────────┼──────────────┼──────────────╴
loss │ 6.739600 │ 5.825361
accuracy-0-0.5/top1/unmasked │ 0.000000 │ 0.028623
accuracy-0-0.5/top1/masked │ 0.000871 │ 0.042174
accuracy-0-0.5/top25/unmasked │ 0.000000 │ 0.264988
accuracy-0-0.5/top25/masked │ 0.060105 │ 0.266872
accuracy-0.5-1.0/top1/unmasked │ nan │ 0.023251
accuracy-0.5-1.0/top1/masked │ nan │ 0.178347
accuracy-0.5-1.0/top25/unmasked │ nan │ 0.255530
accuracy-0.5-1.0/top25/masked │ nan │ 0.590395
time/val_loop │ 0.058676 │ 0.107920
╵ ╵
⠇ Iteration (train) 197845/2473050 ━ 6:11:49 / 138:37:12
⠇ Iteration (val) 0/583 0:00:00 / 0:00:00
[13:30:11] Saving to /home/tig3mast3r/vampnet11/vampnet decorators.py:220
[13:30:15] ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ decorators.py:220
┃ Iteration 247305 ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛
train
╷ ╷
key │ value │ mean
╶───────────────────────────────────────┼──────────────┼──────────────╴
accuracy-0-0.5/top1/masked │ nan │ 0.055067
accuracy-0-0.5/top1/unmasked │ nan │ 0.043970
accuracy-0-0.5/top25/masked │ nan │ 0.304022
accuracy-0-0.5/top25/unmasked │ nan │ 0.345185
accuracy-0.5-1.0/top1/masked │ 0.223970 │ 0.170392
accuracy-0.5-1.0/top1/unmasked │ 0.032293 │ 0.031604
accuracy-0.5-1.0/top25/masked │ 0.698482 │ 0.579534
accuracy-0.5-1.0/top25/unmasked │ 0.319666 │ 0.293879
loss │ 4.319188 │ 5.374097
other/batch_size │ 2.000000 │ 2.000000
other/grad_norm │ 1.391422 │ 1.309109
other/learning_rate │ 0.000081 │ 0.000081
time/train_loop │ 0.176921 │ 0.179162
╵ ╵
val
╷ ╷
key │ value │ mean
╶───────────────────────────────────────┼──────────────┼──────────────╴
loss │ 6.733348 │ 5.532683
accuracy-0-0.5/top1/unmasked │ 0.000000 │ 0.029171
accuracy-0-0.5/top1/masked │ 0.004367 │ 0.041208
accuracy-0-0.5/top25/unmasked │ 0.300000 │ 0.280217
accuracy-0-0.5/top25/masked │ 0.080349 │ 0.265034
accuracy-0.5-1.0/top1/unmasked │ nan │ 0.022550
accuracy-0.5-1.0/top1/masked │ nan │ 0.146544
accuracy-0.5-1.0/top25/unmasked │ nan │ 0.250341
accuracy-0.5-1.0/top25/masked │ nan │ 0.543372
time/val_loop │ 0.058948 │ 0.108363
╵ ╵
⠋ Iteration (train) 247306/2473050 ━╸ 9:15:09 / 136:07:23
⠋ Iteration (val) 0/583 0:00:00 / 0:00:00
[16:33:36] Saving to /home/tig3mast3r/vampnet11/vampnet decorators.py:220
[16:33:41] ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ decorators.py:220
┃ Iteration 296766 ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛
train
╷ ╷
key │ value │ mean
╶───────────────────────────────────────┼──────────────┼──────────────╴
accuracy-0-0.5/top1/masked │ 0.006087 │ 0.055028
accuracy-0-0.5/top1/unmasked │ nan │ 0.044229
accuracy-0-0.5/top25/masked │ 0.106957 │ 0.304107
accuracy-0-0.5/top25/unmasked │ nan │ 0.338769
accuracy-0.5-1.0/top1/masked │ 0.333333 │ 0.200244
accuracy-0.5-1.0/top1/unmasked │ 0.018735 │ 0.032348
accuracy-0.5-1.0/top25/masked │ 0.769697 │ 0.619920
accuracy-0.5-1.0/top25/unmasked │ 0.242623 │ 0.300489
loss │ 6.452039 │ 5.684814
other/batch_size │ 2.000000 │ 2.000000
other/grad_norm │ 0.435003 │ 1.324875
other/learning_rate │ 0.000073 │ 0.000073
time/train_loop │ 0.179577 │ 0.179045
╵ ╵
val
╷ ╷
key │ value │ mean
╶───────────────────────────────────────┼──────────────┼──────────────╴
loss │ 4.054376 │ 5.821749
accuracy-0-0.5/top1/unmasked │ nan │ 0.030596
accuracy-0-0.5/top1/masked │ nan │ 0.042155
accuracy-0-0.5/top25/unmasked │ nan │ 0.277808
accuracy-0-0.5/top25/masked │ nan │ 0.269246
accuracy-0.5-1.0/top1/unmasked │ 0.016103 │ 0.023027
accuracy-0.5-1.0/top1/masked │ 0.270023 │ 0.176346
accuracy-0.5-1.0/top25/unmasked │ 0.224906 │ 0.253517
accuracy-0.5-1.0/top25/masked │ 0.750572 │ 0.587035
time/val_loop │ 0.057713 │ 0.108714
╵ ╵
⠏ Iteration (train) 296767/2473050 ━╸ 12:18:35 / 132:37:49
⠏ Iteration (val) 0/583 0:00:00 / 0:00:00
Another strange thing: if I redo a validation on the checkpoint that had a 5.52 loss, I get a 5.8-ish one. It looks like the "good" results are somewhat "fake".
Here's a graph from the CSV (redundant entries cleaned up). Below is another graph from an older training; the issue almost disappears with larger batch sizes. And another one:
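In case anyone wants to reproduce the graphs from outputfix.csv, they come from a short script along these lines. This is just a sketch: the column names `iteration` and `val_loss` are my assumptions about the CSV layout, adjust to match the actual file:

```python
import pandas as pd
import matplotlib.pyplot as plt

# assumed CSV layout: one row per validation, with columns
# "iteration" and "val_loss" (names are a guess, adjust to the file)
df = pd.read_csv("outputfix.csv")
# resumes write a second row for the same iteration; keep the last one
df = df.drop_duplicates(subset="iteration", keep="last")
plt.plot(df["iteration"], df["val_loss"], marker="o")
plt.xlabel("iteration")
plt.ylabel("validation loss")
plt.show()
```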
About validation loss, I did some more tests. Basically, the values reported during training can't be used as an exact metric to compare results. If I redo a validation (using --resume, keeping only weights.pth, without changing any other parameter), I get the same values every time I repeat the test, within ±0.001, and those results differ from the values reported by the train loop. Even with a higher batch size (I rented a 4x 4090 and resumed the above training, so batch size 20), which minimizes the issue above, the results differ. Here's a comparison between the values reported during training and the values from a comparable standalone validation:
step | val during training | comparable val | delta
-- | -- | -- | --
74k | 5.741 | 5.757 | -0.016
79k | 5.759 | 5.761 | -0.002
84k | 5.734 | 5.757 | -0.023
108k | 5.760 | 5.752 | 0.008
123k | 5.724 | 5.747 | -0.023
133k | 5.737 | 5.748 | -0.011
138k | 5.737 | 5.744 | -0.007

So looking at the values during training, the loss appears to go up and down; the truth is that it is less jumpy and the "good" values are not that good. This also means the "best" checkpoint detected by the train loop may not be the right one: in the example above it still marks 123k as the best, but in reality it has already been beaten by 138k. It would be great if we could use a fixed validation setting during training, detached from the train settings and unaffected by dropout, the current learning rate, or other parameters, in order to get a more consistent result. Lastly, this behavior affects both train and val loops, as I always get almost the same delta between train and val losses.
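To make the idea concrete, here is a minimal sketch of such a detached validation pass in plain PyTorch. It is only an illustration under assumptions: `model`, `val_loader`, and `compute_loss` are placeholders, not vampnet's actual API. The point is that eval mode, disabled gradients, and a fixed seed make the number reproducible for a given checkpoint:

```python
import torch

def detached_validation(model, val_loader, compute_loss, seed: int = 0) -> float:
    """Validation pass with fixed randomness, so the same checkpoint
    always yields the same loss. `model`, `val_loader` and `compute_loss`
    are placeholders, not vampnet's actual API."""
    torch.manual_seed(seed)        # fix mask sampling / any other RNG use
    model.eval()                   # disable dropout etc.
    total, batches = 0.0, 0
    with torch.no_grad():          # gradients off; training state can't leak in
        for batch in val_loader:   # loader built with shuffle=False
            total += compute_loss(model, batch).item()
            batches += 1
    model.train()                  # restore training mode for the caller
    return total / batches
```

With something like this, the checkpoint comparison in the table above would be apples to apples, regardless of where the train loop happens to be.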
Hi Hugo, I'm encountering a very strange behavior during training: the validations basically cycle, giving one higher loss followed by a lower loss, and so on. As an example, here are the validation losses from the latest training: 5.92, 5.89, 5.93, 5.88, 5.92, 5.88, 5.91, 5.86, 5.90.
The next one will probably be "good". Consider that the learning rate is fixed, since I'm using rlrop (ReduceLROnPlateau) as the scheduler. At first I thought there was something wrong with the AudioDataset shuffle from audiotools, so I disabled shuffle for the validation set and forced a reshuffle after each validation cycle, using the timestamp as seed to make sure each cycle would be different, but I still get this alternating behavior, one good and one bad. The dataset/train loss also follows this pattern whether or not I reshuffle, so I'm wondering if there is something else I'm not aware of, or if something doesn't work as expected during the shuffle.
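The forced reshuffle was along these lines (a rough sketch; feeding the shuffled indices back into the loader is loader-specific, this only shows the timestamp seeding):

```python
import random
import time

def reshuffle_indices(n_items: int) -> list[int]:
    """Reshuffle dataset indices with a wall-clock seed, so every
    validation cycle sees a genuinely different ordering."""
    seed = int(time.time())        # timestamp as seed, different each cycle
    rng = random.Random(seed)
    indices = list(range(n_items))
    rng.shuffle(indices)
    return indices
```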
In vampnet.yml I have the below settings:
AudioDataset.without_replacement: true
AudioLoader.shuffle: true
val/AudioLoader.shuffle: false
One training cycle is exactly 1 epoch (90k+ chunks).
What I have noticed from the console:
the output says shuffle is enabled on the AudioLoader, but on the AudioDataset it is False. I don't know if that is related.
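A crude way to double-check those flags (the attribute names here are guesses based on the console output, not a documented audiotools API):

```python
# quick diagnostic: print whatever shuffle-related attributes the
# dataset/loader objects expose; the attribute names are guesses
# based on the console output, not a documented audiotools API
def dump_shuffle_flags(obj, label: str) -> None:
    for name in ("shuffle", "without_replacement"):
        print(f"{label}.{name} = {getattr(obj, name, '<missing>')}")
```

e.g. calling `dump_shuffle_flags(dataset, "AudioDataset")` and `dump_shuffle_flags(loader, "AudioLoader")` right after they are built, where `dataset` and `loader` stand for whatever objects the train script creates.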
What else could I look for? It shouldn't behave like this, given the randomness of the provided training data. Thanks!