Hi, kimiyoung.
Your transformer-xl idea is interesting. I have some confusion about the git repo. The 1-billion-word experiment parameters seem to differ between the pytorch and tensorflow branches. First, mem_len and target_len are 32 in the pytorch branch but 256 in the tensorflow branch. Second, warmup_steps is 20000 in the pytorch branch but 0 in the tensorflow branch.
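To make the comparison concrete, here is my own side-by-side summary of the differing defaults (values as I read them from the two branches; this is just my summary, not a config file from the repo):

```python
# My own summary of the 1-billion-word defaults that seem to differ
# between the two branches (not copied from any file in the repo).
lm1b_defaults = {
    "pytorch":    {"target_len": 32,  "mem_len": 32,  "warmup_steps": 20000},
    "tensorflow": {"target_len": 256, "mem_len": 256, "warmup_steps": 0},
}
```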
I ran the default experiments on both branches. 1) When I ran the tensorflow repo, it hit OOM on a Tesla V100 32GB. If I change mem_len and target_len from 256 to 32 (the same as the pytorch branch), it runs successfully. 2) When I ran the pytorch repo, the ppl does not converge well once global_steps > warmup_steps. The experiment log follows.
| Eval 15 at step 60000 | time: 4820.44s | valid loss 4.14 | valid ppl 63.088
----------------------------------------------------------------------------------------------------
| epoch 1 step 60200 | 60200 batches | lr 0.000241 | ms/batch 1287.42 | loss 4.19 | ppl 65.826
| epoch 1 step 60400 | 60400 batches | lr 0.000241 | ms/batch 1192.32 | loss 4.19 | ppl 65.939
| epoch 1 step 60600 | 60600 batches | lr 0.000241 | ms/batch 1185.95 | loss 4.18 | ppl 65.057
| epoch 1 step 60800 | 60800 batches | lr 0.000241 | ms/batch 1231.65 | loss 4.18 | ppl 65.625
| epoch 1 step 61000 | 61000 batches | lr 0.000241 | ms/batch 1194.10 | loss 4.15 | ppl 63.581
| epoch 1 step 61200 | 61200 batches | lr 0.000241 | ms/batch 1199.46 | loss 4.14 | ppl 63.097
| epoch 1 step 61400 | 61400 batches | lr 0.000241 | ms/batch 1201.29 | loss 4.21 | ppl 67.300
| epoch 1 step 61600 | 61600 batches | lr 0.000241 | ms/batch 1199.24 | loss 4.17 | ppl 64.404
| epoch 1 step 61800 | 61800 batches | lr 0.000241 | ms/batch 1200.72 | loss 4.17 | ppl 64.666
| epoch 1 step 62000 | 62000 batches | lr 0.000241 | ms/batch 1236.83 | loss 4.17 | ppl 64.656
| epoch 1 step 62200 | 62200 batches | lr 0.000241 | ms/batch 1194.24 | loss 4.18 | ppl 65.127
| epoch 1 step 62400 | 62400 batches | lr 0.000241 | ms/batch 1200.48 | loss 4.14 | ppl 63.092
| epoch 1 step 62600 | 62600 batches | lr 0.00024 | ms/batch 1200.53 | loss 4.16 | ppl 64.340
| epoch 1 step 62800 | 62800 batches | lr 0.00024 | ms/batch 1198.63 | loss 4.22 | ppl 67.842
| epoch 1 step 63000 | 63000 batches | lr 0.00024 | ms/batch 1237.70 | loss 4.82 | ppl 123.808
| epoch 1 step 63200 | 63200 batches | lr 0.00024 | ms/batch 1201.79 | loss 4.22 | ppl 68.274
| epoch 1 step 63400 | 63400 batches | lr 0.00024 | ms/batch 1197.27 | loss 4.17 | ppl 64.419
| epoch 1 step 63600 | 63600 batches | lr 0.00024 | ms/batch 1198.99 | loss 4.14 | ppl 63.108
| epoch 1 step 63800 | 63800 batches | lr 0.00024 | ms/batch 1199.07 | loss 4.16 | ppl 63.895
| epoch 1 step 64000 | 64000 batches | lr 0.00024 | ms/batch 1202.39 | loss 4.18 | ppl 65.469
----------------------------------------------------------------------------------------------------
| Eval 16 at step 64000 | time: 4833.95s | valid loss 4.14 | valid ppl 62.631
----------------------------------------------------------------------------------------------------
| epoch 1 step 64200 | 64200 batches | lr 0.00024 | ms/batch 1340.08 | loss 4.91 | ppl 135.696
| epoch 1 step 64400 | 64400 batches | lr 0.00024 | ms/batch 1204.45 | loss 7.40 | ppl 1635.990
| epoch 1 step 64600 | 64600 batches | lr 0.00024 | ms/batch 1201.57 | loss 7.38 | ppl 1610.122
| epoch 1 step 64800 | 64800 batches | lr 0.00024 | ms/batch 1203.70 | loss 7.38 | ppl 1604.190
| epoch 1 step 65000 | 65000 batches | lr 0.00024 | ms/batch 1203.73 | loss 7.38 | ppl 1595.873
| epoch 1 step 65200 | 65200 batches | lr 0.00024 | ms/batch 1201.02 | loss 7.37 | ppl 1594.470
| epoch 1 step 65400 | 65400 batches | lr 0.00024 | ms/batch 1242.47 | loss 7.38 | ppl 1597.216
| epoch 1 step 65600 | 65600 batches | lr 0.00024 | ms/batch 1201.53 | loss 7.38 | ppl 1598.886
| epoch 1 step 65800 | 65800 batches | lr 0.000239 | ms/batch 1198.22 | loss 7.38 | ppl 1603.966
| epoch 1 step 66000 | 66000 batches | lr 0.000239 | ms/batch 1203.56 | loss 7.37 | ppl 1590.723
| epoch 1 step 66200 | 66200 batches | lr 0.000239 | ms/batch 1200.70 | loss 7.37 | ppl 1591.490
| epoch 1 step 66400 | 66400 batches | lr 0.000239 | ms/batch 1242.02 | loss 7.38 | ppl 1604.781
| epoch 1 step 66600 | 66600 batches | lr 0.000239 | ms/batch 1201.32 | loss 7.38 | ppl 1597.003
| epoch 1 step 66800 | 66800 batches | lr 0.000239 | ms/batch 1199.41 | loss 7.37 | ppl 1589.603
| epoch 1 step 67000 | 67000 batches | lr 0.000239 | ms/batch 1201.26 | loss 7.37 | ppl 1587.494
| epoch 1 step 67200 | 67200 batches | lr 0.000239 | ms/batch 1197.26 | loss 7.37 | ppl 1592.132
| epoch 1 step 67400 | 67400 batches | lr 0.000239 | ms/batch 1199.51 | loss 7.37 | ppl 1589.197
| epoch 1 step 67600 | 67600 batches | lr 0.000239 | ms/batch 1238.72 | loss 7.37 | ppl 1588.119
| epoch 1 step 67800 | 67800 batches | lr 0.000239 | ms/batch 1199.35 | loss 7.37 | ppl 1592.732
| epoch 1 step 68000 | 68000 batches | lr 0.000239 | ms/batch 1193.72 | loss 7.37 | ppl 1591.809
----------------------------------------------------------------------------------------------------
| Eval 17 at step 68000 | time: 4856.52s | valid loss 7.36 | valid ppl 1572.411
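For context, this is my mental model of the learning-rate schedule around warmup_steps: linear warmup followed by cosine decay. It is only a sketch with illustrative numbers and my own function name; the pytorch branch's actual scheduler logic may differ.

```python
import math

# Sketch of a linear-warmup + cosine-decay schedule (my own names and
# illustrative numbers; not copied from the repo's train.py).
def lr_at(step, base_lr=0.00025, warmup_steps=20000, max_steps=400000, min_lr=0.0):
    if step < warmup_steps:
        return base_lr * step / warmup_steps          # linear warmup
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# With these numbers the lr near step 64000 is about 0.00024, similar to my log,
# so the divergence does not look like a sudden lr jump right at warmup_steps.
print(lr_at(64000))
```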
I think I only need to make batch_size smaller, or reduce mem_len from 256 to 32, because of the GPU memory limitation. How do mem_len and target_len affect the experiment result? How can the 1-billion-word experiment be made to converge? Could you please give me some advice?
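For reference, here is my rough estimate of how the attention-score memory grows with target_len and mem_len. It is only a sketch assuming standard multi-head attention over the concatenated memory and current segment; the batch size and head count below are assumed values, and activations, dtypes, and the repo's actual layer layout are not accounted for.

```python
# Rough per-layer size of the attention-score tensor, assuming attention over
# [current segment + memory]. batch_size and n_head are assumed values,
# not the repo's defaults.
def attn_score_bytes(batch_size, n_head, tgt_len, mem_len, bytes_per_float=4):
    # score tensor shape: [batch, heads, tgt_len, tgt_len + mem_len]
    return batch_size * n_head * tgt_len * (tgt_len + mem_len) * bytes_per_float

# Going from tgt_len = mem_len = 32 to 256 grows this tensor ~64x, which is
# consistent with the OOM I hit on a 32GB V100 with the tensorflow defaults.
print(attn_score_bytes(32, 8, 256, 256) / attn_score_bytes(32, 8, 32, 32))  # 64.0
```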