princeton-nlp / calm-textgame

[EMNLP 2020] Keep CALM and Explore: Language Models for Action Generation in Text-based Games

Having issues getting training to converge - hyper parameter issue? #8

Open dayvidwang opened 6 months ago

dayvidwang commented 6 months ago

I'm attempting to run the training script for the GPT-2 CALM model on the ClubFloyd dataset, following the instructions from your EMNLP 2020 paper. I've set up my environment as recommended but am facing challenges with the training process.

Environment:

Python version: 3.6.15
Operating System: Ubuntu 20.04
GPU: Nvidia Titan RTX
Dependencies: torch==1.4, transformers==2.5.1, jericho, fasttext, wandb, importlib_metadata

Issue:

Training doesn't behave as expected: the model overfits the training data while validation performance barely improves, or even worsens, regardless of adjustments to hyperparameters such as batch size and GPU count.

Attempts:

| Params | Iteration | Train Acc | Val Acc | Train Loss | Val Loss |
|---|---|---|---|---|---|
| num GPU = 1, batch size = 1 | 1 | 0.14 | 0.15 | 2.38 | 2.35 |
| | 2 | 0.18 | 0.14 | 2.01 | 2.35 |
| | 3 | 0.22 | 0.15 | 1.80 | 2.43 |
| | 4 | 0.26 | 0.14 | 1.63 | 2.56 |
| | 5 | 0.30 | 0.14 | 1.50 | 2.71 |
| num GPU = 3, batch size = 1 | 1 | 0.13 | 0.14 | 0.79 | 2.30 |
| | 2 | 0.17 | 0.14 | 0.67 | 2.26 |
| | 3 | 0.20 | 0.15 | 0.61 | 2.28 |
| | 4 | 0.22 | 0.15 | 0.57 | 2.33 |
| | 5 | 0.25 | 0.14 | 0.53 | 2.38 |
| num GPU = 1, batch size = 15 | 1 | 0.10 | 0.13 | 0.18 | 2.32 |
| | 2 | 0.13 | 0.13 | 0.15 | 2.28 |
| | 3 | 0.15 | 0.14 | 0.14 | 2.27 |
| | 4 | 0.17 | 0.14 | 0.13 | 2.27 |
| | 5 | 0.18 | 0.14 | 0.13 | 2.31 |
| num GPU = 3, batch size = 15 | 1 | 0.10 | 0.12 | 0.06 | 2.35 |
| | 2 | 0.12 | 0.13 | 0.05 | 2.30 |
| | 3 | 0.14 | 0.13 | 0.05 | 2.29 |
| | 4 | 0.15 | 0.14 | 0.05 | 2.28 |
| | 5 | 0.16 | 0.13 | 0.05 | 2.27 |
| num GPU = 8, batch size = 12 | 1 | 0.09 | 0.11 | 0.03 | 2.41 |
| | 2 | 0.12 | 0.12 | 0.03 | 2.34 |
| | 3 | 0.13 | 0.13 | 0.02 | 2.31 |
| | 4 | 0.14 | 0.13 | 0.02 | 2.29 |
| | 5 | 0.14 | 0.14 | 0.02 | 2.29 |

Request:

Do you have any ideas on why these training runs might not be converging, whether due to hardware differences, hyperparameter settings, or something else?

Thank you for your time.

ysymyth commented 5 months ago

What do you mean by "not converging"? Also, if I remember correctly, the CALM model doesn't need a near-zero, or even converging, training loss to function. Maybe just follow the codebase, run the RL experiments, and look at the scores? The train/test losses of LMs are not that informative here.