princeton-nlp / calm-textgame

[EMNLP 2020] Keep CALM and Explore: Language Models for Action Generation in Text-based Games

Having issues getting training to converge - hyper parameter issue? #8

Open dayvidwang opened 6 months ago

dayvidwang commented 6 months ago

I'm attempting to run the training script for the GPT-2 CALM model on the ClubFloyd dataset, following the instructions from your EMNLP 2020 paper. I've set up my environment as recommended but am facing challenges with the training process.

Environment:

Python version: 3.6.15
Operating System: Ubuntu 20.04
GPU: Nvidia Titan RTX
Dependencies: torch==1.4, transformers==2.5.1, jericho, fasttext, wandb, importlib_metadata

Issue:

Training doesn't behave as expected: the model overfits the training data while validation performance barely improves, or even worsens, regardless of adjustments to hyperparameters such as batch size and GPU count.

Attempts:

| Params | Iteration | Train Acc | Val Acc | Train Loss | Val Loss |
|---|---|---|---|---|---|
| num GPU = 1, batch size = 1 | 1 | 0.14 | 0.15 | 2.38 | 2.35 |
| | 2 | 0.18 | 0.14 | 2.01 | 2.35 |
| | 3 | 0.22 | 0.15 | 1.80 | 2.43 |
| | 4 | 0.26 | 0.14 | 1.63 | 2.56 |
| | 5 | 0.30 | 0.14 | 1.50 | 2.71 |
| num GPU = 3, batch size = 1 | 1 | 0.13 | 0.14 | 0.79 | 2.30 |
| | 2 | 0.17 | 0.14 | 0.67 | 2.26 |
| | 3 | 0.20 | 0.15 | 0.61 | 2.28 |
| | 4 | 0.22 | 0.15 | 0.57 | 2.33 |
| | 5 | 0.25 | 0.14 | 0.53 | 2.38 |
| num GPU = 1, batch size = 15 | 1 | 0.10 | 0.13 | 0.18 | 2.32 |
| | 2 | 0.13 | 0.13 | 0.15 | 2.28 |
| | 3 | 0.15 | 0.14 | 0.14 | 2.27 |
| | 4 | 0.17 | 0.14 | 0.13 | 2.27 |
| | 5 | 0.18 | 0.14 | 0.13 | 2.31 |
| num GPU = 3, batch size = 15 | 1 | 0.10 | 0.12 | 0.06 | 2.35 |
| | 2 | 0.12 | 0.13 | 0.05 | 2.30 |
| | 3 | 0.14 | 0.13 | 0.05 | 2.29 |
| | 4 | 0.15 | 0.14 | 0.05 | 2.28 |
| | 5 | 0.16 | 0.13 | 0.05 | 2.27 |
| num GPU = 8, batch size = 12 | 1 | 0.09 | 0.11 | 0.03 | 2.41 |
| | 2 | 0.12 | 0.12 | 0.03 | 2.34 |
| | 3 | 0.13 | 0.13 | 0.02 | 2.31 |
| | 4 | 0.14 | 0.13 | 0.02 | 2.29 |
| | 5 | 0.14 | 0.14 | 0.02 | 2.29 |

Request:

Do you have any ideas on why these training runs might not be converging, whether due to hardware differences, hyperparameter settings, or something else?

Thank you for your time.

ysymyth commented 5 months ago

What do you mean by "not converging"? Also, if I remember correctly, the CALM model doesn't need a near-zero, or even converging, training loss to function. Maybe just follow the codebase, run the RL experiments, and look at the scores? The train/test losses of LMs are not that informative here.