utterworks / fast-bert

Super easy library for BERT based NLP models
Apache License 2.0
1.85k stars 342 forks

Can't update the train_batch_size and eval_batch_size for the training image in a docker container #193

Open tbs17 opened 4 years ago

tbs17 commented 4 years ago

I originally meant to create this issue here, but ended up filing it in the transformers GitHub repo by mistake.

I tried to train a couple of multi-label models with the fast-bert library, using the container files to build the Docker image, which I uploaded to AWS ECR, together with the AWS helper notebook included in the 'sample notebook' folder of the repo. I have trained 3 models, and regardless of what train_batch_size I set in the hyperparameters.json file, the training log still reports a total train batch size of 64 and an eval batch size of 128.
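For debugging, it can help to confirm what the container actually sees at train time. Below is a minimal sketch, assuming the standard SageMaker conventions that input channels are mounted under `/opt/ml/input/data/<channel>` and that Estimator hyperparameters land in `/opt/ml/input/config/hyperparameters.json` as strings; the exact channel name and which file the fast-bert train script reads are assumptions to verify against the helper notebook:

```python
import json
import os

# Hypothetical candidate locations; which one applies depends on whether the
# config is passed as an input channel or as Estimator hyperparameters.
CANDIDATE_PATHS = [
    "/opt/ml/input/data/config/hyperparameters.json",
    "/opt/ml/input/config/hyperparameters.json",
]

for path in CANDIDATE_PATHS:
    if os.path.exists(path):
        with open(path) as f:
            params = json.load(f)
        print(path, params)
        # SageMaker serializes Estimator hyperparameters as strings, so a
        # train_batch_size of 4 may arrive as "4" and need an explicit cast.
        print("train_batch_size:", int(params.get("train_batch_size", 64)))
```

If neither file reflects the values you wrote, the JSON is likely not reaching the container at all, which would explain the hard-coded-looking 64/128.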

My questions here:

Am I unable to update the train batch size when the training happens inside a container?

Do the train and eval batch sizes have some relationship? At a glance, it looks like eval_batch_size is double train_batch_size. I would say there shouldn't be any relationship, but then why is there no parameter in hyperparameters.json to specify eval_batch_size? (See the sketch below.)
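The observed 64/128 pair is consistent with the eval loader simply using twice the train batch size, a common pattern since evaluation stores no gradients. A minimal sketch of that pattern follows; this mirrors what the logged numbers suggest, not a confirmed reading of the fast-bert data-bunch code:

```python
from torch.utils.data import DataLoader, SequentialSampler

def build_loaders(train_ds, val_ds, train_batch_size=64):
    # Evaluation can afford a larger batch because no gradients are kept,
    # so doubling the training batch size is a common default.
    eval_batch_size = train_batch_size * 2  # 64 -> 128, matching the logs
    train_dl = DataLoader(train_ds, batch_size=train_batch_size, shuffle=True)
    val_dl = DataLoader(val_ds, sampler=SequentialSampler(val_ds),
                        batch_size=eval_batch_size)
    return train_dl, val_dl
```

If that is the case, setting eval_batch_size in hyperparameters.json would have no effect, because only the train size is consumed.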

The three models I have trained all reached a really good accuracy_thresh, above 0.97. However, one of the models only ever outputs 2 classes as the top-probability class. The original data has about 9455 rows and 113 classes. I also trained it with the BERT TensorFlow version and was able to get many different labels as the top predicted class, so what could possibly be wrong? Note that my other 2 models have about 36 and 11 classes, and their top predicted classes all came out reasonable, meaning all 36 and 11 classes showed up as the top predicted class. In addition, I don't see accuracy_thresh change after epoch 2.
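One caveat when interpreting accuracy_thresh on sparse multi-label data: it is an element-wise accuracy over all label positions, so with 113 classes and only a few positives per row, a model that predicts almost nothing can still score above 0.97. A quick back-of-the-envelope check (the thresholding here assumes the usual sigmoid-then-threshold convention):

```python
import torch

n_rows, n_classes, positives_per_row = 9455, 113, 2

# Hypothetical worst case: the model predicts no class at all.
preds = torch.zeros(n_rows, n_classes)   # post-threshold predictions
labels = torch.zeros(n_rows, n_classes)
labels[:, :positives_per_row] = 1        # ~2 true labels per row

# Element-wise accuracy, as a thresholded multi-label metric computes:
acc = (preds == labels).float().mean()
print(acc)  # (113 - 2) / 113 ≈ 0.982, despite predicting nothing useful
```

So a high and flat accuracy_thresh after epoch 2 does not by itself show the 113-class model has learned to separate the classes.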

Please provide some guidance, as this is going into deployment soon and I'm still struggling to figure out why.

============= Below is my hyperparameters.json file

```python
hyperparameters = {
    "epochs": 10,
    "lr": 3e-5,
    "max_seq_length": 512,
    "train_batch_size": 4,
    "eval_batch_size": 4,
    "lr_schedule": "warmup_cosine",
    "warmup_steps": 1000,
    "optimizer_type": "adamw",
}

with open(CONFIG_PATH/'hyperparameters.json', 'w') as f:
    json.dump(hyperparameters, f)
```
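As a sanity check before uploading, re-reading the file confirms the values actually written to disk (using the same CONFIG_PATH and json import as above):

```python
# Re-read the file just written to verify what will be uploaded.
with open(CONFIG_PATH/'hyperparameters.json') as f:
    on_disk = json.load(f)
assert on_disk["train_batch_size"] == 4 and on_disk["eval_batch_size"] == 4
```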