tae898 / erc

The official implementation of "EmoBERTa: Speaker-Aware Emotion Recognition in Conversation with RoBERTa"
MIT License

About parameters #22

Closed · XinyeDu1204 closed this issue 2 years ago

XinyeDu1204 commented 2 years ago

I'm sorry to bother you again. Can you tell me how to determine the settings of "epoch", "batch_size", and "HP_N_TRIALS" in the yaml file? And why is "PER_DEVICE_EVAL_BATCH_SIZE = BATCH_SIZE*2"? Thank you very much!

tae898 commented 2 years ago

Hi,

You are not bothering me. Any questions regarding my work are always welcome! I think you are talking about the file train-erc-text.yaml, right?

BATCH_SIZE: This depends on the memory of your GPU / CPU. AFAIK, I was only able to fit a batch size of 4 on a 16GB GPU.

NUM_TRAIN_EPOCHS: 5 was enough for me. There wasn't much performance improvement after 5 epochs.

HP_N_TRIALS: The higher this value, the better, since more hyperparameter search trials will be run.

> And why "PER_DEVICE_EVAL_BATCH_SIZE = BATCH_SIZE*2"?

AFAIK, in eval mode, gradient computation is not done, so I can fit a higher batch size. But this is not so important. You can leave it at PER_DEVICE_EVAL_BATCH_SIZE = BATCH_SIZE if you want.
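
To make that concrete, here is a rough sketch (not the exact code from this repo; the variable names and output directory are just illustrative) of how those yaml values typically end up in Huggingface TrainingArguments:

from transformers import TrainingArguments

# Illustrative values, not the repo's actual config.
BATCH_SIZE = 4          # what fits on a 16GB GPU in my case
NUM_TRAIN_EPOCHS = 5    # little improvement after 5 epochs

training_args = TrainingArguments(
    output_dir="output",  # hypothetical output directory
    num_train_epochs=NUM_TRAIN_EPOCHS,
    per_device_train_batch_size=BATCH_SIZE,
    # Evaluation does no gradient computation, so a larger batch fits in memory.
    per_device_eval_batch_size=BATCH_SIZE * 2,
)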

I think I'll refactor the code with PyTorch Lightning or something with better documentation. I should also dockerize the environment to ensure reproducibility. I can't guarantee when this will be done, since it's the end of the year and all ... haha. But follow me on GitHub and stay tuned!

XinyeDu1204 commented 2 years ago

Hello,

I am very interested in your code and look forward to your update in the future!

tae898 commented 2 years ago

> Because my GPU is only 10GB, it only allows me to set BATCH_SIZE to 1. My classmates told me that EPOCHS should increase as BATCH_SIZE decreases, but when I increase EPOCHS to 8, the performance of the model gradually deteriorates after the fourth epoch, so is there no necessary connection between BATCH_SIZE and EPOCHS?

Yeah, I also used to train this on a 10GB RTX 2080 Ti, and a batch size of 1 was the maximum. In my implementation, the number of epochs is independent of the batch size; one does not affect the other. In general, num_steps = num_samples / batch_size per epoch, where num_samples is the amount of data the optimizer observes in one epoch. I think your classmate confused num_steps with num_epochs. num_steps does indeed increase as batch_size decreases.
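
A quick illustration with made-up numbers:

num_samples = 10000        # hypothetical size of the training split
for batch_size in (4, 1):
    num_steps = num_samples // batch_size  # optimizer steps in one epoch
    print(batch_size, num_steps)           # 4 -> 2500 steps, 1 -> 10000 steps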

> In addition, does HP_N_TRIALS have an upper limit?

No, there is no upper limit. But I don't think this value has to be more than 10.

> What are the settings of HP_ONLY_UPTO, WEIGHT_DECAY and WARMUP_RATIO based on? Is there a function to calculate them? I remember that the greater WEIGHT_DECAY is, the greater the model loss; I don't know whether this is correct.

HP_ONLY_UPTO, WEIGHT_DECAY, and WARMUP_RATIO are all hyperparameters. Their "optimal" values highly depend on one's data, model, scheduling, etc. That's why I do automatic hyperparameter tuning: I don't want to run a grid search over them.

Ideally, you don't have to mess with the hyperparameters much, since I've already chosen decent values for them. If your goal is to reproduce my work, batch_size and learning_rate should ideally be the only hyperparameters you have to change. If you have to lower batch_size, then you may want to consider lowering learning_rate as well.
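
One common rule of thumb (not something this repo enforces, just a heuristic) is to scale the learning rate roughly linearly with the batch size:

# Linear scaling heuristic with illustrative values, not the repo's defaults.
base_batch_size = 4
base_learning_rate = 1e-5

new_batch_size = 1
new_learning_rate = base_learning_rate * new_batch_size / base_batch_size  # 2.5e-06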

XinyeDu1204 commented 2 years ago

OK! Thank you for helping me understand this better! I see that the adjustment of learning_rate is written in the train-erc-text-hp.py file. When I change batch_size to 1, do I need to change the source code in train-erc-text-hp.py?

tae898 commented 2 years ago

Actually you don't have to change the learning rate, since it's chosen by the automatic hyperparameter tuning in train-erc-text-hp.py:

def my_hp_space(trial):
    # Optuna search space: sample the learning rate log-uniformly between 1e-6 and 1e-4.
    return {
        "learning_rate": trial.suggest_float("learning_rate", 1e-6, 1e-4, log=True),
    }

# Run HP_N_TRIALS trials and keep the trial that minimizes the objective.
best_run = trainer.hyperparameter_search(
    direction="minimize", hp_space=my_hp_space, n_trials=HP_N_TRIALS)

XinyeDu1204 commented 2 years ago

OK, I have seen this code. Thank you very much for your patient guidance! I saw in the paper yesterday that a linear layer was added. Can you tell me where in the source code it is implemented?

[screenshot]
XinyeDu1204 commented 2 years ago

I found that when NUM_TRAIN_EPOCHS is set to 7 or 8, the best result comes when SEED is 4, and then it gradually decreases, as shown in the figure below. But when I set NUM_TRAIN_EPOCHS to 4, the model's results decreased greatly. Can I set NUM_TRAIN_EPOCHS to 7 and let the model run only the number of SEEDS (4) times?

[screenshot]

My understanding is that SEEDS increases as NUM_TRAIN_EPOCHS increases, as shown in the figure below.

[screenshot]

tae898 commented 2 years ago

> OK, I have seen this code. Thank you very much for your patient guidance! I saw in the paper yesterday that a linear layer was added. Can you tell me where in the source code it is implemented? [screenshot]

It's done here (line 82 in train-erc-text-full.py):

# num_labels=NUM_CLASSES makes Huggingface put a freshly initialized linear classification head on top of RoBERTa.
model = AutoModelForSequenceClassification.from_pretrained(
    model_checkpoint, num_labels=NUM_CLASSES)

Huggingface handles it nicely.
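
If you want to see that linear layer yourself, you can print the head after loading the model; for a RoBERTa checkpoint it is (roughly) a small classification module ending in a linear layer with NUM_CLASSES outputs:

# Sketch: inspect the classification head Huggingface attaches on top of RoBERTa.
# Expect something like a dense Linear layer, dropout, and a final Linear
# projection to NUM_CLASSES logits.
print(model.classifier)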

tae898 commented 2 years ago

> I found that when NUM_TRAIN_EPOCHS is set to 7 or 8, the best result comes when SEED is 4, and then it gradually decreases, as shown in the figure below. But when I set NUM_TRAIN_EPOCHS to 4, the model's results decreased greatly. Can I set NUM_TRAIN_EPOCHS to 7 and let the model run only the number of SEEDS (4) times? [screenshot] My understanding is that SEEDS increases as NUM_TRAIN_EPOCHS increases, as shown in the figure below. [screenshot]

SEED is not really a hyperparameter that should be tuned. It's for reproducibility and randomness.

Perhaps you can check out this post: https://vitalflux.com/why-use-random-seed-in-machine-learning/

XinyeDu1204 commented 2 years ago

So I don't need to change the SEEDS in train-erc-text.yaml? But why did you set them to "0, 1, 2, 3, 4" instead of "0, 1, 2, 3" or something else?

[screenshot]

Thank you!

tae898 commented 2 years ago

You don't have to change the seeds. It's a common practice to do five random runs and then report the mean and standard deviation of them. That's why I just chose five random seeds (i.e., 0, 1, 2, 3, 4), but these numbers can be anything.
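
Concretely, the reporting step is nothing more than this (the F1 numbers below are made up):

import numpy as np

# Hypothetical test F1 scores from five runs with seeds 0, 1, 2, 3, 4.
f1_scores = [0.655, 0.648, 0.662, 0.651, 0.658]
print(f"F1: {np.mean(f1_scores):.3f} +/- {np.std(f1_scores):.3f}")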

XinyeDu1204 commented 2 years ago

OK. Thank you very much!!

XinyeDu1204 commented 2 years ago

I am sorry to bother you again.... I still have some questions. As you said, SEEDS represents five random training runs. The following figure shows the F1 score on the test set, and each point represents one training run. Is the data used for each run all of the data in the test set, or is the dataset first divided into five parts, with one part used for each run?

[screenshot]

In addition, how should the performance of the model be evaluated: by selecting the run with the highest F1 score among the five random runs, or by averaging the results of the five runs? Thank you!

tae898 commented 2 years ago

Hi!

You are not bothering me at all, haha! I always appreciate questions, cuz it means that there is always something that I can improve!

Seeds are nothing but randomness. Let's say that I train for N epochs and choose the checkpoint that has the best performance on the validation split. It's hard to say that this is a good model, since it's just the one that happens to perform well on a certain distribution.

In the real world, everything is stochastic. Let's say that you are sampling a height from the Dutch population, and you happen to sample a male who's 2.5 meters tall. This does NOT represent the real average. You have to sample more people to be confident in your estimate of the average.

That being said, seed i produces one set of numbers on the train / val / test splits. Again, you have to "sample" these numbers multiple times with different seeds to be more certain about their mean and std.

To be concrete: seed i will train for N epochs, choose the checkpoint with the best validation score, and evaluate that checkpoint on the test split. Since a single run is not enough, you'll have to repeat this with different seeds to get better estimates.

tae898 commented 2 years ago

I feel like this is not a good enough answer. I will make a video soon and clarify things.

XinyeDu1204 commented 2 years ago

Thank you so much for your patient answers again! Does the checkpoint you mentioned mean the hyperparameters? And this is my understanding: after seed i runs N epochs on a random part of the validation set, the best-performing checkpoint is selected and applied to a random part of the test set. And the validation / test set used for each seed i is a random part. Is this right?

XinyeDu1204 commented 2 years ago

I just looked again and found that checkpoints are the files saved by the model every time it runs.

[screenshot]

Based on the 5 epochs I set, each time seed i runs, the dataset is divided into 5 parts (as shown by the blue arrow in the figure below), and then the five parts are predicted in turn, and the training results improve in turn (as shown by the red arrow in the figure below). However, if these five parts of the dataset are randomly selected from the total dataset, why do the results increase in turn?

[screenshots]

tae898 commented 2 years ago

I think you are confusing this with cross-validation.

In my training scheme, train / val / test splits are just what they are. They are not further split into folds.

XinyeDu1204 commented 2 years ago

So the dataset used in the N epochs is the whole validation dataset? But why does the F1 score keep getting better from the first epoch? If the dataset used in each of the N epochs is the same, I would expect the training results to be similar....

[screenshots]

tae898 commented 2 years ago

In general, as more backprops are done, the performance metric improves on the training data split. The number of backprops increases with the number of steps or the number of epochs.
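
In made-up numbers:

# Illustrative only: total parameter updates grow with both epochs and steps per epoch.
num_samples = 10000
batch_size = 1
num_epochs = 5
steps_per_epoch = num_samples // batch_size      # 10000 optimizer steps per epoch
total_updates = steps_per_epoch * num_epochs     # 50000 backprops in total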

XinyeDu1204 commented 2 years ago

Got it! Thank you! So after running with i seeds, we get i results. Is the correct evaluation method to take the result with the highest F1 score as the measure of the model, or do we still need to calculate the average of the i results?

tae898 commented 2 years ago

Every seed is its own independent run. For each seed, you choose the checkpoint with the best validation metric. For the final results, you report the average of the test metrics across the seeds.
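
Putting the whole procedure together, a pseudocode-level sketch (train, evaluate, and set_seed here are placeholders, not this repo's actual functions) would look like this:

import numpy as np

test_f1s = []
for seed in [0, 1, 2, 3, 4]:
    set_seed(seed)                        # placeholder: fix all RNGs for this run
    checkpoints = train(num_epochs=N)     # placeholder: train, saving a checkpoint per epoch
    # Pick the checkpoint with the best validation metric ...
    best = max(checkpoints, key=lambda c: evaluate(c, split="val"))
    # ... and evaluate only that checkpoint on the test split.
    test_f1s.append(evaluate(best, split="test"))

# Report the mean (and std) over seeds, not the single best run.
print(np.mean(test_f1s), np.std(test_f1s))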