Hi,
You are not bothering me. Any questions regarding my work are always welcome!
I think you are talking about the file train-erc-text.yaml, right?
BATCH_SIZE: This depends on the memory of your GPU / CPU. AFAIK, I was only able to fit a batch size of 4 on a 16GB GPU.
NUM_TRAIN_EPOCHS: 5 was enough for me. There wasn't much performance improvement after 5 epochs.
HP_N_TRIALS: The higher this value is, the better, since it will do more hyperparameter search.
And why "PER_DEVICE_EVAL_BATCH_SIZE = BATCH_SIZE*2"?
AFAIK, in eval mode no gradient computation is done, so a higher batch size fits in memory. But this is not so important. You can leave this value at PER_DEVICE_EVAL_BATCH_SIZE = BATCH_SIZE if you want.
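To make this concrete, here is a minimal sketch (illustrative values only, not the actual contents of train-erc-text.yaml; check the file in the repo for the real ones) of how these settings relate to each other:

# Illustrative values only; the actual keys/values live in train-erc-text.yaml.
BATCH_SIZE = 4                                # largest that fit on a 16GB GPU
NUM_TRAIN_EPOCHS = 5                          # little improvement observed beyond 5
HP_N_TRIALS = 10                              # more trials = more hyperparameter search
PER_DEVICE_EVAL_BATCH_SIZE = BATCH_SIZE * 2   # eval needs no gradients, so more fits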
I think I'll refactor the code with PyTorch Lightning or something with better documentation. I should also dockerize the environment to ensure reproducibility. I can't guarantee by when this will be done, since it's the end of the year and all ... haha. But follow me on GitHub and stay tuned!
Hello,
Because my GPU is only 10GB, it only allows me to set BATCH_SIZE to 1. My classmates told me that EPOCHS should increase as BATCH_SIZE decreases, but when I increase EPOCHS to 8, the performance of the model gradually deteriorates after the fourth epoch, so is there no necessary connection between BATCH_SIZE and EPOCHS?
In addition, does HP_N_TRIALS have an upper limit?
What are the settings of HP_ONLY_UPTO, WEIGHT_DECAY and WARMUP_RATIO based on? Is there a function to calculate them? I remember that the greater WEIGHT_DECAY is, the greater the model loss, but I don't know whether this is correct.
I am very interested in your code and look forward to your future updates!
Because my GPU is only 10GB, it only allows me to set BATCH_SIZE to 1. My classmates told me that EPOCHS should increase as BATCH_SIZE decreases, but when I increase EPOCHS to 8, the performance of the model gradually deteriorates after the fourth epoch, so is there no necessary connection between BATCH_SIZE and EPOCHS?
Yeah, I also used to train this on a 10GB RTX 2080 Ti, and a batch size of 1 was the maximum.
In my implementation, number of epochs is irrelevant to the batch size. One does not affect the other.
In general, num_samples / batch_size = num_steps per epoch, where num_samples is the amount of data that the optimizer will observe in one epoch. I think your classmate confused num_steps with num_epochs. num_steps indeed increases as batch_size decreases.
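To make the arithmetic concrete, a small sketch with made-up numbers (not from the actual dataset):

import math

# made-up numbers, just to show the relation
num_samples = 10000   # training samples seen once per epoch
batch_size = 4
num_epochs = 5

steps_per_epoch = math.ceil(num_samples / batch_size)    # 2500
total_steps = steps_per_epoch * num_epochs                # 12500

# Halving the batch size doubles the number of steps, but the number of
# epochs (full passes over the data) stays whatever you set it to.
print(steps_per_epoch, total_steps)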
In addition, does HP_N_TRIALS have an upper limit?
No, there is no upper limit. But I don't think this value has to be more than 10.
What are the settings of HP_ONLY_UPTO, WEIGHT_DECAY and WARMUP_RATIO based on? Is there a function to calculate them? I remember that the greater WEIGHT_DECAY is, the greater the model loss, but I don't know whether this is correct.
HP_ONLY_UPTO, WEIGHT_DECAY, and WARMUP_RATIO are all hyperparameters. Their "optimal" values highly depend on one's data, model, scheduling, etc. That's why I do automatic hyperparameter tuning: I don't want to run a grid search on them.
Ideally, you shouldn't have to mess with the hyperparameters much, since I've already chosen decent values for them. If your goal is to reproduce my work, batch_size and learning_rate should be the only hyperparameters that you have to change. Let's say that you have to lower batch_size; then perhaps you want to consider lowering learning_rate as well.
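One common heuristic (my suggestion here, not something hard-coded in the repo) is to scale the learning rate roughly linearly with the batch size:

# Hypothetical numbers: if a learning rate of 1e-5 worked at batch size 4,
# you might try roughly a quarter of that at batch size 1.
reference_batch_size = 4
reference_lr = 1e-5  # placeholder; use whatever the HP search found for you

my_batch_size = 1
my_lr = reference_lr * my_batch_size / reference_batch_size  # 2.5e-06
print(my_lr)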
OK! Thank you for helping me understand!
I see that the adjustment of learning_rate is written in the train-erc-text-hp.py file. When I change batch_size to 1, do I need to change the source code in the train-erc-text-hp.py file?
Actually you don't have to change the learning rate, since it's chosen by the automatic hyperparameter tuning in train-erc-text-hp.py:
def my_hp_space(trial):
    return {
        "learning_rate": trial.suggest_float("learning_rate", 1e-6, 1e-4, log=True),
    }

best_run = trainer.hyperparameter_search(
    direction="minimize", hp_space=my_hp_space, n_trials=HP_N_TRIALS)
OK. I have seen this code~ Thank you very much for your patient guidance! I saw in the paper yesterday that a linear layer was added. Can you tell me which part of the source code implements it?
I found that when NUM_TRAIN_EPOCHS is set to 7 or 8, the best result is when SEED is 4, and then it gradually decreases, as shown in the figure below. But when I set NUM_TRAIN_EPOCHS to 4, the result of the model decreased greatly. Can I set NUM_TRAIN_EPOCHS to 7 and let the model run only the number of SEEDS (4) times?
My understanding is that SEEDS increases with the increase of NUM_TRAIN_EPOCHS, as shown in the figure below.
OK. I have seen this code~ Thank you very much for your patient guidance! I saw in the paper yesterday that a linear layer was added. Can you tell me which part of the source code implements it?
It's done here (line 82 in train-erc-text-full.py):
model = AutoModelForSequenceClassification.from_pretrained(
    model_checkpoint, num_labels=NUM_CLASSES)
Huggingface handles it nicely.
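If you want to see the added layer yourself, here is a small sketch (not code from the repo; "roberta-base" and num_labels=7 are placeholder values):

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=7)

# For RoBERTa this prints the classification head (dense + out_proj linear
# layers) that maps the pooled hidden state to the class logits.
print(model.classifier)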
I found that when NUM_TRAIN_EPOCHS is set to 7 or 8, the best result is when SEED is 4, and then it gradually decreases, as shown in the figure below. But when I set NUM_TRAIN_EPOCHS to 4, the result of the model decreased greatly. Can I set NUM_TRAIN_EPOCHS to 7 and let the model run only the number of SEEDS (4) times? My understanding is that SEEDS increases with the increase of NUM_TRAIN_EPOCHS, as shown in the figure below.
SEED is not really a hyperparameter that should be tuned. It's there to make the randomness reproducible.
Perhaps you can check out this post: https://vitalflux.com/why-use-random-seed-in-machine-learning/
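To illustrate what a seed does, a minimal sketch (not the repo's exact code) of fixing the randomness so that weight initialization, shuffling, dropout, etc. become reproducible:

import random
import numpy as np
import torch

def set_seed(seed: int):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

set_seed(0)  # run the whole pipeline once per seed, e.g. for seeds 0..4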
So I don't need to change the number of SEEDS in train-erc-text.yaml? But why did you set them to "0, 1, 2, 3, 4" instead of "0, 1, 2, 3" or something else?
Thank you!
You don't have to change the seeds. It's a common practice to do five random runs and then report the mean and standard deviation of them. That's why I just chose five random seeds (i.e., 0, 1, 2, 3, 4), but these numbers can be anything.
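For example (illustrative numbers, not results from this repo), reporting five seeded runs looks like this:

import numpy as np

test_f1_per_seed = [0.64, 0.66, 0.65, 0.63, 0.66]  # one test F1 per seed 0..4
print(f"{np.mean(test_f1_per_seed):.3f} +/- {np.std(test_f1_per_seed):.3f}")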
OK. Thank you very much!!
I am sorry to bother you again.... I still have some questions. As you said, SEEDS represents five random training runs. The following figure shows the F1 score on the test set, and each point represents one training run. Does each run use all the data in the test set, or is the dataset first divided into five parts, with one part used for each run? In addition, how should the performance of the model be evaluated: by selecting the run with the highest F1 score among the five, or by averaging the results of the five runs? Thank you!
Hi!
You are not bothering me at all, haha! I always appreciate questions, cuz it means that there is always something that I can improve!
Seeds are nothing but randomness. Let's say that I train N epochs and choose the checkpoint that has the best performance on the validation split. It's hard to say that this is a good model, since it's just the one that happens to perform well on a certain distribution.
In the real world, everything is stochastic. Let's say that you are sampling a height from the Dutch population, and you happen to sample a male who's 2.5 meters tall. This does NOT represent the real average. You have to sample more people to be certain about the estimate of the average.
That being said, each seed i gives you one set of numbers on the train / val / test splits. Again, you have to "sample" these numbers multiple times with different seeds to be more certain about their mean and std.
In other words, seed i will train N epochs, choose the checkpoint that has the best validation score, and evaluate that checkpoint on the test split. Since one run is not enough, you'll have to try different seeds to get better estimates.
I feel like this is not a good enough answer. I will make a video soon and clarify things.
Thank you so much for your patient answers again!
Does the checkpoint you mentioned mean the hyperparameters?
And this is my understanding:
After seed i runs N epochs on a random part of the validation set, the checkpoint that performs best is selected and applied to a random part of the test set, and the validation/test parts used for each seed i are chosen randomly.
Is this right?
I just looked again and found that checkpoints are the files saved by the model every time it runs.
According to the 5 epochs I set, each time seed i runs, the dataset is divided into 5 parts (as shown by the blue arrow in the figure below), and then the five parts are predicted in turn, and the training results improve in turn (as shown by the red arrow in the figure below). However, if these five parts of the dataset are randomly selected from the total dataset, why do the results improve in turn?
I think you are confusing this with cross-validation.
In my training scheme, train / val / test splits are just what they are. They are not further split into folds.
So the dataset used in the N epochs is the whole validation dataset? But why does the F1 score get better and better from the first epoch? If the dataset used is the same in each of the N epochs, I think the training results should be similar....
In general, as more backprop steps are done, the performance metric improves on the training data split. The number of backprop steps increases with the number of steps or the number of epochs.
Got it! Thank you!
So after running with i seeds, we get i results. Is the correct evaluation method to take the one with the highest F1 score as the standard for judging the model, or do we still need to calculate the average of the i results?
Every seed is its own independent run. For every seed, you choose the model that has the best validation evaluation metric. And for the final results, you report the average of the test evaluation metrics.
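In pseudocode-like form, the protocol looks roughly like this (run_one_seed and the random scores are placeholders for illustration, not functions from the repo):

import numpy as np

def run_one_seed(seed, num_epochs=5):
    """Train for num_epochs, keep the checkpoint with the best val F1,
    and return the test F1 of that checkpoint."""
    rng = np.random.default_rng(seed)
    best_val, best_test = -1.0, -1.0
    for epoch in range(num_epochs):
        # stand-ins for the real validation / test F1 of this epoch's checkpoint
        val_f1, test_f1 = rng.uniform(0.5, 0.7), rng.uniform(0.5, 0.7)
        if val_f1 > best_val:
            best_val, best_test = val_f1, test_f1
    return best_test

# one independent run per seed; report the mean (and std) of the test scores
test_f1s = [run_one_seed(seed) for seed in (0, 1, 2, 3, 4)]
print(np.mean(test_f1s), np.std(test_f1s))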
I'm sorry to bother you again. Can you tell me how to determine the settings of "epoch", "batch_size" and "HP_N_TRIALS" in the yaml file? And why is "PER_DEVICE_EVAL_BATCH_SIZE = BATCH_SIZE*2"? Thank you very much!