huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers

The training loss (logging steps) will drop suddenly after each epoch? Help me plz! Orz #18730

Closed lchustc closed 2 years ago

lchustc commented 2 years ago

System Info

transformers version: 4.17.0
Python version: 3.7.0
torch version: 1.10.1

Who can help?

No response

Information

Tasks

Reproduction

CLIP (https://github.com/huggingface/transformers/tree/main/examples/pytorch/contrastive-image-text).

I have implemented a custom Dataset to train with, but I have found that after each epoch the training loss drops suddenly. The Dataset overrides three methods (__init__, __getitem__ and __len__), and I couldn't figure out the reason for this behavior.

I think the data is shuffled properly (checked) and the learning_rate decays smoothly (observed). I would appreciate it if you could spare some time to help me.
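Roughly, a custom image-text Dataset with those three methods looks something like the sketch below (hypothetical names, not the actual code used in this training run):

from torch.utils.data import Dataset

class ImageTextDataset(Dataset):
    # Minimal sketch: "pairs", "image_processor" and "tokenizer" are placeholders.
    def __init__(self, pairs, image_processor, tokenizer):
        self.pairs = pairs  # list of (image, caption) pairs
        self.image_processor = image_processor
        self.tokenizer = tokenizer

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        image, caption = self.pairs[idx]
        pixel_values = self.image_processor(image, return_tensors="pt")["pixel_values"][0]
        text = self.tokenizer(caption, padding="max_length", truncation=True, return_tensors="pt")
        return {
            "pixel_values": pixel_values,
            "input_ids": text["input_ids"][0],
            "attention_mask": text["attention_mask"][0],
        }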

The picture below is drawn from trainer_state.json:

image
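For reference, a curve like this can be plotted from trainer_state.json along these lines (a sketch assuming the default log_history entries that Trainer writes; the path is a placeholder):

import json
import matplotlib.pyplot as plt

# Each training log entry in log_history has "step" and "loss" fields.
state = json.load(open("output_dir/trainer_state.json"))
logs = [e for e in state["log_history"] if "loss" in e]
plt.plot([e["step"] for e in logs], [e["loss"] for e in logs])
plt.xlabel("step")
plt.ylabel("training loss")
plt.savefig("loss_curve.png")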

Expected behavior

Figure out the reason.

lchustc commented 2 years ago

I also found that the loss at the last step of each epoch may drop suddenly when the size of the dataset isn't an integer multiple of the batch size, because the clip_loss depends on the batch and the last batch will be duplicated unless "dataloader_drop_last" is set.

lchustc commented 2 years ago

@ydshieh

ydshieh commented 2 years ago

In my experience, it could be that the loss is calculated as the average of the step losses within each epoch. For example, suppose there are 3 epochs and each epoch has 1000 steps. In the 1st epoch, the logged loss is the average over the steps completed so far. When the 2nd epoch starts, the accumulator is reset, so the loss at the 1st step of the 2nd epoch is not the average of all previous steps (in the 1st epoch) plus the current step, but just the loss over that single batch.
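A tiny numeric illustration of that effect (made-up loss values, purely for illustration, not taken from the Trainer source):

import numpy as np

# Made-up per-step losses that decrease slowly within one epoch.
per_step_loss = np.linspace(5.0, 3.0, num=1000)

# The logged value is the running average since the last reset.
running_avg = per_step_loss.cumsum() / np.arange(1, 1001)

print(running_avg[-1])    # last logged value of epoch 1 (~4.0)
print(per_step_loss[-1])  # a fresh value right after a reset (~3.0) -> sudden apparent drop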

If this is the case, the loss picture you shared is normal.

Could you check if this is the case in your training? You might have to read the source code (along with your provided training arguments) to verify.

Otherwise, could you share your training arguments, so we can have a look? Thanks.

lchustc commented 2 years ago

Thanks. The Trainer does reset the loss in the function "_maybe_log_save_evaluate". But I still don't understand this phenomenon, because I get a smooth loss curve when I train BERT with Trainer and a Dataset. Anyway, I'll figure it out myself, thanks a lot!

image image

lchustc commented 2 years ago

Hello, I have read the source code carefully (line 2025 in https://github.com/huggingface/transformers/blob/main/src/transformers/trainer.py) and found that the loss is reset at every logging step rather than at the start of every epoch. Here is my loss curve from training BERT (https://github.com/huggingface/transformers/tree/main/examples/pytorch/language-modeling).

image

I think the loss curves of BERT and CLIP should either both be smooth or both not be smooth when using Trainer. My training arguments are as follows:

image
ydshieh commented 2 years ago

Hi @lchwhut Do you use the same training arguments for both BERT and CLIP (other than the training script and datasets of course)? Could you also share the one you used for BERT training?

lchustc commented 2 years ago

Hi @ydshieh Almost the same; here are the training arguments for BERT:

image

I also trained CLIP following the README (https://github.com/huggingface/transformers/blob/main/examples/pytorch/contrastive-image-text/README.md) and obtained a non-smooth loss curve (but the BERT loss is smooth). Could you spare some time to train CLIP and check the loss in trainer_state.json? Thanks.

lchustc commented 2 years ago

It seems that if I train the BERT demo (run_mlm.py, https://github.com/huggingface/transformers/tree/main/examples/pytorch/language-modeling) without any modification (other than the dataset) I get a smooth loss curve. But if I train the CLIP demo (run_clip.py, https://github.com/huggingface/transformers/tree/main/examples/pytorch/contrastive-image-text) without any modification I get a non-smooth loss curve.

ydshieh commented 2 years ago

@lchwhut It is not clear to me what could cause this difference. I see you log at each step while training CLIP, and the training loss is reset once the logging is performed.

Do you have the logged values for each step? Could you see at which steps the loss value drops? Do those steps correspond to the ends of epochs?

I would suggest asking the question on the forum to see if someone has faced the same issue and figured out the cause.

ydshieh commented 2 years ago

I found you mentioned earlier:

I also found that the loss at the last step of each epoch may drop suddenly when the size of the dataset isn't an integer multiple of the batch size, because the clip_loss depends on the batch and the last batch will be duplicated unless "dataloader_drop_last" is set.

Could you explain a bit what you mean by "the last batch will be duplicated"? Usually neither the Trainer class nor the datasets duplicate examples.

lchustc commented 2 years ago

I specified logging_steps=1 and dataloader_drop_last=True while training CLIP on a small dataset, and it's clear from the loss curve that I trained for 10 epochs. The loss curve was almost the same when I trained the CLIP demo without modification yesterday.

image

I checked the logged values for each step and found that the loss value (with dataloader_drop_last=True) dropped significantly at the first step of each epoch; details are below:

image
lchustc commented 2 years ago

Forget what I said about "the last batch will be duplicated". Really sorry about that. What I meant is that maybe, if I specify dataloader_drop_last=False, the number of samples in the last batch will be smaller than batch_size. In this case, the clip_loss drops significantly because it is greatly affected by the batch size.
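To see why the batch size matters so much, here is a rough sketch of a CLIP-style contrastive loss (an illustration, not the exact code in modeling_vision_text_dual_encoder.py; clip_style_loss and the random embeddings are placeholders): each image has to pick its matching text out of the N items in the batch, so for an untrained model the loss is roughly log(N) and collapses to 0 when N=1.

import torch
import torch.nn.functional as F

def clip_style_loss(image_emb, text_emb):
    # Cosine similarities over the batch; the target for row i is index i.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t()
    targets = torch.arange(logits.size(0))
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

for n in (64, 8, 1):  # a smaller final batch gives a much smaller loss
    img, txt = torch.randn(n, 512), torch.randn(n, 512)
    print(n, clip_style_loss(img, txt).item())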

ydshieh commented 2 years ago

Hi, I think this probably depends on the dataset (or, more precisely, the way you use the dataset), rather than being an issue in the Trainer class or the training script.

It would be better to post it on the forum.

(Also, since you mentioned that you override some dataset methods, it would be nice to share what changes you made there. Otherwise, try to see if you can reproduce the same situation with the original training script run_clip.py.)

lchustc commented 2 years ago

Hello, I posted it on the forum (https://discuss.huggingface.co/t/the-training-loss-logging-steps-will-drop-suddenly-after-each-epoch-help-me-plz-orz/22129/2).

Actually, I have trained CLIP without any modification (original training script run_clip.py) and obtained a non-smooth loss curve. Can you try training CLIP following the README (https://github.com/huggingface/transformers/tree/main/examples/pytorch/contrastive-image-text)? I think you will reproduce the same situation. For your convenience, here is my loss curve of the CLIP demo (original training script run_clip.py):

image image

I am glad to share my code:

image
n9Mtq4 commented 2 years ago

I've encountered the same thing. My hypothesis is that the drop in loss at epoch boundaries is an indication that your model is memorizing the training data. Interestingly, it's not necessarily overfitting: in my case, the validation loss continued to drop, just not as fast as the training loss.

My reasoning is as follows:

Consider training step $i$. The loss will be lower for step $i+1$ iff what the model has learned in step $i$ generalizes and helps it make a better prediction for step $i+1$. This happens early on during training, which is why your loss decreases in the middle of the first 2 epochs.

But, once the model has learned all the generalizable information it can, what it has learned in step $i$ will not help with step $i+1$. When this happens, the loss will plateau. It will still memorize the data, so when it sees the same data point again in the next epoch, the loss will be lower. This is why there is a sharp drop in loss at the start of each epoch.

I also hypothesize that if the loss is increasing in the middle of an epoch, that would indicate that the learning from step $i$ hurts the prediction for step $i+1$, which is an indication of overfitting.

I would suggest looking at the validation loss to make sure you aren't overfitting.

lchustc commented 2 years ago

Thanks for sharing.

I still don't understand what you said about the model memorizing the training data (or, more precisely, why this causes a sharp drop in loss only at the start of each epoch). If there is a sharp drop in loss only at the start of each epoch, does that mean the model only memorizes the data at the start of each epoch? But the Trainer shuffles the data every epoch, so I can't figure it out. Could you explain more about that? Another phenomenon I found (the training loss of the BERT demo drops smoothly, and both CLIP and BERT use Trainer and load_dataset) makes me think that the sharp drop in loss at the start of each epoch may be related to the calculation of the CLIP_LOSS.

As you can see above, the loss at the last step of each epoch drops even more significantly (it can sometimes be zero when the batch size of the last step is 1) when the size of the dataset is not an integer multiple of batch_size, because the CLIP_LOSS depends on the batch size. In my opinion, there could be a variable (or something else) that changes along with the epochs, i.e. the variable becomes smaller when a new epoch starts and is then multiplied with the loss. I am checking the calculation of the CLIP_LOSS now (line 386: https://github.com/huggingface/transformers/blob/main/src/transformers/models/vision_text_dual_encoder/modeling_vision_text_dual_encoder.py).

n9Mtq4 commented 2 years ago

I don't think it's related to CLIP, as I've seen this happen with multiple models. Here's the training loss with OPT-350M; OPT-1.3B and GPT-neo-125M also showed this behavior. The larger the model, the larger the loss drops were. Unfortunately, I no longer have the tensorboard logs for those runs. I've also seen this to a lesser degree with a large MLP and my own PyTorch training loop, so I don't think it's an issue with HF transformers.

opt-350m training loss

The model is learning at every step, not just at the start of the epoch. Consider a dataset of a, b, c where learning from one data point doesn't help improve the predictions for the others (e.g., learning from data point a doesn't help improve the predictions on data points b and c). Let's look at what training a model on this dataset would look like.

For each epoch the data is shuffled, but it will go through all the data before repeating. So for 4 epochs, an example training session could look like this:

Step | Epoch | Data Point | Number of times the model has seen the data point | Loss
-----|-------|------------|----------------------------------------------------|-----
1    | 1     | c          | 0                                                  | 5
2    | 1     | b          | 0                                                  | 5
3    | 1     | a          | 0                                                  | 5
4    | 2     | b          | 1                                                  | 4
5    | 2     | a          | 1                                                  | 4
6    | 2     | c          | 1                                                  | 4
7    | 3     | c          | 2                                                  | 3
8    | 3     | b          | 2                                                  | 3
9    | 3     | a          | 2                                                  | 3
10   | 4     | a          | 3                                                  | 2
11   | 4     | b          | 3                                                  | 2
12   | 4     | c          | 3                                                  | 2

Notice that during any single epoch, since one data point doesn't improve the predictions for the others, the loss stays the same. But as the model trains, it remembers the correct prediction for each data point, so that when it sees that data point again (which happens in the next epoch) it produces a better prediction for that specific data point. So the loss for any given data point is correlated with how many times the model has seen that specific data point, which increments at the start of each epoch.

As for why it's not happening with your BERT model, perhaps the model is too small, you have sufficient data to prevent memorization, or the dataset doesn't have this property.

I'll point out again that this is my best guess as to why this is happening, and I haven't done any experimentation to confirm that this is the reason. You could try training by sampling your dataset with replacement so that a single data point can appear multiple times in the same epoch. I would expect that the drop in loss at epoch starts wouldn't be visible, although the memorization would still occur.
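For what it's worth, one way to try that experiment with the HF Trainer could be a sketch like the one below (it overrides the private _get_train_sampler method, so it relies on Trainer internals; the class name is made up):

import torch
from transformers import Trainer

class WithReplacementTrainer(Trainer):
    def _get_train_sampler(self):
        # Sample with replacement so a data point can repeat within one epoch.
        return torch.utils.data.RandomSampler(
            self.train_dataset,
            replacement=True,
            num_samples=len(self.train_dataset),
        )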

lchustc commented 2 years ago

@n9Mtq4 Thank you so much for taking the time to explain in detail! If this issue happens with multiple models and with your own PyTorch training loop, I think your reasoning is very plausible! The HF BERT demo doesn't reproduce the same situation; I guess it's because the data collator HF uses randomly masks the tokens every epoch. This means the HF BERT model can hardly see the same data in a new epoch.

image
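That re-masking can be checked directly with the MLM collator (a small sketch; bert-base-uncased is just an example checkpoint):

from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
collator = DataCollatorForLanguageModeling(tokenizer=tok, mlm=True, mlm_probability=0.15)

batch = [tok("the quick brown fox jumps over the lazy dog")]
# The same example gets a different random mask on each call, so every epoch
# effectively presents a different training target to the model.
print(collator(batch)["input_ids"])
print(collator(batch)["input_ids"])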

@ydshieh

github-actions[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

kingnobro commented 1 year ago

Hi, I am also facing the same problem. How did you solve it? Thanks.

weberxie commented 1 year ago

Hi, have you solved this problem? Thanks.

kingnobro commented 1 year ago

Hi, have you solved this problem? Thanks.

Yes. I think the main cause is the data representation. Specifically, the current representation of your data may not be good, so the model cannot learn the data distribution correctly. In my case, I eventually found another way to represent my data, and the problem was solved.

As for how to modify the data representation, I think that is closely tied to your specific problem.

yuanzhedong commented 1 year ago

I've seen the same issue when training the 7B model with Stanford Alpaca:

image
pinkponk commented 12 months ago

This could also be a case of running means being used when calculating the losses/metrics. I know people here are using PyTorch and I don't know how it is handled in PyTorch, but in Keras the losses/metrics are all aggregated using running means.

see https://stackoverflow.com/questions/72058858/moving-averaging-of-loss-during-training-in-keras/77322320#77322320

@yuanzhedong your loss graph does not look like a running mean, as it is not declining during the epoch, which would need to happen for it to show the step-like behavior.

b96705008 commented 10 months ago

Just wondering if you were using HuggingFace IterableDataset or original Dataset? Not sure if this matters.

lchustc commented 10 months ago

Just wondering if you were using HuggingFace IterableDataset or original Dataset? Not sure if this matters.

I've tried both, and encountered the same situation.

pinkponk commented 10 months ago

This is most likely an artifact of using running averages when logging the loss. I had the same issue using Keras. Make sure you are plotting the raw per-batch loss if you want to see the noisy true loss.

Basically, this is probably just a logging issue.
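With the HF Trainer, one way to get closer to that raw signal is to log every step, so each logged value covers exactly one optimizer step (a sketch; output_dir is a placeholder):

from transformers import TrainingArguments

# With logging_steps=1 each logged "loss" is averaged over a single step,
# i.e. it is the raw (noisy) per-batch loss rather than a window average.
args = TrainingArguments(
    output_dir="out",
    logging_strategy="steps",
    logging_steps=1,
)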

davidsvaughn commented 1 month ago

This is often accompanied by the same "staircasing" effect in the validation loss, but in the opposite (upward) direction. See: https://discuss.huggingface.co/t/why-my-training-loss-drops-at-epoch-boundaries/14431/2

Apparently turning off training-data re-shuffling (re-shuffling after each epoch) can mitigate this effect, but it's unclear why. I have verified that this works, at least on my data. However, I'm not sure if it has any negative side effects, since re-shuffling is generally considered a good idea with SGD (and variants). I thought it helped avoid getting stuck in local minima, so I wonder if the minimum loss would be higher without re-shuffling...

To turn off shuffling in the Trainer class, you can subclass Trainer and override the get_train_dataloader method (see: https://discuss.huggingface.co/t/non-shuffle-training/6986/2). For example:

import torch
from transformers import Trainer

class NoShuffleTrainer(Trainer):
    def get_train_dataloader(self):
        # Same dataloader settings as Trainer, but with a non-shuffling sampler.
        return torch.utils.data.DataLoader(
            self.train_dataset,
            batch_size=self.args.train_batch_size,
            sampler=self._get_train_sampler(),
            collate_fn=self.data_collator,
            drop_last=self.args.dataloader_drop_last,
            num_workers=self.args.dataloader_num_workers,
            pin_memory=self.args.dataloader_pin_memory,
        )

    def _get_train_sampler(self):
        # In the distributed case keep the DistributedSampler, but disable shuffling.
        if self.args.world_size > 1:
            return torch.utils.data.distributed.DistributedSampler(self.train_dataset, shuffle=False)
        # Single-process case: iterate over the dataset in its original order.
        return torch.utils.data.SequentialSampler(self.train_dataset)
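Then train with the subclass in place of Trainer (model, training_args, train_dataset and data_collator are placeholders for whatever the training script already builds):

trainer = NoShuffleTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    data_collator=data_collator,
)
trainer.train()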