sunnweiwei / GenRet

Learning to Tokenize for Generative Retrieval (NeurIPS 2023)

Possible training bugs #11

Open cgr71ii opened 6 months ago

cgr71ii commented 6 months ago

Dear authors,

First, thank you for sharing the code!

I've observed 2 possible bugs in the training loop, and I'd appreciate it if you could either confirm them or explain the reasoning behind the corresponding pieces of code.

  1. In https://github.com/sunnweiwei/GenRet/issues/7 you mentioned that you trained using a multi-GPU (single node, I guess) configuration, with the exception of the testing code (am I wrong in thinking that this part only affects the test and test_dr functions?). I managed to separate the testing part from the multi-GPU training using accelerator.is_local_main_process, but the method safe_save kept returning None for every process except the main one (this caused the loading of the different components to fail for all processes but the main one). I think this is a bug. Assuming it is, I managed to fix it using accelerate.utils.broadcast_object_list (a sketch of this fix is shown right after this list).
  2. In the function train you set epoch_step = len(data_loader) // in_batch_size. If I'm not wrong, you manually split your dataset in BiDataset using in_batch_size, and use the variable batch_size with the PyTorch data loader. I think epoch_step is supposed to contain the number of steps per epoch, but with respect to batch_size, not in_batch_size, because we are iterating over the data loader. In my debugging runs I observed that not all the steps of an epoch are executed, only part of them. Specifically, only len(data_loader) // in_batch_size steps run, which does not cover 100% of the dataset (the higher in_batch_size, the fewer steps are executed). If I'm right about this bug, it does not affect your conclusions, but your results might be even better because not all the training data is being used. The fix I applied, assuming this is a bug, is len(data_loader) // batch_size, which, since batch_size = 1, basically means that you see all the data batches.
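
For reference, this is a minimal sketch of the fix I applied for point 1 (it mirrors the safe_save changes in the diff I attach in a later comment; the function signature is simplified): only the main process writes the checkpoint, and the resulting path is then broadcast to every other process, so safe_save no longer returns None on non-main ranks.

```python
import accelerate
import torch


def safe_save(accelerator, model, save_path, epoch):
    # Placeholder so that broadcast_object_list sees the same structure on every rank.
    last_checkpoint = [None]
    if accelerator.is_local_main_process:
        unwrapped = accelerator.unwrap_model(model)
        torch.save(unwrapped.state_dict(), f'{save_path}/{epoch}.pt')
        last_checkpoint = [f'{save_path}/{epoch}.pt']
    accelerator.wait_for_everyone()
    # Broadcast the checkpoint path from the main process (rank 0) to all processes.
    last_checkpoint = accelerate.utils.broadcast_object_list(last_checkpoint, from_process=0)
    return last_checkpoint[0]
```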

Thank you!

EDIT:

Regarding point 2, if I'm not wrong, if in_batch_size != 1 and step > (epoch + 1) * epoch_step: should be modified to if in_batch_size != 1 and step >= (epoch + 1) * epoch_step: in order to save the model once the epoch has finished (a small numeric check of this is shown below).
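
A tiny self-contained check (with made-up numbers, not the actual training loop) of why the strict comparison skips or delays the per-epoch save:

```python
# With epoch_step = 4 and 8 total steps (2 epochs), ">" never fires exactly at an
# epoch boundary, while ">=" saves right after each completed epoch.
epoch_step, total_steps = 4, 8

for name, fires in [('>', lambda step, epoch: step > (epoch + 1) * epoch_step),
                    ('>=', lambda step, epoch: step >= (epoch + 1) * epoch_step)]:
    epoch, save_steps = 0, []
    for step in range(1, total_steps + 1):
        if fires(step, epoch):
            save_steps.append(step)
            epoch += 1
    print(name, save_steps)
# '>'  -> [5]      (the save for epoch 0 arrives one step late; epoch 1 is never saved)
# '>=' -> [4, 8]   (one save right after each completed epoch)
```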

cgr71ii commented 6 months ago

Again regarding point 2, could the line len(data_loader) // batch_size be related to https://huggingface.co/docs/accelerate/en/usage_guides/gradient_accumulation (they use something similar, batch_size = len(x) // gradient_accumulation_steps, but I'm not sure whether it is equivalent to your code)? If that is the case, the value should again be batch_size (a constant with value 1) instead of in_batch_size, since you set gradient_accumulation_steps=1, shouldn't it? Wouldn't this mean that your actual batch size is 128 instead of 256 as specified in the paper? In the code the default value for the batch size is 128, and for a batch size of 256 you would need gradient_accumulation_steps=2 instead, wouldn't you?

EDIT:

Sorry about this comment, but I'm still not sure about the actual batch size when using multi-GPU training: https://discuss.huggingface.co/t/what-is-my-batch-size/41390/2?u=cgr71ii . As far as I understand from the HF Accelerate documentation, if your batch size was 128, gradient_accumulation_steps=1, and you used 2 GPUs for the training, then your actual batch size would be 256.

EDIT 2:

Yes, according to the response I received in the HF forum, the actual batch size should be 256 if you used 2 GPUs with the current training script. I think this comment can be ignored, sorry.

EDIT 3:

In https://github.com/sunnweiwei/GenRet/issues/3#issuecomment-1902629598 you mentioned that you used 8 GPUs. Was your actual batch size then 128 * 8 = 1024? If I'm not wrong, using 8 GPUs with an actual batch size of 256 would mean a batch size of 16 per GPU.
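
Just to make my arithmetic explicit (this is my assumption of how Accelerate counts the effective batch size, not something taken from your code):

```python
# Effective batch size = per-device batch size x number of processes (GPUs)
#                        x gradient accumulation steps (assumption, see the HF links above).
def effective_batch_size(per_device_batch_size, num_gpus, grad_accum_steps=1):
    return per_device_batch_size * num_gpus * grad_accum_steps

print(effective_batch_size(128, 2))  # 256  -> the 2-GPU scenario from the previous edit
print(effective_batch_size(128, 8))  # 1024 -> 8 GPUs with 128 per device
print(effective_batch_size(16, 8))   # 256  -> what 8 GPUs would need per device to total 256
```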

cgr71ii commented 6 months ago

Hi!

I think I found another problem in the training code. When you run config['prev_id'] = f'{checkpoint}.code' if checkpoint is not None else None in the main training loop, the checkpoint references the model from the previous training iteration (i.e., if loop=0 there are no IDs because return_code_logits = None, but config['code_length'] = loop + 1 = 1; if loop=1 you load IDs of length 1 and config['code_length'] = loop + 1 = 2, ...). This means that we always have config['code_length'] = loop + 1 and it works well, but when the main loop finishes and the final training step starts, the line add_last(f'{checkpoint}.code', args.code_num, f'{checkpoint}.code.last') adds an additional ID and we break the config['code_length'] = loop + 1 relation: we end up with config['code_length'] = loop, a case the code does not handle. This raises an error in the line query_code_loss = F.cross_entropy(query_outputs.code_logits.view(-1, code_number), batch['ids'][:, 1:].reshape(-1)), since the shapes differ between query_outputs.code_logits.view(-1, code_number) (I'm using a batch size of 200, so this tensor ends up with 600 rows) and batch['ids'][:, 1:].reshape(-1) (800 elements); the snippet below reproduces the mismatch with dummy tensors. Do you think this is a bug, or did I perhaps do something wrong?
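
A self-contained reproduction with dummy tensors (not your model outputs; batch size 200, code_number 512, and the leading token of ids stands for whatever is stripped by [:, 1:]): the model emits code_logits for 3 positions while the docids loaded after add_last contain 4 code tokens.

```python
import torch
import torch.nn.functional as F

batch_size, code_number = 200, 512
code_logits = torch.randn(batch_size, 3, code_number)   # logits for 3 decoded positions
ids = torch.randint(0, code_number, (batch_size, 5))    # leading token + 4 code tokens

try:
    F.cross_entropy(code_logits.view(-1, code_number),  # 600 rows
                    ids[:, 1:].reshape(-1))             # 800 targets
except ValueError as e:
    print(e)  # Expected input batch_size (600) to match target batch_size (800).
```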

EDIT:

For this problem I've found 2 different fixes; I don't know whether they actually solve the underlying issue, but they are:

  1. Modify loop = args.max_length to loop = args.max_length + 1. I don't think this "solution" is the correct one, because it is not "natural" with respect to how the loop variable has evolved through the for loop. With this change, the values taken by loop would be 0, 1, 2 and 4 instead of 0, 1, 2 and 3.
  2. Remove the line add_last(f'{checkpoint}.code', args.code_num, f'{checkpoint}.code.last') and change config['prev_id'] = f'{checkpoint}.code.last' to config['prev_id'] = f'{checkpoint}.code'. This changes the code so that we no longer have 4 IDs per document/query but 3, which is the value described in the original paper. Why should this line be executed? What is its purpose? In fact, this add_last function adds a 4th value to the docids which is not semantic but arbitrary, since it counts how many times the specific docid has been assigned to different documents (or queries, I'm not sure which) and then applies modulo 512 so that the value stays in the range [0, 512).

Both options solve the error that arises from the situation I explained in the original content of this comment. The problem is that I don't know how these changes affect the training.

EDIT 2:

Sorry, I said that the length of the docids is 3 according to the paper, but that is only the example from Figure 1. Anyway, the paper states that "K = 512 for all datasets, with the length of the docid M being dependent on the number of documents present". Am I wrong in thinking that --max_length specifies the length of the docid? In that case, even if a value different from 3 (the default) is configured, I think the problem described in this comment still applies. If --max_length is the M described in the paper, then whenever add_last is executed the docid files will contain 4 numbers instead of 3.

sunnweiwei commented 6 months ago

Hi,

In https://github.com/sunnweiwei/GenRet/issues/7 you mentioned that you trained using a multi-GPU (single node, I guess) configuration, with the exception of the testing code (am I wrong in thinking that this part only affects the test and test_dr functions?). I managed to separate the testing part from the multi-GPU training using accelerator.is_local_main_process, but the method safe_save kept returning None for every process except the main one (this caused the loading of the different components to fail for all processes but the main one). I think this is a bug. Assuming it is, I managed to fix it using accelerate.utils.broadcast_object_list.

Yes, I train on a single node. In multi-GPU training, I simply test on a separate process that uses only one GPU. I have not checked accelerate.utils.broadcast_object_list, but this may be a better solution.

In the function train you set epoch_step = len(data_loader) // in_batch_size. If I'm not wrong, you manually split your dataset in BiDataset using in_batch_size, and use the variable batch_size with the PyTorch data loader. I think epoch_step is supposed to contain the number of steps per epoch, but with respect to batch_size, not in_batch_size, because we are iterating over the data loader. In my debugging runs I observed that not all the steps of an epoch are executed, only part of them. Specifically, only len(data_loader) // in_batch_size steps run, which does not cover 100% of the dataset (the higher in_batch_size, the fewer steps are executed). If I'm right about this bug, it does not affect your conclusions, but your results might be even better because not all the training data is being used. The fix I applied, assuming this is a bug, is len(data_loader) // batch_size, which, since batch_size = 1, basically means that you see all the data batches.

First of all, I define 1 epoch as the model going through len(data_loader) data samples. Yes, for each epoch, the data_loader loop runs only len(data_loader) // in_batch_size times instead of len(data_loader) times. This is because in the __getitem__ method of BiDataset I sample in_batch_size samples from the whole training dataset, and thus output in_batch_size samples instead of 1 sample. Then 1 epoch should loop through the data_loader len(data_loader) // in_batch_size times to get len(data_loader) samples, not len(data_loader) times. Does this mean not all data are used for training? I believe the answer is no. Because the data_loader is set to shuffle=True and each loop randomly samples in_batch_size samples from the whole dataset, we can say that for each epoch, len(data_loader) samples are randomly drawn from the whole training set. For multi-GPU training (with gradient accumulation), the number len(data_loader) // in_batch_size should still work. To sum up, in_batch_size is the real batch size.
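
Roughly, the sampling scheme looks like this (a simplified toy version, not the actual BiDataset code): each __getitem__ call already returns a full batch of in_batch_size random samples, so the DataLoader runs with batch_size=1.

```python
import random
from torch.utils.data import DataLoader, Dataset


class ToyBiDataset(Dataset):
    def __init__(self, data, in_batch_size):
        self.data = data
        self.in_batch_size = in_batch_size

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        # Ignore `index`: draw in_batch_size random samples from the whole dataset.
        return [random.choice(self.data) for _ in range(self.in_batch_size)]


data = list(range(10_000))
in_batch_size = 256
loader = DataLoader(ToyBiDataset(data, in_batch_size), batch_size=1, shuffle=True,
                    collate_fn=lambda items: items[0])

# One "epoch" = len(data_loader) samples, reached after len(data_loader) // in_batch_size steps.
epoch_step = len(loader) // in_batch_size
print(len(loader), epoch_step, epoch_step * in_batch_size)  # 10000 39 9984
```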

sunnweiwei commented 6 months ago

I think I found another problem in the training code. When you run config['prev_id'] = f'{checkpoint}.code' if checkpoint is not None else None in the main training loop, the checkpoint references the model from the previous training iteration (i.e., if loop=0 there are no IDs because return_code_logits = None, but config['code_length'] = loop + 1 = 1; if loop=1 you load IDs of length 1 and config['code_length'] = loop + 1 = 2, ...). This means that we always have config['code_length'] = loop + 1 and it works well, but when the main loop finishes and the final training step starts, the line add_last(f'{checkpoint}.code', args.code_num, f'{checkpoint}.code.last') adds an additional ID and we break the config['code_length'] = loop + 1 relation: we end up with config['code_length'] = loop, a case the code does not handle. This raises an error in the line query_code_loss = F.cross_entropy(query_outputs.code_logits.view(-1, code_number), batch['ids'][:, 1:].reshape(-1)), since the shapes differ between query_outputs.code_logits.view(-1, code_number) (I'm using a batch size of 200, so this tensor ends up with 600 rows) and batch['ids'][:, 1:].reshape(-1) (800 elements). Do you think this is a bug, or did I perhaps do something wrong?

Sorry, I said that the length of the docids is 3 according to the paper, but that is only the example from Figure 1. Anyway, the paper states that "K = 512 for all datasets, with the length of the docid M being dependent on the number of documents present". Am I wrong in thinking that --max_length specifies the length of the docid? In that case, even if a value different from 3 (the default) is configured, I think the problem described in this comment still applies. If --max_length is the M described in the paper, then whenever add_last is executed the docid files will contain 4 numbers instead of 3.

I am not sure if the problem arises from your modification of the batch size. I add an extra ID to distinguish docs with the same IDs. This extra ID may be helpful when many documents have conflicting IDs, although ideally the conflict rate should be low. Besides learning the extra ID, the final step of training is still necessary as it trains the model to memorize the first 3 semantic IDs in a teacher-forcing manner (using code_loss). Without this step, the 3rd ID would only be trained with ce_loss, which provides a dynamic target based on the model's on-the-fly predictions instead of a stable target as in code_loss. Also, I find that training more steps in a teacher-forcing manner always yields better results. Therefore, even though we want the ID length to be 3, the model still needs to be trained with the final step, and an extra 4th ID is added for the sake of code consistency.
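
Conceptually, the extra ID is assigned roughly like this (a simplified sketch of the idea, not the actual add_last implementation, which reads and writes the .code files):

```python
from collections import defaultdict


def add_last_sketch(doc_codes, code_num):
    """Append a non-semantic 4th token that disambiguates documents sharing
    the same 3 semantic IDs, kept in [0, code_num) via the modulo."""
    counts = defaultdict(int)
    out = []
    for code in doc_codes:
        prefix = tuple(code)
        out.append(list(code) + [counts[prefix] % code_num])
        counts[prefix] += 1
    return out


codes = [[3, 41, 7], [3, 41, 7], [9, 2, 100]]
print(add_last_sketch(codes, 512))
# [[3, 41, 7, 0], [3, 41, 7, 1], [9, 2, 100, 0]]
```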

cgr71ii commented 6 months ago

Hi!

First, thank you very much for your time and responses.

Regarding my modifications to the code, I agree that I might have added some lines that change the results and generate some of the problems I identified. For that reason, I started again from a clean version of run.py and iterated until I did not obtain any error. Please find attached the file run2.py and my data files, in case you want to run the file or use the data examples I provide with run.py and either verify or reject my claims about the errors. The run2.py version has also been slightly modified to remove some warnings and to run only 1 epoch in order to obtain results faster for debugging purposes.

The diff between run.py (the original file) and run2.py (my modified version) is:

8c8
< from accelerate import Accelerator
---
> import accelerate
12c12
< from transformers.generation_utils import GenerationMixin
---
> from transformers import GenerationMixin
31a32
> import datetime
592d592
< 
596c596,597
<     if accelerator.is_local_main_process and epoch < end_epoch and epoch % save_step == 0:
---
>     last_checkpoint = ["placeholder"]
>     if accelerator.is_local_main_process:
603,604c604,606
<         last_checkpoint = f'{save_path}/{epoch}.pt'
<     return epoch + 1, last_checkpoint
---
>         last_checkpoint = [f'{save_path}/{epoch}.pt']
> 
>     accelerator.wait_for_everyone()
605a608,611
>     last_checkpoint = accelerate.utils.broadcast_object_list(last_checkpoint, from_process=0)
>     last_checkpoint = last_checkpoint[0]
> 
>     return last_checkpoint
617c623,626
<     accelerator = Accelerator(gradient_accumulation_steps=1)
---
>     kwargs = []
>     kwargs.append(accelerate.InitProcessGroupKwargs(timeout=datetime.timedelta(days=30))) # Wait the eval!
>     kwargs.append(accelerate.DistributedDataParallelKwargs(find_unused_parameters=False))
>     accelerator = accelerate.Accelerator(gradient_accumulation_steps=1, kwargs_handlers=kwargs)
695,696d703
<     step, epoch = 0, 0
<     epoch_step = len(data_loader) // in_batch_size
700c707
<     for _ in range(epochs):
---
>     for epoch in range(epochs):
707d713
<             step += 1
721,728c727
<                 if in_batch_size != 1 and step > (epoch + 1) * epoch_step:
<                     epoch, last_checkpoint = safe_save(accelerator, model, save_path, epoch, end_epoch=end_epoch,
<                                                        save_step=save_step,
<                                                        last_checkpoint=last_checkpoint)
<                 if epoch >= end_epoch:
<                     break
<         if in_batch_size == 1:
<             epoch = safe_save(accelerator, model, save_path, epoch, end_epoch=end_epoch, save_step=save_step)
---
>         last_checkpoint = safe_save(accelerator, model, save_path, epoch)
730c729
<     return last_checkpoint
---
>     return last_checkpoint, accelerator
828a828,829
>         model = model.cuda()
>         model.eval()
878,879c879,883
<         print(eval_all([predictions[j] for j in seen_split], [query_ids[j] for j in seen_split]))
<         print(eval_all([predictions[j] for j in unseen_split], [query_ids[j] for j in unseen_split]))
---
> 
>         if len(seen_split) > 0:
>             print(eval_all([predictions[j] for j in seen_split], [query_ids[j] for j in seen_split]))
>         if len(unseen_split) > 0:
>             print(eval_all([predictions[j] for j in unseen_split], [query_ids[j] for j in unseen_split]))
1134a1139
>     parser.add_argument('--batch_size', type=int, default=128)
1153c1158,1159
<         config['epochs'] = 1 if loop == 0 else 10
---
>         #config['epochs'] = 1 if loop == 0 else 10 # TODO use this
>         config['epochs'] = 1 if loop == 0 else 1
1155,1156c1161,1166
<         checkpoint = train(config)
<         test_dr(config)
---
>         checkpoint, accelerator = train(config)
> 
>         if accelerator.is_local_main_process:
>             test_dr(config)
> 
>         accelerator.wait_for_everyone()
1161c1171,1172
<         config['epochs'] = 200
---
>         #config['epochs'] = 200 # TODO use this
>         config['epochs'] = 1
1163,1164c1174
<         checkpoint = train(config)
<         test_dr(config)
---
>         checkpoint, accelerator = train(config)
1166c1176,1180
<         test(config)
---
>         if accelerator.is_local_main_process:
>             test_dr(config)
>             test(config)
> 
>         accelerator.wait_for_everyone()
1172,1174c1186,1190
<     add_last(f'{checkpoint}.code', args.code_num, f'{checkpoint}.code.last')
<     config['prev_id'] = f'{checkpoint}.code.last'
<     config['epochs'] = 1000
---
>     #add_last(f'{checkpoint}.code', args.code_num, f'{checkpoint}.code.last')
>     #config['prev_id'] = f'{checkpoint}.code.last'
>     config['prev_id'] = f'{checkpoint}.code'
>     #config['epochs'] = 1000 # TODO use this
>     config['epochs'] = 1
1176,1178c1192,1200
<     checkpoint = train(config)
<     test_dr(config)
<     test(config)
---
> 
>     checkpoint, accelerator = train(config)
> 
>     if accelerator.is_local_main_process:
>         test_dr(config)
>         test(config)
> 
>     accelerator.wait_for_everyone()

The changes I added in run2.py in order to solve the problems I faced are the ones shown in the diff above.

I don't know if I'm getting these errors because I'm trying to execute run.py end-to-end instead of step by step as you mentioned doing in other issues (i.e., first the training, then the test, then training again, ...), but in my case running run.py end-to-end is not working.

Command I execute:

CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" accelerate launch run2.py --model_name "google-t5/t5-small" --code_num 512 --max_length 3 --train_data ../europarl/europarl.2000.genret.train.50perc.json --dev_data ../europarl/europarl.2000.genret.dev.50perc.json --corpus_data ../europarl/europarl.2000.genret.corpus.json --save_path out.sentences.europarl/model4

Running run2.py, I managed to finish the execution without errors.

Again, thank you.

PS: I'm aware that my data is not query/document pairs but parallel sentences. If I'm not wrong, this should not be a problem.

run2.zip europarl.2000.genret.corpus.json europarl.2000.genret.dev.50perc.json europarl.2000.genret.train.50perc.json europarl.2000.genret.train.50perc.json.qg.json

cgr71ii commented 5 months ago

When I run the original code (I only modified the call to safe_save in order to store the model once per epoch; this shouldn't affect the result) with a batch size of 100 (128 didn't fit in my GPU with the NQ320K data) on 1 GPU, to avoid the multi-GPU configuration, I get the following error:

Traceback (most recent call last):
  File "/home/cgarcia/Documents/GenRet/original_run_test_min.py", line 1193, in <module>
    main()
  File "/home/cgarcia/Documents/GenRet/original_run_test_min.py", line 1186, in main
    checkpoint = train(config)
  File "/home/cgarcia/Documents/GenRet/original_run_test_min.py", line 709, in train
    losses = OurTrainer.train_step(model, batch, gathered=False)
  File "/home/cgarcia/Documents/GenRet/original_run_test_min.py", line 420, in train_step
    query_code_loss = F.cross_entropy(query_outputs.code_logits.view(-1, code_number),
  File "/home/cgarcia/miniconda3/envs/python/lib/python3.10/site-packages/torch/nn/functional.py", line 3059, in cross_entropy
    return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
ValueError: Expected input batch_size (300) to match target batch_size (400).

This is the error I tried to explain earlier, which I believe is caused by add_last.

The command I ran is the one specified in the README:

CUDA_VISIBLE_DEVICES="0" python3 original_run_test_min.py --model_name t5-base --code_num 512 --max_length 3 --train_data dataset/nq320k/train.json --dev_data dataset/nq320k/dev.json --corpus_data dataset/nq320k/corpus_lite.json --save_path out/model

Regarding the changes I mentioned in my last comment about the batch size, I now see that it's not a bug; it is handled the way it is because the BiDataset class takes care of the batching. Then epoch_step = len(data_loader) // in_batch_size has to be used in order to break out of the loop and not see multiple epochs' worth of data in a single epoch.