If you are using distributed training, i.e., multiprocessing_distributed set to True, num_train_iter and epoch jointly determine the number of training iterations per epoch as num_train_iter // epoch. If multiprocessing_distributed is set to False, the number of training iterations per epoch is simply the length of the data loader, so num_train_iter does not need to be set manually.
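To make the two cases concrete, here is a minimal sketch (not the repository's actual code; the function and argument names are assumptions based on the config keys discussed above) of how the per-epoch iteration count is derived in each mode:

```python
def iters_per_epoch(multiprocessing_distributed, num_train_iter, epoch, train_loader):
    """Illustrative only: mirrors the rule described above."""
    if multiprocessing_distributed:
        # Distributed training: num_train_iter and epoch jointly fix the count.
        return num_train_iter // epoch
    # Non-distributed training: one pass over the data loader per epoch,
    # so num_train_iter does not need to be set manually.
    return len(train_loader)
```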
Thank you for the clarification. Upon reviewing the code, I found that when not using distributed training, i.e., with multiprocessing_distributed set to False, the get_data_loader() method in build.py calculates num_samples = num_train_iter // epoch * batch_size by default, which defaults to 1024*64. This determines how many samples the dataloader's sampler draws, i.e., the number of samples actually seen in each epoch during training. If num_samples exceeds the size of the specified dataset, samples will be drawn repeatedly. Am I understanding this correctly?
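For illustration, here is a minimal sketch of the effect described above, assuming the sampler behaves like PyTorch's RandomSampler with replacement and a fixed num_samples (the concrete config values are placeholders, not the library's defaults beyond the 1024*64 product mentioned above):

```python
import torch
from torch.utils.data import DataLoader, RandomSampler, TensorDataset

batch_size = 64
num_train_iter = 1024 * 1024   # placeholder config value
epoch = 1024                   # placeholder config value

# num_samples as discussed above: num_train_iter // epoch * batch_size = 1024 * 64
num_samples = num_train_iter // epoch * batch_size

# A small dataset of only 1,000 items.
dataset = TensorDataset(torch.arange(1000).float())

# If num_samples exceeds len(dataset), sampling with replacement revisits
# items within a single epoch -- the repeated sampling asked about above.
sampler = RandomSampler(dataset, replacement=True, num_samples=num_samples)
loader = DataLoader(dataset, batch_size=batch_size, sampler=sampler)

print(len(loader))  # 1024 batches per epoch, regardless of len(dataset)
```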
About configs
I would like to customize which datasets are used and how long the model trains. Could you please tell me how to determine some of the hyperparameters in the config file, such as the relationship between the number of epochs, num_train_iter, num_eval_iter, and batch size? I'm sorry, I couldn't find any relevant explanation. Thank you!