**Open** · JakobLindscheid opened this issue 2 months ago
Hi @JakobLindscheid!
Thanks for the detailed issue!
I was not aware of this issue as I never checked the data loading speed in my experiments.
Can I check this on my end and get back to you soon?
Sure, thank you for having a look!
For now, I added a `data = list(data)` before the instance splitter is applied. This forces all the transformations to run before training starts. It's obviously not the nicest solution, since training now takes a few minutes to start, but the total training time improves a lot.
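In code, the workaround looks roughly like this (a sketch; `transformation`, `training_data`, and `instance_splitter` are illustrative names standing in for the corresponding objects in the pretraining script):

```python
# Sketch of the workaround: materialise the expensive per-series transforms
# once, before the (cheap, random) instance splitter is applied.
data = transformation.apply(training_data, is_train=True)  # lazy per-series transforms
data = list(data)  # force every transformation to run once, up front
training_instances = instance_splitter.apply(data, is_train=True)  # stays lazy/random
```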
That's useful to know, thanks for sharing.
Hi, thank you for publishing the pretraining and finetuning scripts! They are really helpful. For a university project, we are trying to reproduce the results from the paper. However, when running the pretraining script we observe very slow training speeds (~1 minute per epoch) on our hardware. Running the PyTorch profiler for 16 training batches, we see the following:
FIT Profiler Report (relevant lines)
| Action | Mean duration (s) | Num calls | Total time (s) | Percentage % |
| --- | --- | --- | --- | --- |
| Total | - | 1397 | 99.734 | 100 |
| run_training_epoch | 91.657 | 1 | 91.657 | 91.901 |
| [_TrainingEpochLoop].train_dataloader_next | 5.2931 | 16 | 84.689 | 84.915 |
| [_EvaluationLoop].val_next | 0.246 | 19 | 4.674 | 4.6865 |
| [LightningModule]LagLlamaLightningModule.optimizer_step | 0.10731 | 16 | 1.717 | 1.7216 |
| run_training_batch | 0.10731 | 16 | 1.717 | 1.7216 |
| [Strategy]SingleDeviceStrategy.training_step | 0.091875 | 16 | 1.47 | 1.4739 |
| [Strategy]SingleDeviceStrategy.validation_step | 0.044368 | 19 | 0.843 | 0.84525 |
| [Strategy]SingleDeviceStrategy.backward | 0.0135 | 16 | 0.216 | 0.21658 |
| [Callback]ModelCheckpoint{'monitor': None, 'mode': 'min', 'every_n_train_steps': 0, 'every_n_epochs': 1, 'train_time_interval': None}.on_train_epoch_end | 0.141 | 1 | 0.141 | 0.14138 |
| [Callback]ModelCheckpoint{'monitor': 'val_loss', 'mode': 'min', 'every_n_train_steps': 0, 'every_n_epochs': 1, 'train_time_interval': None}.on_train_epoch_end | 0.093 | 1 | 0.093 | 0.093248 |
| [LightningModule]LagLlamaLightningModule.transfer_batch_to_device | 0.0022286 | 35 | 0.078 | 0.078208 |
| [Strategy]SingleDeviceStrategy.batch_to_device | 0.0022286 | 35 | 0.078 | 0.078208 |
| [LightningModule]LagLlamaLightningModule.on_validation_model_train | 0.008 | 2 | 0.016 | 0.016043 |
| [Callback]ModelSummary.on_fit_start | 0.015 | 1 | 0.015 | 0.01504 |
| [Callback]TQDMProgressBar.on_validation_batch_end | 0.00078947 | 19 | 0.015 | 0.01504 |
| [LightningModule]LagLlamaLightningModule.optimizer_zero_grad | 0.0009375 | 16 | 0.015 | 0.01504 |

Apparently the data loader needs over 5 seconds for each batch, which is about 85% of the total training time. After some further investigation, we found that the train data loader does the following:
For each item in a batch, a time series is sampled and the full transformation chain is applied to the entire series; the instance splitter then keeps only a single training window, so most of the transformed data is never used. We observed ~10 ms for transforming a full time series, and with a batch size of 512 this adds up to the >5 seconds per batch reported by the profiler.
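For anyone who wants to reproduce the measurement without the full profiler, a minimal timing sketch (`train_data` and `transformation` are illustrative names, not the script's actual identifiers):

```python
import time

# Time the full transformation chain on a single raw series.
entry = next(iter(train_data))
t0 = time.perf_counter()
list(transformation.apply([entry], is_train=True))
per_series_ms = (time.perf_counter() - t0) * 1e3
print(f"transforming one full series: {per_series_ms:.1f} ms")

# ~10 ms per series * batch size 512 ≈ 5 s per batch, which matches the
# train_dataloader_next time in the profiler report above.
print(f"projected cost per batch of 512: {per_series_ms * 512 / 1e3:.1f} s")
```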
The order of execution is largely dictated by the gluonts package, so I am not aware of an obvious solution that does not involve changes there.
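To illustrate what I mean (a simplified sketch of the behaviour, not gluonts' actual code): each transformation lazily wraps the previous iterator, so the heavy per-series work is recomputed from scratch for every sample the loader draws.

```python
# Simplified sketch of the lazy execution order (not the real gluonts code).
def lazy_chain(transforms, dataset, is_train=True):
    data = dataset
    for t in transforms:
        data = t.apply(data, is_train)  # wraps another lazy generator, no caching
    return data

# With batch_size samples per batch, the full chain (including the transforms
# over the entire series) therefore executes batch_size times per batch.
```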
Now to my questions: did you face the same issue during your experiments, and how can we solve the problem we observe?