**Open** · JakobLindscheid opened this issue 2 months ago
Hi @JakobLindscheid!
Thanks for the detailed issue!
I was not aware of this issue as I never checked the data loading speed in my experiments.
Can I check this on my end and get back to you soon?
Sure, thank you for having a look!
For now, I added a `data = list(data)` before the instance splitter is applied. This forces all the transformations to run before training starts. It's obviously not the nicest solution, since training now takes a few minutes to start, but the total training time improves a lot.
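In code, the workaround looks roughly like this (a sketch; `transformation`, `training_data`, and `instance_splitter` are illustrative names standing in for the corresponding objects in the pretraining script):

```python
# Sketch of the workaround: materialise the expensive per-series transforms
# once, before the (cheap, random) instance splitter is applied.
data = transformation.apply(training_data, is_train=True)  # lazy per-series transforms
data = list(data)  # force every transformation to run once, up front
training_instances = instance_splitter.apply(data, is_train=True)  # stays lazy/random
```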
That's useful to know, thanks for sharing.
Hi, thank you for publishing the pretraining and finetuning scripts! They are really helpful. For a university project, we are trying to reproduce the results from the paper. However, when running the pretraining script we observe very slow training speeds (~1 minute per epoch) on our hardware. Running the PyTorch profiler for 16 training batches, we see the following:
FIT Profiler Report (relevant lines)
| Action | Mean duration (s) | Num calls | Total time (s) | Percentage % |
| --- | --- | --- | --- | --- |
| Total | - | 1397 | 99.734 | 100 |
| run_training_epoch | 91.657 | 1 | 91.657 | 91.901 |
| [_TrainingEpochLoop].train_dataloader_next | 5.2931 | 16 | 84.689 | 84.915 |
| [_EvaluationLoop].val_next | 0.246 | 19 | 4.674 | 4.6865 |
| [LightningModule]LagLlamaLightningModule.optimizer_step | 0.10731 | 16 | 1.717 | 1.7216 |
| run_training_batch | 0.10731 | 16 | 1.717 | 1.7216 |
| [Strategy]SingleDeviceStrategy.training_step | 0.091875 | 16 | 1.47 | 1.4739 |
| [Strategy]SingleDeviceStrategy.validation_step | 0.044368 | 19 | 0.843 | 0.84525 |
| [Strategy]SingleDeviceStrategy.backward | 0.0135 | 16 | 0.216 | 0.21658 |
| [Callback]ModelCheckpoint{'monitor': None, 'mode': 'min', 'every_n_train_steps': 0, 'every_n_epochs': 1, 'train_time_interval': None}.on_train_epoch_end | 0.141 | 1 | 0.141 | 0.14138 |
| [Callback]ModelCheckpoint{'monitor': 'val_loss', 'mode': 'min', 'every_n_train_steps': 0, 'every_n_epochs': 1, 'train_time_interval': None}.on_train_epoch_end | 0.093 | 1 | 0.093 | 0.093248 |
| [LightningModule]LagLlamaLightningModule.transfer_batch_to_device | 0.0022286 | 35 | 0.078 | 0.078208 |
| [Strategy]SingleDeviceStrategy.batch_to_device | 0.0022286 | 35 | 0.078 | 0.078208 |
| [LightningModule]LagLlamaLightningModule.on_validation_model_train | 0.008 | 2 | 0.016 | 0.016043 |
| [Callback]ModelSummary.on_fit_start | 0.015 | 1 | 0.015 | 0.01504 |
| [Callback]TQDMProgressBar.on_validation_batch_end | 0.00078947 | 19 | 0.015 | 0.01504 |
| [LightningModule]LagLlamaLightningModule.optimizer_zero_grad | 0.0009375 | 16 | 0.015 | 0.01504 |

Apparently the data loader needs over 5 seconds for each batch, which is about 85% of the total training time. After some further investigation, we found that the train data loader does the following:
For each item in a batch, a time series is sampled and the full transformation chain is applied to the entire series; the instance splitter then keeps only a single training window, so most of the transformed data is never used. We observed ~10 ms for transforming a full time series, and with a batch size of 512 this adds up to the >5 seconds per batch reported by the profiler.
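For anyone who wants to reproduce the measurement without the full profiler, a minimal timing sketch (`train_data` and `transformation` are illustrative names, not the script's actual identifiers):

```python
import time

# Time the full transformation chain on a single raw series.
entry = next(iter(train_data))
t0 = time.perf_counter()
list(transformation.apply([entry], is_train=True))
per_series_ms = (time.perf_counter() - t0) * 1e3
print(f"transforming one full series: {per_series_ms:.1f} ms")

# ~10 ms per series * batch size 512 ≈ 5 s per batch, which matches the
# train_dataloader_next time in the profiler report above.
print(f"projected cost per batch of 512: {per_series_ms * 512 / 1e3:.1f} s")
```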
The order of execution is largely dictated by the gluonts package, so I am not aware of an obvious solution that does not involve changes there.
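To illustrate what I mean (a simplified sketch of the behaviour, not gluonts' actual code): each transformation lazily wraps the previous iterator, so the heavy per-series work is recomputed from scratch for every sample the loader draws.

```python
# Simplified sketch of the lazy execution order (not the real gluonts code).
def lazy_chain(transforms, dataset, is_train=True):
    data = dataset
    for t in transforms:
        data = t.apply(data, is_train)  # wraps another lazy generator, no caching
    return data

# With batch_size samples per batch, the full chain (including the transforms
# over the entire series) therefore executes batch_size times per batch.
```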
Now to my questions: did you face the same issue during your experiments, and how can we solve the problem we observe?