mbreuss closed this issue 9 months ago.
Hi @mbreuss, did you maybe build the shared memory from a smaller debug dataset before? Try deleting the shared memory files in `/dev/shm/`; they are called `/dev/shm/train_*` and `/dev/shm/val_*`. Also delete `train_shm_lookup.npy` and `val_shm_lookup.npy` in the tmp or slurm_temp directory (see here).
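The cleanup above can be sketched as a small shell helper. The file names come from this thread; the `clean_shm` function and the demo directory are hypothetical, so adapt the paths to your machine:

```shell
# Hypothetical helper: delete stale shared-memory buffers and lookup files.
# arg 1: the shm directory (normally /dev/shm)
# arg 2: the tmp or slurm_temp directory holding the lookup .npy files
clean_shm() {
    rm -f "$1"/train_* "$1"/val_*
    rm -f "$2"/train_shm_lookup.npy "$2"/val_shm_lookup.npy
}

# Demo on a throwaway directory; on the real machine you would call
# something like: clean_shm /dev/shm "$TMPDIR"
demo=$(mktemp -d)
mkdir -p "$demo/shm" "$demo/tmp"
touch "$demo/shm/train_episodes" "$demo/shm/val_episodes" \
      "$demo/tmp/train_shm_lookup.npy"
clean_shm "$demo/shm" "$demo/tmp"
```

After the call, the `train_*`/`val_*` buffers and the lookup files are gone, so the next training run rebuilds the shared memory from scratch.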
It's weird that it takes so long even without the shared memory; there definitely seems to be something wrong. Shared memory gave us a speed-up of at most 50%. From where do you load the dataset? Is it maybe via a slow network mount? Try running the PyTorch Lightning profiler to see where the bottleneck is; you can also post the results here and I'll compare them to our cluster. For debugging, you might want to use the Hydra command line flags `+trainer.limit_train_batches=10` and `+trainer.limit_val_batches=10` (or a similar number); then you don't have to wait for the whole epoch, but you still get an estimate of the time.
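A minimal sketch of such a debug invocation, assuming the `hulc/training.py` entry point used elsewhere in this thread and that the trainer config accepts a `profiler` override (both assumptions). This just assembles and prints the command rather than running it:

```shell
# Assemble a short profiling run: cap each epoch at 10 batches and enable
# the PyTorch Lightning simple profiler. The entry-point path and the
# +trainer.profiler override are assumptions; adapt them to your checkout.
CMD="python hulc/training.py \
  +trainer.limit_train_batches=10 \
  +trainer.limit_val_batches=10 \
  +trainer.profiler=simple"
echo "$CMD"
```

With the batch limits in place, the profiler report is available after a couple of minutes instead of a full epoch.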
Hi,
thanks for the tips. This is the result of using the simple profiler with 10 train and validation batches:
FIT Profiler Report
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Action | Mean duration (s) | Num calls | Total time (s) | Percentage % |
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Total | - | 1981 | 163.14 | 100 % |
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| run_training_epoch | 67.911 | 2 | 135.82 | 83.255 |
| [Strategy]DDPStrategy.batch_to_device | 0.66106 | 42 | 27.764 | 17.019 |
| run_training_batch | 0.62769 | 20 | 12.554 | 7.6951 |
| [LightningModule]GCBC.optimizer_step | 0.62632 | 20 | 12.526 | 7.6784 |
| [Strategy]DDPStrategy.backward | 0.578 | 20 | 11.56 | 7.0859 |
| [Callback]ModelCheckpoint{'monitor': None, 'mode': 'min', 'every_n_train_steps': 0, 'every_n_epochs': 1, 'train_time_interval': None, 'save_on_train_epoch_end': True}.on_train_epoch_end | 1.6921 | 2 | 3.3842 | 2.0744 |
| [Strategy]DDPStrategy.validation_step | 0.096068 | 22 | 2.1135 | 1.2955 |
| [Strategy]DDPStrategy.training_step | 0.038656 | 20 | 0.77312 | 0.4739 |
| on_train_batch_end | 0.0028381 | 20 | 0.056762 | 0.034793 |
| [LightningModule]GCBC.on_fit_start | 0.05676 | 1 | 0.05676 | 0.034792 |
| [Callback]TQDMProgressBar.on_validation_batch_end | 0.0023747 | 22 | 0.052243 | 0.032023 |
| [LightningModule]GCBC.optimizer_zero_grad | 0.00074818 | 20 | 0.014964 | 0.0091722 |
| [Callback]TQDMProgressBar.on_validation_batch_start | 0.00048066 | 22 | 0.010575 | 0.0064819 |
| [LightningModule]GCBC.on_validation_epoch_start | 0.0019881 | 3 | 0.0059643 | 0.0036559 |
| [Callback]TQDMProgressBar.on_train_epoch_end | 0.002866 | 2 | 0.0057319 | 0.0035135 |
| [LightningModule]GCBC.on_validation_epoch_end | 0.0017426 | 3 | 0.0052278 | 0.0032045 |
| on_train_batch_start | 0.0002309 | 20 | 0.004618 | 0.0028307 |
| [Callback]TQDMProgressBar.on_validation_end | 0.001477 | 3 | 0.004431 | 0.002716 |
| [LightningModule]GCBC.lr_scheduler_step | 0.00020036 | 20 | 0.0040073 | 0.0024563 |
| [LightningModule]GCBC.configure_optimizers | 0.0032249 | 1 | 0.0032249 | 0.0019768 |
| [LightningModule]GCBC.on_validation_model_train | 0.00097004 | 3 | 0.0029101 | 0.0017838 |
| [LightningModule]GCBC.on_train_epoch_end | 0.0013038 | 2 | 0.0026076 | 0.0015984 |
| [LightningModule]GCBC.on_validation_model_eval | 0.00085685 | 3 | 0.0025706 | 0.0015757 |
| [Callback]ModelSummary.on_fit_start | 0.0022834 | 1 | 0.0022834 | 0.0013997 |
| [Callback]TQDMProgressBar.on_train_epoch_start | 0.0010873 | 2 | 0.0021746 | 0.001333 |
| [Callback]TQDMProgressBar.on_validation_start | 0.00060386 | 3 | 0.0018116 | 0.0011104 |
| [LightningModule]GCBC.on_train_epoch_start | 0.00062183 | 2 | 0.0012437 | 0.00076232 |
| [Callback]TQDMProgressBar.on_train_end | 0.00070996 | 1 | 0.00070996 | 0.00043519 |
| [Callback]ModelSummary.on_validation_batch_end | 3.1038e-05 | 22 | 0.00068284 | 0.00041856 |
| [Callback]TQDMProgressBar.on_sanity_check_start | 0.00064962 | 1 | 0.00064962 | 0.0003982 |
| [Callback]KLConstantSchedule.on_validation_batch_start | 2.4464e-05 | 22 | 0.0005382 | 0.0003299 |
| [Callback]KLConstantSchedule.on_batch_start | 1.8397e-05 | 20 | 0.00036793 | 0.00022553 |
| [Callback]KLConstantSchedule.on_after_backward | 1.7703e-05 | 20 | 0.00035407 | 0.00021703 |
| [Callback]TQDMProgressBar.on_train_start | 0.00033575 | 1 | 0.00033575 | 0.00020581 |
| [Callback]ModelCheckpoint{'monitor': None, 'mode': 'min', 'every_n_train_steps': 0, 'every_n_epochs': 1, 'train_time_interval': None, 'save_on_train_epoch_end': True}.on_validation_end | 9.2307e-05 | 3 | 0.00027692 | 0.00016974 |
| [Callback]KLConstantSchedule.on_before_optimizer_step | 1.1763e-05 | 20 | 0.00023526 | 0.00014421 |
| [Callback]SignalCallback.on_validation_batch_start | 8.7329e-06 | 22 | 0.00019212 | 0.00011777 |
| [Callback]ModelSummary.on_validation_batch_start | 8.3075e-06 | 22 | 0.00018276 | 0.00011203 |
| [Callback]KLConstantSchedule.on_before_zero_grad | 8.9151e-06 | 20 | 0.0001783 | 0.00010929 |
| [Callback]SignalCallback.on_fit_start | 0.00017511 | 1 | 0.00017511 | 0.00010734 |
| [Callback]KLConstantSchedule.on_validation_batch_end | 7.3507e-06 | 22 | 0.00016172 | 9.9127e-05 |
| [LightningModule]GCBC.on_before_backward | 7.4764e-06 | 20 | 0.00014953 | 9.1656e-05 |
| [Callback]KLConstantSchedule.on_batch_end | 7.4612e-06 | 20 | 0.00014922 | 9.147e-05 |
| [Callback]SignalCallback.on_after_backward | 7.2972e-06 | 20 | 0.00014594 | 8.946e-05 |
| [Callback]KLConstantSchedule.on_before_backward | 6.6953e-06 | 20 | 0.00013391 | 8.208e-05 |
| [Callback]LearningRateMonitor.on_validation_batch_start | 5.6677e-06 | 22 | 0.00012469 | 7.643e-05 |
| [Callback]SignalCallback.on_before_optimizer_step | 5.9794e-06 | 20 | 0.00011959 | 7.3303e-05 |
| [Callback]SignalCallback.on_validation_batch_end | 5.4276e-06 | 22 | 0.00011941 | 7.3193e-05 |
| [Callback]LearningRateMonitor.on_after_backward | 5.7726e-06 | 20 | 0.00011545 | 7.0769e-05 |
| [Callback]ModelCheckpoint{'monitor': None, 'mode': 'min', 'every_n_train_steps': 0, 'every_n_epochs': 1, 'train_time_interval': None, 'save_on_train_epoch_end': True}.on_after_backward | 5.7529e-06 | 20 | 0.00011506 | 7.0528e-05 |
| [Callback]GradientAccumulationScheduler.on_validation_batch_end | 5.2079e-06 | 22 | 0.00011457 | 7.023e-05 |
| [Callback]LearningRateMonitor.on_train_start | 0.00011443 | 1 | 0.00011443 | 7.0145e-05 |
| [Callback]GradientAccumulationScheduler.on_after_backward | 5.6747e-06 | 20 | 0.00011349 | 6.9568e-05 |
| [Callback]ModelCheckpoint{'monitor': None, 'mode': 'min', 'every_n_train_steps': 0, 'every_n_epochs': 1, 'train_time_interval': None, 'save_on_train_epoch_end': True}.on_validation_batch_start | 5.1261e-06 | 22 | 0.00011277 | 6.9127e-05 |
| [Callback]LearningRateMonitor.on_before_optimizer_step | 5.4161e-06 | 20 | 0.00010832 | 6.6398e-05 |
| [Callback]SignalCallback.on_batch_start | 5.4023e-06 | 20 | 0.00010805 | 6.6229e-05 |
| [Callback]TQDMProgressBar.on_after_backward | 5.3818e-06 | 20 | 0.00010764 | 6.5977e-05 |
| [Callback]SignalCallback.on_batch_end | 5.2926e-06 | 20 | 0.00010585 | 6.4885e-05 |
| [Callback]ModelSummary.on_before_optimizer_step | 5.2362e-06 | 20 | 0.00010472 | 6.4192e-05 |
| [Callback]ModelCheckpoint{'monitor': None, 'mode': 'min', 'every_n_train_steps': 0, 'every_n_epochs': 1, 'train_time_interval': None, 'save_on_train_epoch_end': True}.on_before_optimizer_step | 5.2008e-06 | 20 | 0.00010402 | 6.3759e-05 |
| [Callback]TQDMProgressBar.on_before_optimizer_step | 5.1992e-06 | 20 | 0.00010398 | 6.374e-05 |
| [Callback]ModelSummary.on_after_backward | 5.1827e-06 | 20 | 0.00010365 | 6.3536e-05 |
| [Callback]ModelCheckpoint{'monitor': None, 'mode': 'min', 'every_n_train_steps': 0, 'every_n_epochs': 1, 'train_time_interval': None, 'save_on_train_epoch_end': True}.on_validation_batch_end | 4.688e-06 | 22 | 0.00010314 | 6.3219e-05 |
| [Callback]ModelCheckpoint{'monitor': None, 'mode': 'min', 'every_n_train_steps': 0, 'every_n_epochs': 1, 'train_time_interval': None, 'save_on_train_epoch_end': True}.on_batch_end | 5.1272e-06 | 20 | 0.00010254 | 6.2856e-05 |
| [Callback]GradientAccumulationScheduler.on_before_optimizer_step | 5.0483e-06 | 20 | 0.00010097 | 6.189e-05 |
| [Callback]GradientAccumulationScheduler.on_batch_start | 5.0287e-06 | 20 | 0.00010057 | 6.1648e-05 |
| [Callback]LearningRateMonitor.on_batch_start | 4.8955e-06 | 20 | 9.7911e-05 | 6.0016e-05 |
| [Callback]GradientAccumulationScheduler.on_batch_end | 4.8606e-06 | 20 | 9.7212e-05 | 5.9588e-05 |
| [Callback]LearningRateMonitor.on_validation_batch_end | 4.3611e-06 | 22 | 9.5945e-05 | 5.8811e-05 |
| [Callback]SignalCallback.on_before_backward | 4.6392e-06 | 20 | 9.2783e-05 | 5.6873e-05 |
| [Callback]LearningRateMonitor.on_batch_end | 4.628e-06 | 20 | 9.2561e-05 | 5.6737e-05 |
| [Callback]GradientAccumulationScheduler.on_validation_batch_start | 4.1629e-06 | 22 | 9.1583e-05 | 5.6138e-05 |
| [Callback]SignalCallback.on_before_zero_grad | 4.3882e-06 | 20 | 8.7764e-05 | 5.3797e-05 |
| [Callback]ModelSummary.on_validation_end | 2.9251e-05 | 3 | 8.7754e-05 | 5.3791e-05 |
| [Callback]ModelCheckpoint{'monitor': None, 'mode': 'min', 'every_n_train_steps': 0, 'every_n_epochs': 1, 'train_time_interval': None, 'save_on_train_epoch_end': True}.on_batch_start | 4.2837e-06 | 20 | 8.5673e-05 | 5.2515e-05 |
| [Callback]ModelSummary.on_batch_end | 4.2771e-06 | 20 | 8.5542e-05 | 5.2435e-05 |
| [Callback]ModelCheckpoint{'monitor': None, 'mode': 'min', 'every_n_train_steps': 0, 'every_n_epochs': 1, 'train_time_interval': None, 'save_on_train_epoch_end': True}.on_before_zero_grad | 4.183e-06 | 20 | 8.3661e-05 | 5.1282e-05 |
| [Callback]TQDMProgressBar.on_batch_end | 4.1562e-06 | 20 | 8.3124e-05 | 5.0952e-05 |
| [Callback]TQDMProgressBar.on_batch_start | 3.961e-06 | 20 | 7.922e-05 | 4.856e-05 |
| [Callback]LearningRateMonitor.on_before_backward | 3.9247e-06 | 20 | 7.8493e-05 | 4.8114e-05 |
| [Callback]ModelSummary.on_batch_start | 3.8938e-06 | 20 | 7.7875e-05 | 4.7735e-05 |
| [Callback]GradientAccumulationScheduler.on_before_zero_grad | 3.893e-06 | 20 | 7.7861e-05 | 4.7726e-05 |
| [Callback]TQDMProgressBar.on_before_backward | 3.8672e-06 | 20 | 7.7343e-05 | 4.7409e-05 |
| [Callback]LearningRateMonitor.on_before_zero_grad | 3.8611e-06 | 20 | 7.7222e-05 | 4.7335e-05 |
| [LightningModule]GCBC.on_validation_batch_start | 3.4864e-06 | 22 | 7.6701e-05 | 4.7015e-05 |
| [Callback]ModelSummary.on_before_zero_grad | 3.8142e-06 | 20 | 7.6283e-05 | 4.6759e-05 |
| [Callback]ModelCheckpoint{'monitor': None, 'mode': 'min', 'every_n_train_steps': 0, 'every_n_epochs': 1, 'train_time_interval': None, 'save_on_train_epoch_end': True}.on_before_backward | 3.801e-06 | 20 | 7.6021e-05 | 4.6599e-05 |
| [Callback]TQDMProgressBar.on_before_zero_grad | 3.6946e-06 | 20 | 7.3891e-05 | 4.5293e-05 |
| [Callback]ModelSummary.on_before_backward | 3.5121e-06 | 20 | 7.0242e-05 | 4.3056e-05 |
| [Callback]GradientAccumulationScheduler.on_before_backward | 3.4862e-06 | 20 | 6.9724e-05 | 4.2739e-05 |
| [LightningModule]GCBC.on_train_batch_end | 3.402e-06 | 20 | 6.804e-05 | 4.1706e-05 |
| [LightningModule]GCBC.training_step_end | 3.305e-06 | 20 | 6.61e-05 | 4.0517e-05 |
| [LightningModule]GCBC.on_before_zero_grad | 3.2048e-06 | 20 | 6.4095e-05 | 3.9288e-05 |
| [Callback]KLConstantSchedule.on_validation_end | 2.1258e-05 | 3 | 6.3773e-05 | 3.9091e-05 |
| [Callback]ModelSummary.on_train_epoch_end | 3.1795e-05 | 2 | 6.3591e-05 | 3.8979e-05 |
| [LightningModule]GCBC.validation_step_end | 2.8019e-06 | 22 | 6.1642e-05 | 3.7785e-05 |
| [LightningModule]GCBC.on_validation_batch_end | 2.796e-06 | 22 | 6.1512e-05 | 3.7705e-05 |
| [Callback]KLConstantSchedule.on_epoch_end | 1.205e-05 | 5 | 6.0252e-05 | 3.6933e-05 |
| [LightningModule]GCBC.on_train_batch_start | 2.9907e-06 | 20 | 5.9814e-05 | 3.6664e-05 |
| [LightningModule]GCBC.on_after_backward | 2.929e-06 | 20 | 5.8581e-05 | 3.5908e-05 |
| [Strategy]DDPStrategy.validation_step_end | 2.6555e-06 | 22 | 5.842e-05 | 3.581e-05 |
| [Callback]KLConstantSchedule.on_epoch_start | 1.0158e-05 | 5 | 5.0792e-05 | 3.1134e-05 |
| [LightningModule]GCBC.on_before_optimizer_step | 2.5365e-06 | 20 | 5.0731e-05 | 3.1096e-05 |
| [Callback]ModelSummary.on_train_epoch_start | 2.3921e-05 | 2 | 4.7842e-05 | 2.9326e-05 |
| [Strategy]DDPStrategy.on_train_batch_start | 2.369e-06 | 20 | 4.738e-05 | 2.9043e-05 |
| [Callback]SignalCallback.on_epoch_end | 8.676e-06 | 5 | 4.338e-05 | 2.6591e-05 |
| [Callback]ModelCheckpoint{'monitor': None, 'mode': 'min', 'every_n_train_steps': 0, 'every_n_epochs': 1, 'train_time_interval': None, 'save_on_train_epoch_end': None}.setup | 4.2241e-05 | 1 | 4.2241e-05 | 2.5893e-05 |
| [Callback]LearningRateMonitor.on_epoch_end | 8.1663e-06 | 5 | 4.0831e-05 | 2.5028e-05 |
| [Callback]SignalCallback.on_epoch_start | 8.1402e-06 | 5 | 4.0701e-05 | 2.4949e-05 |
| [Callback]ModelCheckpoint{'monitor': None, 'mode': 'min', 'every_n_train_steps': 0, 'every_n_epochs': 1, 'train_time_interval': None, 'save_on_train_epoch_end': True}.on_epoch_end | 7.3404e-06 | 5 | 3.6702e-05 | 2.2497e-05 |
| [Callback]ModelCheckpoint{'monitor': None, 'mode': 'min', 'every_n_train_steps': 0, 'every_n_epochs': 1, 'train_time_interval': None, 'save_on_train_epoch_end': True}.on_epoch_start | 7.3322e-06 | 5 | 3.6661e-05 | 2.2472e-05 |
| [Strategy]DDPStrategy.training_step_end | 1.768e-06 | 20 | 3.5359e-05 | 2.1674e-05 |
| [Callback]GradientAccumulationScheduler.on_epoch_start | 6.8702e-06 | 5 | 3.4351e-05 | 2.1056e-05 |
| [Callback]LearningRateMonitor.on_epoch_start | 6.086e-06 | 5 | 3.043e-05 | 1.8653e-05 |
| [Callback]KLConstantSchedule.on_train_epoch_start | 1.419e-05 | 2 | 2.838e-05 | 1.7396e-05 |
| [Callback]ModelSummary.on_validation_start | 8.5006e-06 | 3 | 2.5502e-05 | 1.5632e-05 |
| [Callback]KLConstantSchedule.on_validation_epoch_end | 8.3103e-06 | 3 | 2.4931e-05 | 1.5282e-05 |
| [Callback]GradientAccumulationScheduler.on_epoch_end | 4.9142e-06 | 5 | 2.4571e-05 | 1.5061e-05 |
| [Callback]KLConstantSchedule.on_validation_start | 8.1844e-06 | 3 | 2.4553e-05 | 1.505e-05 |
| [Callback]ModelSummary.on_epoch_end | 4.814e-06 | 5 | 2.407e-05 | 1.4754e-05 |
| [Callback]ModelSummary.on_epoch_start | 4.724e-06 | 5 | 2.362e-05 | 1.4478e-05 |
| [Callback]TQDMProgressBar.on_epoch_end | 4.5182e-06 | 5 | 2.2591e-05 | 1.3848e-05 |
| [Callback]LearningRateMonitor.on_train_epoch_start | 1.1021e-05 | 2 | 2.2042e-05 | 1.3511e-05 |
| [LightningModule]GCBC.validation_epoch_end | 7.2837e-06 | 3 | 2.1851e-05 | 1.3394e-05 |
| [Callback]TQDMProgressBar.on_epoch_start | 4.346e-06 | 5 | 2.173e-05 | 1.332e-05 |
| [Callback]GradientAccumulationScheduler.on_validation_end | 6.2104e-06 | 3 | 1.8631e-05 | 1.142e-05 |
| [Callback]SignalCallback.on_validation_start | 6.0137e-06 | 3 | 1.8041e-05 | 1.1059e-05 |
| [Callback]KLConstantSchedule.on_save_checkpoint | 8.8461e-06 | 2 | 1.7692e-05 | 1.0845e-05 |
| [Callback]GradientAccumulationScheduler.on_train_epoch_start | 8.605e-06 | 2 | 1.721e-05 | 1.0549e-05 |
| [Callback]KLConstantSchedule.on_validation_epoch_start | 5.347e-06 | 3 | 1.6041e-05 | 9.8326e-06 |
| [Callback]SignalCallback.on_validation_end | 5.2633e-06 | 3 | 1.579e-05 | 9.6787e-06 |
| [Callback]ModelSummary.on_train_end | 1.5701e-05 | 1 | 1.5701e-05 | 9.6242e-06 |
| [Callback]GradientAccumulationScheduler.on_validation_start | 5.1167e-06 | 3 | 1.535e-05 | 9.4091e-06 |
| [Callback]ModelCheckpoint{'monitor': None, 'mode': 'min', 'every_n_train_steps': 0, 'every_n_epochs': 1, 'train_time_interval': None, 'save_on_train_epoch_end': True}.on_validation_start | 5.0467e-06 | 3 | 1.514e-05 | 9.2804e-06 |
| [Callback]TQDMProgressBar.on_sanity_check_end | 1.5061e-05 | 1 | 1.5061e-05 | 9.2319e-06 |
| [Callback]LearningRateMonitor.on_validation_start | 4.9601e-06 | 3 | 1.488e-05 | 9.1211e-06 |
| [Callback]ModelCheckpoint{'monitor': None, 'mode': 'min', 'every_n_train_steps': 0, 'every_n_epochs': 1, 'train_time_interval': None, 'save_on_train_epoch_end': True}.on_validation_epoch_end | 4.7633e-06 | 3 | 1.429e-05 | 8.7593e-06 |
| [Callback]SignalCallback.on_validation_epoch_end | 4.7004e-06 | 3 | 1.4101e-05 | 8.6436e-06 |
| [Callback]ModelCheckpoint{'monitor': None, 'mode': 'min', 'every_n_train_steps': 0, 'every_n_epochs': 1, 'train_time_interval': None, 'save_on_train_epoch_end': True}.on_validation_epoch_start | 4.6899e-06 | 3 | 1.407e-05 | 8.6243e-06 |
| [Callback]LearningRateMonitor.on_validation_epoch_end | 4.6834e-06 | 3 | 1.405e-05 | 8.6123e-06 |
| [Callback]TQDMProgressBar.on_validation_epoch_end | 4.6736e-06 | 3 | 1.4021e-05 | 8.5944e-06 |
| [Callback]ModelSummary.on_validation_epoch_start | 4.637e-06 | 3 | 1.3911e-05 | 8.527e-06 |
| [Callback]SignalCallback.on_validation_epoch_start | 4.5234e-06 | 3 | 1.357e-05 | 8.3181e-06 |
| [Callback]KLConstantSchedule.on_train_epoch_end | 6.76e-06 | 2 | 1.352e-05 | 8.2874e-06 |
| [LightningModule]GCBC.on_epoch_end | 2.6761e-06 | 5 | 1.338e-05 | 8.2017e-06 |
| [Callback]GradientAccumulationScheduler.on_validation_epoch_start | 4.3937e-06 | 3 | 1.3181e-05 | 8.0796e-06 |
| [Callback]ModelSummary.on_validation_epoch_end | 4.3337e-06 | 3 | 1.3001e-05 | 7.9692e-06 |
| [Callback]TQDMProgressBar.on_validation_epoch_start | 4.27e-06 | 3 | 1.281e-05 | 7.8522e-06 |
| [Callback]LearningRateMonitor.on_validation_epoch_start | 4.2599e-06 | 3 | 1.278e-05 | 7.8337e-06 |
| [Callback]GradientAccumulationScheduler.on_validation_epoch_end | 4.2304e-06 | 3 | 1.2691e-05 | 7.7793e-06 |
| [Callback]TQDMProgressBar.setup | 1.263e-05 | 1 | 1.263e-05 | 7.7419e-06 |
| [Callback]GradientAccumulationScheduler.on_train_epoch_end | 6.3105e-06 | 2 | 1.2621e-05 | 7.7363e-06 |
| [Callback]LearningRateMonitor.on_validation_end | 4.1804e-06 | 3 | 1.2541e-05 | 7.6874e-06 |
| [LightningModule]GCBC.on_epoch_start | 2.2219e-06 | 5 | 1.111e-05 | 6.8099e-06 |
| [Callback]ModelCheckpoint{'monitor': None, 'mode': 'min', 'every_n_train_steps': 0, 'every_n_epochs': 1, 'train_time_interval': None, 'save_on_train_epoch_end': True}.on_train_start | 1.103e-05 | 1 | 1.103e-05 | 6.7611e-06 |
| [Callback]ModelCheckpoint{'monitor': None, 'mode': 'min', 'every_n_train_steps': 0, 'every_n_epochs': 1, 'train_time_interval': None, 'save_on_train_epoch_end': True}.on_train_epoch_start | 5.2999e-06 | 2 | 1.06e-05 | 6.4974e-06 |
| [Callback]SignalCallback.on_train_epoch_start | 5.0949e-06 | 2 | 1.019e-05 | 6.2461e-06 |
| [Callback]SignalCallback.on_save_checkpoint | 4.94e-06 | 2 | 9.8799e-06 | 6.0561e-06 |
| [LightningModule]GCBC.configure_callbacks | 9.8401e-06 | 1 | 9.8401e-06 | 6.0317e-06 |
| [Callback]ModelSummary.on_save_checkpoint | 4.6155e-06 | 2 | 9.231e-06 | 5.6584e-06 |
| [Callback]GradientAccumulationScheduler.on_fit_start | 9.0799e-06 | 1 | 9.0799e-06 | 5.5657e-06 |
| [Strategy]DDPStrategy.on_validation_end | 2.9767e-06 | 3 | 8.9302e-06 | 5.474e-06 |
| [Callback]ModelSummary.on_sanity_check_start | 8.8899e-06 | 1 | 8.8899e-06 | 5.4493e-06 |
| [Callback]ModelSummary.on_train_start | 8.76e-06 | 1 | 8.76e-06 | 5.3696e-06 |
| [Callback]SignalCallback.on_train_epoch_end | 4.3655e-06 | 2 | 8.7309e-06 | 5.3518e-06 |
| [Callback]TQDMProgressBar.on_save_checkpoint | 4.3405e-06 | 2 | 8.6811e-06 | 5.3213e-06 |
| [Callback]LearningRateMonitor.on_save_checkpoint | 4.095e-06 | 2 | 8.1901e-06 | 5.0203e-06 |
| [Callback]KLConstantSchedule.on_train_end | 8.0599e-06 | 1 | 8.0599e-06 | 4.9405e-06 |
| [Callback]GradientAccumulationScheduler.on_train_start | 7.72e-06 | 1 | 7.72e-06 | 4.7321e-06 |
| [LightningModule]GCBC.on_validation_end | 2.55e-06 | 3 | 7.6501e-06 | 4.6893e-06 |
| [Callback]LearningRateMonitor.on_train_epoch_end | 3.74e-06 | 2 | 7.4799e-06 | 4.585e-06 |
| [Callback]ModelCheckpoint{'monitor': None, 'mode': 'min', 'every_n_train_steps': 0, 'every_n_epochs': 1, 'train_time_interval': None, 'save_on_train_epoch_end': True}.on_save_checkpoint | 3.7251e-06 | 2 | 7.4501e-06 | 4.5667e-06 |
| [Callback]GradientAccumulationScheduler.on_save_checkpoint | 3.5899e-06 | 2 | 7.1798e-06 | 4.401e-06 |
| [Callback]KLConstantSchedule.on_pretrain_routine_start | 7.041e-06 | 1 | 7.041e-06 | 4.3159e-06 |
| [Callback]KLConstantSchedule.on_train_start | 6.73e-06 | 1 | 6.73e-06 | 4.1253e-06 |
| [Callback]KLConstantSchedule.on_sanity_check_end | 6.1609e-06 | 1 | 6.1609e-06 | 3.7765e-06 |
| [Callback]GradientAccumulationScheduler.on_train_end | 6.15e-06 | 1 | 6.15e-06 | 3.7698e-06 |
| [LightningModule]GCBC.prepare_data | 6.1211e-06 | 1 | 6.1211e-06 | 3.7521e-06 |
| [Callback]LearningRateMonitor.on_fit_start | 6.011e-06 | 1 | 6.011e-06 | 3.6846e-06 |
| [Callback]KLConstantSchedule.on_fit_end | 5.98e-06 | 1 | 5.98e-06 | 3.6656e-06 |
| [Callback]SignalCallback.on_train_end | 5.9502e-06 | 1 | 5.9502e-06 | 3.6473e-06 |
| [LightningModule]GCBC.on_train_start | 5.8399e-06 | 1 | 5.8399e-06 | 3.5797e-06 |
| [LightningModule]GCBC.on_validation_start | 1.9204e-06 | 3 | 5.7612e-06 | 3.5314e-06 |
| [Callback]ModelCheckpoint{'monitor': None, 'mode': 'min', 'every_n_train_steps': 0, 'every_n_epochs': 1, 'train_time_interval': None, 'save_on_train_epoch_end': True}.on_train_end | 5.76e-06 | 1 | 5.76e-06 | 3.5307e-06 |
| [Callback]KLConstantSchedule.on_fit_start | 5.6999e-06 | 1 | 5.6999e-06 | 3.4939e-06 |
| [Strategy]DDPStrategy.on_validation_start | 1.867e-06 | 3 | 5.601e-06 | 3.4332e-06 |
| [Callback]SignalCallback.on_pretrain_routine_start | 5.2601e-06 | 1 | 5.2601e-06 | 3.2243e-06 |
| [Callback]SignalCallback.on_train_start | 5.15e-06 | 1 | 5.15e-06 | 3.1568e-06 |
| [Callback]KLConstantSchedule.on_sanity_check_start | 5e-06 | 1 | 5e-06 | 3.0649e-06 |
| [Callback]ModelSummary.on_sanity_check_end | 4.9002e-06 | 1 | 4.9002e-06 | 3.0036e-06 |
| [Callback]KLConstantSchedule.setup | 4.8e-06 | 1 | 4.8e-06 | 2.9423e-06 |
| [Callback]LearningRateMonitor.on_train_end | 4.76e-06 | 1 | 4.76e-06 | 2.9177e-06 |
| [Callback]GradientAccumulationScheduler.on_sanity_check_start | 4.6499e-06 | 1 | 4.6499e-06 | 2.8502e-06 |
| [Callback]ModelCheckpoint{'monitor': None, 'mode': 'min', 'every_n_train_steps': 0, 'every_n_epochs': 1, 'train_time_interval': None, 'save_on_train_epoch_end': True}.on_fit_start | 4.58e-06 | 1 | 4.58e-06 | 2.8074e-06 |
| [Callback]ModelCheckpoint{'monitor': None, 'mode': 'min', 'every_n_train_steps': 0, 'every_n_epochs': 1, 'train_time_interval': None, 'save_on_train_epoch_end': True}.on_sanity_check_start | 4.2499e-06 | 1 | 4.2499e-06 | 2.605e-06 |
| [Callback]LearningRateMonitor.on_pretrain_routine_start | 4.1202e-06 | 1 | 4.1202e-06 | 2.5255e-06 |
| [Callback]SignalCallback.on_sanity_check_start | 4.0801e-06 | 1 | 4.0801e-06 | 2.501e-06 |
| [Callback]KLConstantSchedule.on_pretrain_routine_end | 3.96e-06 | 1 | 3.96e-06 | 2.4274e-06 |
| [Callback]GradientAccumulationScheduler.on_sanity_check_end | 3.95e-06 | 1 | 3.95e-06 | 2.4212e-06 |
| [Callback]ModelCheckpoint{'monitor': None, 'mode': 'min', 'every_n_train_steps': 0, 'every_n_epochs': 1, 'train_time_interval': None, 'save_on_train_epoch_end': True}.on_sanity_check_end | 3.95e-06 | 1 | 3.95e-06 | 2.4212e-06 |
| [LightningModule]GCBC.on_save_checkpoint | 1.975e-06 | 2 | 3.95e-06 | 2.4212e-06 |
| [Callback]SignalCallback.on_sanity_check_end | 3.9199e-06 | 1 | 3.9199e-06 | 2.4028e-06 |
| [Callback]ModelCheckpoint{'monitor': None, 'mode': 'min', 'every_n_train_steps': 0, 'every_n_epochs': 1, 'train_time_interval': None, 'save_on_train_epoch_end': True}.on_pretrain_routine_start | 3.8701e-06 | 1 | 3.8701e-06 | 2.3723e-06 |
| [Callback]TQDMProgressBar.on_fit_start | 3.8098e-06 | 1 | 3.8098e-06 | 2.3353e-06 |
| [Callback]LearningRateMonitor.on_sanity_check_end | 3.74e-06 | 1 | 3.74e-06 | 2.2925e-06 |
| [Callback]GradientAccumulationScheduler.on_pretrain_routine_start | 3.6701e-06 | 1 | 3.6701e-06 | 2.2497e-06 |
| [Callback]SignalCallback.on_pretrain_routine_end | 3.57e-06 | 1 | 3.57e-06 | 2.1883e-06 |
| [Callback]TQDMProgressBar.on_pretrain_routine_start | 3.54e-06 | 1 | 3.54e-06 | 2.1699e-06 |
| [Callback]ModelCheckpoint{'monitor': None, 'mode': 'min', 'every_n_train_steps': 0, 'every_n_epochs': 1, 'train_time_interval': None, 'save_on_train_epoch_end': True}.on_pretrain_routine_end | 3.5202e-06 | 1 | 3.5202e-06 | 2.1578e-06 |
| [Callback]LearningRateMonitor.on_sanity_check_start | 3.5199e-06 | 1 | 3.5199e-06 | 2.1576e-06 |
| [Callback]ModelSummary.on_pretrain_routine_start | 3.441e-06 | 1 | 3.441e-06 | 2.1092e-06 |
| [Callback]KLConstantSchedule.on_before_accelerator_backend_setup | 3.4301e-06 | 1 | 3.4301e-06 | 2.1025e-06 |
| [Callback]LearningRateMonitor.on_pretrain_routine_end | 3.421e-06 | 1 | 3.421e-06 | 2.097e-06 |
| [Callback]TQDMProgressBar.on_pretrain_routine_end | 3.41e-06 | 1 | 3.41e-06 | 2.0903e-06 |
| [Callback]ModelSummary.on_pretrain_routine_end | 3.35e-06 | 1 | 3.35e-06 | 2.0534e-06 |
| [Callback]GradientAccumulationScheduler.on_pretrain_routine_end | 3.34e-06 | 1 | 3.34e-06 | 2.0473e-06 |
| [Callback]KLConstantSchedule.on_configure_sharded_model | 2.79e-06 | 1 | 2.79e-06 | 1.7102e-06 |
| [Callback]KLConstantSchedule.teardown | 2.5302e-06 | 1 | 2.5302e-06 | 1.5509e-06 |
| [Callback]SignalCallback.setup | 2.4801e-06 | 1 | 2.4801e-06 | 1.5202e-06 |
| [Callback]TQDMProgressBar.on_before_accelerator_backend_setup | 2.42e-06 | 1 | 2.42e-06 | 1.4834e-06 |
| [LightningModule]GCBC.on_train_end | 2.0701e-06 | 1 | 2.0701e-06 | 1.2689e-06 |
| [LightningModule]GCBC.on_train_dataloader | 2.0599e-06 | 1 | 2.0599e-06 | 1.2626e-06 |
| [Callback]SignalCallback.on_configure_sharded_model | 2.0498e-06 | 1 | 2.0498e-06 | 1.2565e-06 |
| [Callback]ModelSummary.setup | 1.98e-06 | 1 | 1.98e-06 | 1.2137e-06 |
| [Callback]SignalCallback.on_before_accelerator_backend_setup | 1.97e-06 | 1 | 1.97e-06 | 1.2075e-06 |
| [Callback]LearningRateMonitor.setup | 1.9399e-06 | 1 | 1.9399e-06 | 1.1891e-06 |
| [Callback]ModelCheckpoint{'monitor': None, 'mode': 'min', 'every_n_train_steps': 0, 'every_n_epochs': 1, 'train_time_interval': None, 'save_on_train_epoch_end': None}.on_before_accelerator_backend_setup | 1.9108e-06 | 1 | 1.9108e-06 | 1.1713e-06 |
| [Callback]SignalCallback.on_fit_end | 1.8999e-06 | 1 | 1.8999e-06 | 1.1646e-06 |
| [Callback]LearningRateMonitor.on_configure_sharded_model | 1.8501e-06 | 1 | 1.8501e-06 | 1.134e-06 |
| [LightningModule]GCBC.setup | 1.76e-06 | 1 | 1.76e-06 | 1.0788e-06 |
| [Callback]ModelCheckpoint{'monitor': None, 'mode': 'min', 'every_n_train_steps': 0, 'every_n_epochs': 1, 'train_time_interval': None, 'save_on_train_epoch_end': True}.on_fit_end | 1.7299e-06 | 1 | 1.7299e-06 | 1.0604e-06 |
| [Callback]ModelSummary.on_configure_sharded_model | 1.7202e-06 | 1 | 1.7202e-06 | 1.0544e-06 |
| [Callback]TQDMProgressBar.on_configure_sharded_model | 1.6999e-06 | 1 | 1.6999e-06 | 1.042e-06 |
| [Callback]ModelCheckpoint{'monitor': None, 'mode': 'min', 'every_n_train_steps': 0, 'every_n_epochs': 1, 'train_time_interval': None, 'save_on_train_epoch_end': True}.on_configure_sharded_model | 1.6901e-06 | 1 | 1.6901e-06 | 1.036e-06 |
| [Callback]LearningRateMonitor.on_fit_end | 1.6699e-06 | 1 | 1.6699e-06 | 1.0236e-06 |
| [Callback]GradientAccumulationScheduler.on_configure_sharded_model | 1.63e-06 | 1 | 1.63e-06 | 9.9917e-07 |
| [Callback]ModelSummary.on_fit_end | 1.6289e-06 | 1 | 1.6289e-06 | 9.9846e-07 |
| [Callback]LearningRateMonitor.on_before_accelerator_backend_setup | 1.62e-06 | 1 | 1.62e-06 | 9.9303e-07 |
| [Callback]SignalCallback.teardown | 1.62e-06 | 1 | 1.62e-06 | 9.9303e-07 |
| [Callback]GradientAccumulationScheduler.setup | 1.5891e-06 | 1 | 1.5891e-06 | 9.7405e-07 |
| [Callback]TQDMProgressBar.on_fit_end | 1.5299e-06 | 1 | 1.5299e-06 | 9.378e-07 |
| [Callback]ModelCheckpoint{'monitor': None, 'mode': 'min', 'every_n_train_steps': 0, 'every_n_epochs': 1, 'train_time_interval': None, 'save_on_train_epoch_end': True}.teardown | 1.5199e-06 | 1 | 1.5199e-06 | 9.3167e-07 |
| [Callback]GradientAccumulationScheduler.on_before_accelerator_backend_setup | 1.5099e-06 | 1 | 1.5099e-06 | 9.2553e-07 |
| [Callback]LearningRateMonitor.teardown | 1.5099e-06 | 1 | 1.5099e-06 | 9.2553e-07 |
| [Callback]ModelSummary.teardown | 1.5099e-06 | 1 | 1.5099e-06 | 9.2553e-07 |
| [LightningModule]GCBC.on_fit_end | 1.4901e-06 | 1 | 1.4901e-06 | 9.134e-07 |
| [LightningModule]GCBC.configure_sharded_model | 1.4701e-06 | 1 | 1.4701e-06 | 9.0112e-07 |
| [Callback]GradientAccumulationScheduler.on_fit_end | 1.4699e-06 | 1 | 1.4699e-06 | 9.0098e-07 |
| [Callback]ModelSummary.on_before_accelerator_backend_setup | 1.4601e-06 | 1 | 1.4601e-06 | 8.9499e-07 |
| [Callback]GradientAccumulationScheduler.teardown | 1.4e-06 | 1 | 1.4e-06 | 8.5817e-07 |
| [Callback]TQDMProgressBar.teardown | 1.39e-06 | 1 | 1.39e-06 | 8.5203e-07 |
| [Strategy]DDPStrategy.on_train_end | 1.211e-06 | 1 | 1.211e-06 | 7.4228e-07 |
| [LightningModule]GCBC.on_val_dataloader | 1.2e-06 | 1 | 1.2e-06 | 7.3557e-07 |
| [Strategy]DDPStrategy.on_train_start | 1.16e-06 | 1 | 1.16e-06 | 7.1102e-07 |
| [LightningModule]GCBC.on_pretrain_routine_start | 1.15e-06 | 1 | 1.15e-06 | 7.0489e-07 |
| [LightningModule]GCBC.on_pretrain_routine_end | 8.1002e-07 | 1 | 8.1002e-07 | 4.9652e-07 |
| [LightningModule]GCBC.teardown | 7.4995e-07 | 1 | 7.4995e-07 | 4.597e-07 |
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Could you provide the exact command that you used for running this? And is this on a SLURM cluster with 4x3090?
I should have added that you should disable the rollout callbacks for profiling, since they add computational overhead in the first epochs. You can do that by appending `~callbacks/rollout ~callbacks/rollout_lh ~callbacks/tsne_plot` to your run command.
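For example, a sketch of such a run command (the entry-point path is an assumption; the `~` removal overrides are quoted so the shell does not attempt tilde expansion on them). Again this only assembles and prints the command:

```shell
# Run with the rollout/plot callbacks removed via Hydra's ~ (delete)
# override. Quoting the overrides keeps the shell from treating the
# leading ~ as a home-directory reference.
CMD="python hulc/training.py \
  '~callbacks/rollout' '~callbacks/rollout_lh' '~callbacks/tsne_plot'"
echo "$CMD"
```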
Hi, I did disable the rollout callbacks for this run. The training ran on the cluster with 4x3090 GPUs but was started without SLURM.
The command:
/home/temp_store/miniconda3/envs/c_env/bin/python /home/temp_store/code/hulc/hulc/training.py +trainer.limit_train_batches=10 +trainer.limit_val_batches=10
The config I used:
callbacks:
checkpoint:
_target_: pytorch_lightning.callbacks.ModelCheckpoint
save_top_k: -1
verbose: true
dirpath: saved_models
filename: '{epoch}'
kl_schedule:
_target_: hulc.utils.kl_callbacks.KLConstantSchedule
shm_signal:
_target_: hulc.datasets.utils.shared_memory_utils.SignalCallback
datamodule:
datasets:
vision_dataset:
_target_: hulc.datasets.disk_dataset.DiskDataset
key: vis
save_format: npz
batch_size: 32
min_window_size: 20
max_window_size: 32
proprio_state: ${datamodule.proprioception_dims}
obs_space: ${datamodule.observation_space}
pad: true
lang_folder: lang_paraphrase-MiniLM-L3-v2
num_workers: 2
lang_dataset:
_target_: hulc.datasets.disk_dataset.DiskDataset
key: lang
save_format: npz
batch_size: 32
min_window_size: 20
max_window_size: 32
proprio_state: ${datamodule.proprioception_dims}
obs_space: ${datamodule.observation_space}
skip_frames: 1
pad: true
lang_folder: lang_paraphrase-MiniLM-L3-v2
aux_lang_loss_window: 8
num_workers: 2
transforms:
train:
rgb_static:
- _target_: torchvision.transforms.Resize
size: 200
- _target_: hulc.utils.transforms.RandomShiftsAug
pad: 10
- _target_: hulc.utils.transforms.ScaleImageTensor
- _target_: torchvision.transforms.Normalize
mean:
- 0.5
std:
- 0.5
rgb_gripper:
- _target_: torchvision.transforms.Resize
size: 84
- _target_: hulc.utils.transforms.RandomShiftsAug
pad: 4
- _target_: hulc.utils.transforms.ScaleImageTensor
- _target_: torchvision.transforms.Normalize
mean:
- 0.5
std:
- 0.5
depth_static:
- _target_: torchvision.transforms.Resize
size: 200
- _target_: hulc.utils.transforms.AddDepthNoise
shape:
- 1000.0
rate:
- 1000.0
depth_gripper:
- _target_: torchvision.transforms.Resize
size: 84
- _target_: hulc.utils.transforms.AddGaussianNoise
mean:
- 0.0
std:
- 0.01
rgb_tactile:
- _target_: torchvision.transforms.Resize
size: 70
- _target_: torchvision.transforms.RandomCrop
size: 64
- _target_: hulc.utils.transforms.ScaleImageTensor
- _target_: torchvision.transforms.Normalize
mean:
- 0.5
std:
- 0.5
depth_tactile:
- _target_: torchvision.transforms.Resize
size: 64
- _target_: torchvision.transforms.Normalize
mean:
- 0.1
std:
- 0.2
robot_obs:
- _target_: hulc.utils.transforms.NormalizeVector
scene_obs:
- _target_: hulc.utils.transforms.NormalizeVector
val:
rgb_static:
- _target_: torchvision.transforms.Resize
size: 200
- _target_: hulc.utils.transforms.ScaleImageTensor
- _target_: torchvision.transforms.Normalize
mean:
- 0.5
std:
- 0.5
rgb_gripper:
- _target_: torchvision.transforms.Resize
size: 84
- _target_: hulc.utils.transforms.ScaleImageTensor
- _target_: torchvision.transforms.Normalize
mean:
- 0.5
std:
- 0.5
depth_static:
- _target_: torchvision.transforms.Resize
size: 200
depth_gripper:
- _target_: torchvision.transforms.Resize
size: 84
rgb_tactile:
- _target_: torchvision.transforms.Resize
size: 70
- _target_: torchvision.transforms.RandomCrop
size: 64
- _target_: hulc.utils.transforms.ScaleImageTensor
- _target_: torchvision.transforms.Normalize
mean:
- 0.5
std:
- 0.5
depth_tactile:
- _target_: torchvision.transforms.Resize
size: 64
- _target_: torchvision.transforms.Normalize
mean:
- 0.1
std:
- 0.2
robot_obs:
- _target_: hulc.utils.transforms.NormalizeVector
scene_obs:
- _target_: hulc.utils.transforms.NormalizeVector
proprioception_dims:
n_state_obs: 8
keep_indices:
- - 0
- 7
- - 14
- 15
robot_orientation_idx:
- 3
- 6
normalize: true
normalize_robot_orientation: true
observation_space:
rgb_obs:
- rgb_static
- rgb_gripper
depth_obs: []
state_obs:
- robot_obs
actions:
- rel_actions
language:
- language
_target_: hulc.datasets.hulc_data_module.HulcDataModule
_recursive_: false
root_data_dir: /home/temp_store/calvin_data/task_D_D
action_space: 7
action_max:
- 1.0
- 1.0
- 1.0
- 1.0
- 1.0
- 1.0
- 1.0
action_min:
- -1.0
- -1.0
- -1.0
- -1.0
- -1.0
- -1.0
- -1
shuffle_val: false
model:
perceptual_encoder:
rgb_static:
_target_: hulc.models.perceptual_encoders.vision_network.VisionNetwork
input_width: 200
input_height: 200
activation_function: ReLU
dropout_vis_fc: 0.0
l2_normalize_output: false
visual_features: 64
num_c: 3
use_sinusoid: false
spatial_softmax_temp: 1.0
rgb_gripper:
_target_: hulc.models.perceptual_encoders.vision_network_gripper.VisionNetwork
input_width: 84
input_height: 84
activation_function: ReLU
dropout_vis_fc: 0.0
l2_normalize_output: false
visual_features: 64
conv_encoder: nature_cnn
num_c: 3
depth_static: {}
depth_gripper: {}
proprio: {}
tactile: {}
_target_: hulc.models.perceptual_encoders.concat_encoders.ConcatEncoders
_recursive_: false
plan_proposal:
_target_: hulc.models.plan_encoders.plan_proposal_net.PlanProposalNetwork
perceptual_features: ???
latent_goal_features: ${model.visual_goal.latent_goal_features}
plan_features: ???
activation_function: ReLU
hidden_size: 2048
plan_recognition:
_target_: hulc.models.plan_encoders.plan_recognition_net.PlanRecognitionTransformersNetwork
num_heads: 8
num_layers: 2
encoder_hidden_size: 2048
fc_hidden_size: 4096
in_features: ??
plan_features: ???
action_space: ${datamodule.action_space}
dropout_p: 0.1
encoder_normalize: false
positional_normalize: false
position_embedding: true
max_position_embeddings: ${datamodule.datasets.lang_dataset.max_window_size}
distribution:
_target_: hulc.utils.distributions.Distribution
dist: discrete
category_size: 32
class_size: 32
visual_goal:
_target_: hulc.models.encoders.goal_encoders.VisualGoalEncoder
in_features: ???
hidden_size: 2048
latent_goal_features: 32
l2_normalize_goal_embeddings: false
activation_function: ReLU
language_goal:
_target_: hulc.models.encoders.goal_encoders.LanguageGoalEncoder
in_features: 384
hidden_size: 2048
latent_goal_features: 32
l2_normalize_goal_embeddings: false
activation_function: ReLU
word_dropout_p: 0.0
action_decoder:
_target_: hulc.models.decoders.logistic_decoder_rnn.LogisticDecoderRNN
n_mixtures: 10
hidden_size: 2048
out_features: ${datamodule.action_space}
log_scale_min: -7.0
act_max_bound: ${datamodule.action_max}
act_min_bound: ${datamodule.action_min}
dataset_dir: ${datamodule.root_data_dir}
load_action_bounds: false
num_classes: 10
latent_goal_features: ${model.visual_goal.latent_goal_features}
plan_features: ???
perceptual_features: ???
gripper_alpha: 1.0
perceptual_emb_slice:
- 64
- 128
policy_rnn_dropout_p: 0.0
num_layers: 2
rnn_model: rnn_decoder
gripper_control: true
discrete_gripper: true
optimizer:
_target_: torch.optim.Adam
lr: ${training.lr}
lr_scheduler:
_target_: transformers.get_constant_schedule
bc_z_lang_decoder: {}
mia_lang_discriminator: {}
proj_vis_lang:
_target_: hulc.models.auxiliary_loss_networks.proj_vis_lang.ProjVisLang
im_dim: ${model.plan_recognition.fc_hidden_size}
lang_dim: ${model.language_goal.latent_goal_features}
output_dim: ${model.language_goal.latent_goal_features}
proj_lang: true
val_instructions:
rotate_red_block_right:
- take the red block and rotate it to the right
rotate_red_block_left:
- take the red block and rotate it to the left
rotate_blue_block_right:
- take the blue block and rotate it to the right
rotate_blue_block_left:
- take the blue block and rotate it to the left
rotate_pink_block_right:
- take the pink block and rotate it to the right
rotate_pink_block_left:
- take the pink block and rotate it to the left
push_red_block_right:
- go push the red block right
push_red_block_left:
- go push the red block left
push_blue_block_right:
- go push the blue block right
push_blue_block_left:
- go push the blue block left
push_pink_block_right:
- go push the pink block right
push_pink_block_left:
- go push the pink block left
move_slider_left:
- push the sliding door to the left side
move_slider_right:
- push the sliding door to the right side
open_drawer:
- pull the handle to open the drawer
close_drawer:
- push the handle to close the drawer
lift_red_block_table:
- grasp and lift the red block
lift_blue_block_table:
- grasp and lift the blue block
lift_pink_block_table:
- grasp and lift the pink block
lift_red_block_slider:
- lift the red block from the sliding cabinet
lift_blue_block_slider:
- lift the blue block from the sliding cabinet
lift_pink_block_slider:
- lift the pink block from the sliding cabinet
lift_red_block_drawer:
- Take the red block from the drawer
lift_blue_block_drawer:
- Take the blue block from the drawer
lift_pink_block_drawer:
- Take the pink block from the drawer
place_in_slider:
- store the grasped block in the sliding cabinet
place_in_drawer:
- store the grasped block in the drawer
push_into_drawer:
- slide the block that it falls into the drawer
stack_block:
- stack the grasped block
unstack_block:
- remove the stacked block
turn_on_lightbulb:
- use the switch to turn on the light bulb
turn_off_lightbulb:
- use the switch to turn off the light bulb
turn_on_led:
- press the button to turn on the led light
turn_off_led:
- press the button to turn off the led light
_target_: hulc.models.gcbc.GCBC
_recursive_: false
kl_beta: ${loss.kl_beta}
kl_balancing_mix: ${loss.kl_balancing_mix}
state_recons: false
state_recon_beta: ${loss.state_recon_beta}
use_bc_z_auxiliary_loss: false
bc_z_auxiliary_loss_beta: ${loss.bc_z_auxiliary_loss_beta}
use_mia_auxiliary_loss: false
mia_auxiliary_loss_beta: ${loss.mia_auxiliary_loss_beta}
replan_freq: 30
use_clip_auxiliary_loss: true
clip_auxiliary_loss_beta: ${loss.clip_auxiliary_loss_beta}
loss:
kl_beta: 0.01
state_recon_beta: 0.5
kl_balancing_mix: 0.8
bc_z_auxiliary_loss_beta: 1.0
mia_auxiliary_loss_beta: 1.0
clip_auxiliary_loss_beta: 3.0
training:
lr: 0.0002
trainer:
gpus: 4
precision: 16
val_check_interval: 1.0
max_epochs: 2
sync_batchnorm: false
limit_train_batches: 10
limit_val_batches: 10
logger:
_target_: pytorch_lightning.loggers.WandbLogger
save_dir: .
name: play_lmp
group: play_lmp
log_model: false
project: calvin_vision
id: ???
seed: 42
log_dir: ../
slurm: false
Just to make sure this is not the issue: how do you start your interactive SLURM session? Did you reserve enough CPU cores? What is the terminal output of `nproc`?
Right now, I am using the cluster like a normal PC with 4 GPUs and running the training script directly in Python without SLURM, since I am the only one using it. I have also started jobs with the SLURM variant, but I had the same problems. The output of `nproc`
is 48. When I tried to increase the number of workers, the training crashed; I think the limiting factor here could be the 128 GB of RAM. During training all GPUs are found and in use, but their average load is usually close to 10% at most, and the percentage of used memory is similar.
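Given GPU utilization around 10%, the input pipeline is a likely bottleneck. A quick way to check, independent of Lightning, is to time the DataLoader on its own; here is a rough sketch with a dummy dataset standing in for the real DiskDataset (swap in your dataset, transforms, and worker count to get meaningful numbers):

```python
import time

import torch
from torch.utils.data import DataLoader, Dataset


# Hypothetical stand-in for the real DiskDataset; replace with your
# dataset (including transforms) to get meaningful numbers.
class DummyDataset(Dataset):
    def __len__(self):
        return 256

    def __getitem__(self, idx):
        # Roughly the shape of an rgb_static frame after Resize(200)
        return torch.zeros(3, 200, 200)


def measure_throughput(loader, num_batches=50):
    """Return DataLoader throughput in batches per second."""
    it = iter(loader)
    next(it)  # warm-up: triggers worker startup / first fetch
    start = time.perf_counter()
    n = 0
    for _ in range(num_batches - 1):
        try:
            next(it)
        except StopIteration:
            break
        n += 1
    return n / (time.perf_counter() - start)


# num_workers=0 keeps this self-contained; set it to your training value
loader = DataLoader(DummyDataset(), batch_size=32, num_workers=0)
print(f"{measure_throughput(loader):.1f} batches/s")
```

If this number times the batch size is far below what the GPUs can consume, the data path (disk or network mount) is the bottleneck rather than the model.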
How fast does it run if you only use 1 or 2 GPUs for example?
Similar speed. I am testing some models using the tacorl dataset and code, and there the average training time for an epoch is around 1 h. I know that dataset is a bit smaller and has no language annotations, but training seems to work fine there.
I also get the KeyError when executing https://github.com/lukashermann/hulc/blob/main/hulc/datasets/utils/shared_memory_utils.py#L192. I have already deleted `train_shm_lookup.npy` and `val_shm_lookup.npy`, as well as `/dev/shm/train_*` and `/dev/shm/val_*`. However, I still get the KeyError.
P.S. I was using the ABC dataset for training.
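Since stale segments keep causing this, the cleanup can be scripted; below is a minimal sketch (the file patterns are the ones mentioned in this thread, and the lookup files are assumed to live in the system temp dir — adjust `tmp_dir` if yours go to a slurm_temp directory):

```python
import glob
import os
import tempfile


def clear_stale_shm(shm_dir="/dev/shm", tmp_dir=tempfile.gettempdir()):
    """Delete leftover shared-memory segments and lookup files from a previous run.

    Returns the list of paths that were removed.
    """
    removed = []
    # Shared-memory segments created by the dataset loader
    for pattern in ("train_*", "val_*"):
        for path in glob.glob(os.path.join(shm_dir, pattern)):
            os.remove(path)
            removed.append(path)
    # Lookup files written next to them in the temp directory
    for name in ("train_shm_lookup.npy", "val_shm_lookup.npy"):
        path = os.path.join(tmp_dir, name)
        if os.path.exists(path):
            os.remove(path)
            removed.append(path)
    return removed
```

Running this before restarting training after a crashed or debug run avoids picking up a lookup table that no longer matches the dataset.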
Update: I found the problem. Since the ABC dataset is very large, the mounted `/dev/shm` is not big enough to hold all episodes. One solution is to enlarge `/dev/shm` with a remount, provided the server has enough memory (at least 200 GB).
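For reference, such a remount could look like the following (the 200G figure comes from the estimate above; this requires root and only lasts until reboot):

```shell
# Inspect the current size of the tmpfs backing /dev/shm
df -h /dev/shm

# Temporarily grow it to 200 GB
sudo mount -o remount,size=200G /dev/shm
```

For a persistent change, the `size=` option can instead be set on the `/dev/shm` tmpfs entry in `/etc/fstab`.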
Thanks for pointing that out. My memory is also limited, so I will check whether I have the same problem.
Hi,
I am currently facing an error while trying to use the shared-memory variant of dataset D. It occurs in the following line: https://github.com/lukashermann/hulc/blob/main/hulc/datasets/utils/shared_memory_utils.py#L192, where the start_idx variable does not match the current dataset. I re-downloaded the dataset to make sure everything was installed correctly, but that did not fix it.
Without the shared-memory variant the code runs without any errors. However, I have some general performance issues on a SLURM cluster with 4x3090: currently, one epoch of training HULC on task D takes approximately 70 hours without shared memory. I have already experimented with the batch size and the number of workers, but so far it has not helped. Does not using the shared-memory dataset cause such a huge difference in training speed? Do you have any advice to improve the performance?
Thanks in advance! Best regards.