lukashermann / hulc

Hierarchical Universal Language Conditioned Policies
MIT License

Error while loading data into shared memory #8

Closed mbreuss closed 5 months ago

mbreuss commented 1 year ago


I am currently facing an error while trying to use the shared memory variant of dataset D. The error occurs in the following line, where the start_idx variable does not match the current dataset. I tried reinstalling the dataset to make sure nothing was missing and tried to fix the error myself, but neither helped.


Without the shared memory variant, the code runs without any errors. However, I have general performance issues on a Slurm cluster with 4x3090. Currently, one epoch of training Hulc on task D takes approximately 70 hours without shared memory. I have already experimented with the batch size and the number of workers, but so far it has not helped. Does not using the shared memory dataset cause such a huge difference in training speed? Do you have any advice for improving performance?

Thanks in advance! Best regards.

lukashermann commented 1 year ago

Hi @mbreuss, did you maybe run the shared memory variant with a smaller debug dataset before? Try deleting the shared memory files in /dev/shm/; they are called /dev/shm/train_* and /dev/shm/val_*. Also delete train_shm_lookup.npy and val_shm_lookup.npy in the tmp or slurm_temp directory (see here).
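The cleanup described above could be scripted roughly as follows. This is only a sketch: the function name `clear_stale_shm`, the `dry_run` switch, and the default `lookup_dir="/tmp"` are assumptions made for illustration, so point `lookup_dir` at whatever tmp or slurm_temp directory your run actually uses.

```python
import glob
import os


def clear_stale_shm(shm_dir="/dev/shm", lookup_dir="/tmp", dry_run=True):
    """List stale shared-memory files; delete them only if dry_run is False.

    lookup_dir is an assumed location for the *_shm_lookup.npy files.
    """
    patterns = [
        os.path.join(shm_dir, "train_*"),
        os.path.join(shm_dir, "val_*"),
        os.path.join(lookup_dir, "train_shm_lookup.npy"),
        os.path.join(lookup_dir, "val_shm_lookup.npy"),
    ]
    # Collect regular files only, so directories are never touched.
    stale = sorted(
        path
        for pattern in patterns
        for path in glob.glob(pattern)
        if os.path.isfile(path)
    )
    if not dry_run:
        for path in stale:
            os.remove(path)
    return stale


if __name__ == "__main__":
    # Dry run first: prints what would be removed without deleting anything.
    for path in clear_stale_shm():
        print(path)
```

Running it with the default `dry_run=True` first makes it easy to check that only files from an earlier run are matched before calling it again with `dry_run=False`.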

It's weird that it takes so long without shared memory; something definitely seems to be wrong. Shared memory gave us a speed-up of at most 50%. Where do you load the dataset from? Is it maybe on a slow network mount? Try running the PyTorch Lightning profiler to see where the bottleneck is. You can also post the results here and I will compare them to our cluster. For debugging, you might want to use the hydra command-line flags +trainer.limit_train_batches=10 and +trainer.limit_val_batches=10 (or a similar number); then you don't have to wait for the whole epoch, but you still get an estimate of the time.
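To turn a short limited run into a rough epoch estimate, the measured time per batch can be scaled up to the full epoch. The sketch below assumes you time the limited run yourself; the helper name `estimate_epoch_hours` and the numbers in the example are hypothetical, not measurements from this thread.

```python
def estimate_epoch_hours(measured_seconds, measured_batches, batches_per_epoch):
    """Extrapolate full-epoch duration (in hours) from a short limited run."""
    if measured_batches <= 0:
        raise ValueError("measured_batches must be positive")
    seconds_per_batch = measured_seconds / measured_batches
    return seconds_per_batch * batches_per_epoch / 3600.0


if __name__ == "__main__":
    # Hypothetical numbers: 10 batches took 60 s, a full epoch has 12000 batches.
    print(estimate_epoch_hours(60.0, 10, 12000))
```

The first few batches usually include warm-up costs such as worker start-up, so an estimate based on only 10 batches is likely to be on the pessimistic side.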

mbreuss commented 1 year ago


Thanks for the tips. This is the result of using the simple profiler with 10 training and validation batches:

FIT Profiler Report

|  Action                                                                                                                                                                                                       |  Mean duration (s)    |  Num calls      |  Total time (s)        |  Percentage %         |
|  Total                                                                                                                                                                                                        |  -                    |  1981           |  163.14                |  100 %                |
|  run_training_epoch                                                                                                                                                                                           |  67.911               |  2              |  135.82                |  83.255               |
|  [Strategy]DDPStrategy.batch_to_device                                                                                                                                                                        |  0.66106              |  42             |  27.764                |  17.019               |
|  run_training_batch                                                                                                                                                                                           |  0.62769              |  20             |  12.554                |  7.6951               |
|  [LightningModule]GCBC.optimizer_step                                                                                                                                                                         |  0.62632              |  20             |  12.526                |  7.6784               |
|  [Strategy]DDPStrategy.backward                                                                                                                                                                               |  0.578                |  20             |  11.56                 |  7.0859               |
|  [Callback]ModelCheckpoint{'monitor': None, 'mode': 'min', 'every_n_train_steps': 0, 'every_n_epochs': 1, 'train_time_interval': None, 'save_on_train_epoch_end': True}.on_train_epoch_end                    |  1.6921               |  2              |  3.3842                |  2.0744               |
|  [Strategy]DDPStrategy.validation_step                                                                                                                                                                        |  0.096068             |  22             |  2.1135                |  1.2955               |
|  [Strategy]DDPStrategy.training_step                                                                                                                                                                          |  0.038656             |  20             |  0.77312               |  0.4739               |
|  on_train_batch_end                                                                                                                                                                                           |  0.0028381            |  20             |  0.056762              |  0.034793             |
|  [LightningModule]GCBC.on_fit_start                                                                                                                                                                           |  0.05676              |  1              |  0.05676               |  0.034792             |
|  [Callback]TQDMProgressBar.on_validation_batch_end                                                                                                                                                            |  0.0023747            |  22             |  0.052243              |  0.032023             |
|  [LightningModule]GCBC.optimizer_zero_grad                                                                                                                                                                    |  0.00074818           |  20             |  0.014964              |  0.0091722            |
|  [Callback]TQDMProgressBar.on_validation_batch_start                                                                                                                                                          |  0.00048066           |  22             |  0.010575              |  0.0064819            |
|  [LightningModule]GCBC.on_validation_epoch_start                                                                                                                                                              |  0.0019881            |  3              |  0.0059643             |  0.0036559            |
|  [Callback]TQDMProgressBar.on_train_epoch_end                                                                                                                                                                 |  0.002866             |  2              |  0.0057319             |  0.0035135            |
|  [LightningModule]GCBC.on_validation_epoch_end                                                                                                                                                                |  0.0017426            |  3              |  0.0052278             |  0.0032045            |
|  on_train_batch_start                                                                                                                                                                                         |  0.0002309            |  20             |  0.004618              |  0.0028307            |
|  [Callback]TQDMProgressBar.on_validation_end                                                                                                                                                                  |  0.001477             |  3              |  0.004431              |  0.002716             |
|  [LightningModule]GCBC.lr_scheduler_step                                                                                                                                                                      |  0.00020036           |  20             |  0.0040073             |  0.0024563            |
|  [LightningModule]GCBC.configure_optimizers                                                                                                                                                                   |  0.0032249            |  1              |  0.0032249             |  0.0019768            |
|  [LightningModule]GCBC.on_validation_model_train                                                                                                                                                              |  0.00097004           |  3              |  0.0029101             |  0.0017838            |
|  [LightningModule]GCBC.on_train_epoch_end                                                                                                                                                                     |  0.0013038            |  2              |  0.0026076             |  0.0015984            |
|  [LightningModule]GCBC.on_validation_model_eval                                                                                                                                                               |  0.00085685           |  3              |  0.0025706             |  0.0015757            |
|  [Callback]ModelSummary.on_fit_start                                                                                                                                                                          |  0.0022834            |  1              |  0.0022834             |  0.0013997            |
|  [Callback]TQDMProgressBar.on_train_epoch_start                                                                                                                                                               |  0.0010873            |  2              |  0.0021746             |  0.001333             |
|  [Callback]TQDMProgressBar.on_validation_start                                                                                                                                                                |  0.00060386           |  3              |  0.0018116             |  0.0011104            |
|  [LightningModule]GCBC.on_train_epoch_start                                                                                                                                                                   |  0.00062183           |  2              |  0.0012437             |  0.00076232           |
|  [Callback]TQDMProgressBar.on_train_end                                                                                                                                                                       |  0.00070996           |  1              |  0.00070996            |  0.00043519           |
|  [Callback]ModelSummary.on_validation_batch_end                                                                                                                                                               |  3.1038e-05           |  22             |  0.00068284            |  0.00041856           |
|  [Callback]TQDMProgressBar.on_sanity_check_start                                                                                                                                                              |  0.00064962           |  1              |  0.00064962            |  0.0003982            |
|  [Callback]KLConstantSchedule.on_validation_batch_start                                                                                                                                                       |  2.4464e-05           |  22             |  0.0005382             |  0.0003299            |
|  [Callback]KLConstantSchedule.on_batch_start                                                                                                                                                                  |  1.8397e-05           |  20             |  0.00036793            |  0.00022553           |
|  [Callback]KLConstantSchedule.on_after_backward                                                                                                                                                               |  1.7703e-05           |  20             |  0.00035407            |  0.00021703           |
|  [Callback]TQDMProgressBar.on_train_start                                                                                                                                                                     |  0.00033575           |  1              |  0.00033575            |  0.00020581           |
|  [Callback]ModelCheckpoint{'monitor': None, 'mode': 'min', 'every_n_train_steps': 0, 'every_n_epochs': 1, 'train_time_interval': None, 'save_on_train_epoch_end': True}.on_validation_end                     |  9.2307e-05           |  3              |  0.00027692            |  0.00016974           |
|  [Callback]KLConstantSchedule.on_before_optimizer_step                                                                                                                                                        |  1.1763e-05           |  20             |  0.00023526            |  0.00014421           |
|  [Callback]SignalCallback.on_validation_batch_start                                                                                                                                                           |  8.7329e-06           |  22             |  0.00019212            |  0.00011777           |
|  [Callback]ModelSummary.on_validation_batch_start                                                                                                                                                             |  8.3075e-06           |  22             |  0.00018276            |  0.00011203           |
|  [Callback]KLConstantSchedule.on_before_zero_grad                                                                                                                                                             |  8.9151e-06           |  20             |  0.0001783             |  0.00010929           |
|  [Callback]SignalCallback.on_fit_start                                                                                                                                                                        |  0.00017511           |  1              |  0.00017511            |  0.00010734           |
|  [Callback]KLConstantSchedule.on_validation_batch_end                                                                                                                                                         |  7.3507e-06           |  22             |  0.00016172            |  9.9127e-05           |
|  [LightningModule]GCBC.on_before_backward                                                                                                                                                                     |  7.4764e-06           |  20             |  0.00014953            |  9.1656e-05           |
|  [Callback]KLConstantSchedule.on_batch_end                                                                                                                                                                    |  7.4612e-06           |  20             |  0.00014922            |  9.147e-05            |
|  [Callback]SignalCallback.on_after_backward                                                                                                                                                                   |  7.2972e-06           |  20             |  0.00014594            |  8.946e-05            |
|  [Callback]KLConstantSchedule.on_before_backward                                                                                                                                                              |  6.6953e-06           |  20             |  0.00013391            |  8.208e-05            |
|  [Callback]LearningRateMonitor.on_validation_batch_start                                                                                                                                                      |  5.6677e-06           |  22             |  0.00012469            |  7.643e-05            |
|  [Callback]SignalCallback.on_before_optimizer_step                                                                                                                                                            |  5.9794e-06           |  20             |  0.00011959            |  7.3303e-05           |
|  [Callback]SignalCallback.on_validation_batch_end                                                                                                                                                             |  5.4276e-06           |  22             |  0.00011941            |  7.3193e-05           |
|  [Callback]LearningRateMonitor.on_after_backward                                                                                                                                                              |  5.7726e-06           |  20             |  0.00011545            |  7.0769e-05           |
|  [Callback]ModelCheckpoint{'monitor': None, 'mode': 'min', 'every_n_train_steps': 0, 'every_n_epochs': 1, 'train_time_interval': None, 'save_on_train_epoch_end': True}.on_after_backward                     |  5.7529e-06           |  20             |  0.00011506            |  7.0528e-05           |
|  [Callback]GradientAccumulationScheduler.on_validation_batch_end                                                                                                                                              |  5.2079e-06           |  22             |  0.00011457            |  7.023e-05            |
|  [Callback]LearningRateMonitor.on_train_start                                                                                                                                                                 |  0.00011443           |  1              |  0.00011443            |  7.0145e-05           |
|  [Callback]GradientAccumulationScheduler.on_after_backward                                                                                                                                                    |  5.6747e-06           |  20             |  0.00011349            |  6.9568e-05           |
|  [Callback]ModelCheckpoint{'monitor': None, 'mode': 'min', 'every_n_train_steps': 0, 'every_n_epochs': 1, 'train_time_interval': None, 'save_on_train_epoch_end': True}.on_validation_batch_start             |  5.1261e-06           |  22             |  0.00011277            |  6.9127e-05           |
|  [Callback]LearningRateMonitor.on_before_optimizer_step                                                                                                                                                       |  5.4161e-06           |  20             |  0.00010832            |  6.6398e-05           |
|  [Callback]SignalCallback.on_batch_start                                                                                                                                                                      |  5.4023e-06           |  20             |  0.00010805            |  6.6229e-05           |
|  [Callback]TQDMProgressBar.on_after_backward                                                                                                                                                                  |  5.3818e-06           |  20             |  0.00010764            |  6.5977e-05           |
|  [Callback]SignalCallback.on_batch_end                                                                                                                                                                        |  5.2926e-06           |  20             |  0.00010585            |  6.4885e-05           |
|  [Callback]ModelSummary.on_before_optimizer_step                                                                                                                                                              |  5.2362e-06           |  20             |  0.00010472            |  6.4192e-05           |
|  [Callback]ModelCheckpoint{'monitor': None, 'mode': 'min', 'every_n_train_steps': 0, 'every_n_epochs': 1, 'train_time_interval': None, 'save_on_train_epoch_end': True}.on_before_optimizer_step              |  5.2008e-06           |  20             |  0.00010402            |  6.3759e-05           |
|  [Callback]TQDMProgressBar.on_before_optimizer_step                                                                                                                                                           |  5.1992e-06           |  20             |  0.00010398            |  6.374e-05            |
|  [Callback]ModelSummary.on_after_backward                                                                                                                                                                     |  5.1827e-06           |  20             |  0.00010365            |  6.3536e-05           |
|  [Callback]ModelCheckpoint{'monitor': None, 'mode': 'min', 'every_n_train_steps': 0, 'every_n_epochs': 1, 'train_time_interval': None, 'save_on_train_epoch_end': True}.on_validation_batch_end               |  4.688e-06            |  22             |  0.00010314            |  6.3219e-05           |
|  [Callback]ModelCheckpoint{'monitor': None, 'mode': 'min', 'every_n_train_steps': 0, 'every_n_epochs': 1, 'train_time_interval': None, 'save_on_train_epoch_end': True}.on_batch_end                          |  5.1272e-06           |  20             |  0.00010254            |  6.2856e-05           |
|  [Callback]GradientAccumulationScheduler.on_before_optimizer_step                                                                                                                                             |  5.0483e-06           |  20             |  0.00010097            |  6.189e-05            |
|  [Callback]GradientAccumulationScheduler.on_batch_start                                                                                                                                                       |  5.0287e-06           |  20             |  0.00010057            |  6.1648e-05           |
|  [Callback]LearningRateMonitor.on_batch_start                                                                                                                                                                 |  4.8955e-06           |  20             |  9.7911e-05            |  6.0016e-05           |
|  [Callback]GradientAccumulationScheduler.on_batch_end                                                                                                                                                         |  4.8606e-06           |  20             |  9.7212e-05            |  5.9588e-05           |
|  [Callback]LearningRateMonitor.on_validation_batch_end                                                                                                                                                        |  4.3611e-06           |  22             |  9.5945e-05            |  5.8811e-05           |
|  [Callback]SignalCallback.on_before_backward                                                                                                                                                                  |  4.6392e-06           |  20             |  9.2783e-05            |  5.6873e-05           |
|  [Callback]LearningRateMonitor.on_batch_end                                                                                                                                                                   |  4.628e-06            |  20             |  9.2561e-05            |  5.6737e-05           |
|  [Callback]GradientAccumulationScheduler.on_validation_batch_start                                                                                                                                            |  4.1629e-06           |  22             |  9.1583e-05            |  5.6138e-05           |
|  [Callback]SignalCallback.on_before_zero_grad                                                                                                                                                                 |  4.3882e-06           |  20             |  8.7764e-05            |  5.3797e-05           |
|  [Callback]ModelSummary.on_validation_end                                                                                                                                                                     |  2.9251e-05           |  3              |  8.7754e-05            |  5.3791e-05           |
|  [Callback]ModelCheckpoint{'monitor': None, 'mode': 'min', 'every_n_train_steps': 0, 'every_n_epochs': 1, 'train_time_interval': None, 'save_on_train_epoch_end': True}.on_batch_start                        |  4.2837e-06           |  20             |  8.5673e-05            |  5.2515e-05           |
|  [Callback]ModelSummary.on_batch_end                                                                                                                                                                          |  4.2771e-06           |  20             |  8.5542e-05            |  5.2435e-05           |
|  [Callback]ModelCheckpoint{'monitor': None, 'mode': 'min', 'every_n_train_steps': 0, 'every_n_epochs': 1, 'train_time_interval': None, 'save_on_train_epoch_end': True}.on_before_zero_grad                   |  4.183e-06            |  20             |  8.3661e-05            |  5.1282e-05           |
|  [Callback]TQDMProgressBar.on_batch_end                                                                                                                                                                       |  4.1562e-06           |  20             |  8.3124e-05            |  5.0952e-05           |
|  [Callback]TQDMProgressBar.on_batch_start                                                                                                                                                                     |  3.961e-06            |  20             |  7.922e-05             |  4.856e-05            |
|  [Callback]LearningRateMonitor.on_before_backward                                                                                                                                                             |  3.9247e-06           |  20             |  7.8493e-05            |  4.8114e-05           |
|  [Callback]ModelSummary.on_batch_start                                                                                                                                                                        |  3.8938e-06           |  20             |  7.7875e-05            |  4.7735e-05           |
|  [Callback]GradientAccumulationScheduler.on_before_zero_grad                                                                                                                                                  |  3.893e-06            |  20             |  7.7861e-05            |  4.7726e-05           |
|  [Callback]TQDMProgressBar.on_before_backward                                                                                                                                                                 |  3.8672e-06           |  20             |  7.7343e-05            |  4.7409e-05           |
|  [Callback]LearningRateMonitor.on_before_zero_grad                                                                                                                                                            |  3.8611e-06           |  20             |  7.7222e-05            |  4.7335e-05           |
|  [LightningModule]GCBC.on_validation_batch_start                                                                                                                                                              |  3.4864e-06           |  22             |  7.6701e-05            |  4.7015e-05           |
|  [Callback]ModelSummary.on_before_zero_grad                                                                                                                                                                   |  3.8142e-06           |  20             |  7.6283e-05            |  4.6759e-05           |
|  [Callback]ModelCheckpoint{'monitor': None, 'mode': 'min', 'every_n_train_steps': 0, 'every_n_epochs': 1, 'train_time_interval': None, 'save_on_train_epoch_end': True}.on_before_backward                    |  3.801e-06            |  20             |  7.6021e-05            |  4.6599e-05           |
|  [Callback]TQDMProgressBar.on_before_zero_grad                                                                                                                                                                |  3.6946e-06           |  20             |  7.3891e-05            |  4.5293e-05           |
|  [Callback]ModelSummary.on_before_backward                                                                                                                                                                    |  3.5121e-06           |  20             |  7.0242e-05            |  4.3056e-05           |
|  [Callback]GradientAccumulationScheduler.on_before_backward                                                                                                                                                   |  3.4862e-06           |  20             |  6.9724e-05            |  4.2739e-05           |
|  [LightningModule]GCBC.on_train_batch_end                                                                                                                                                                     |  3.402e-06            |  20             |  6.804e-05             |  4.1706e-05           |
|  [LightningModule]GCBC.training_step_end                                                                                                                                                                      |  3.305e-06            |  20             |  6.61e-05              |  4.0517e-05           |
|  [LightningModule]GCBC.on_before_zero_grad                                                                                                                                                                    |  3.2048e-06           |  20             |  6.4095e-05            |  3.9288e-05           |
|  [Callback]KLConstantSchedule.on_validation_end                                                                                                                                                               |  2.1258e-05           |  3              |  6.3773e-05            |  3.9091e-05           |
|  [Callback]ModelSummary.on_train_epoch_end                                                                                                                                                                    |  3.1795e-05           |  2              |  6.3591e-05            |  3.8979e-05           |
|  [LightningModule]GCBC.validation_step_end                                                                                                                                                                    |  2.8019e-06           |  22             |  6.1642e-05            |  3.7785e-05           |
|  [LightningModule]GCBC.on_validation_batch_end                                                                                                                                                                |  2.796e-06            |  22             |  6.1512e-05            |  3.7705e-05           |
|  [Callback]KLConstantSchedule.on_epoch_end                                                                                                                                                                    |  1.205e-05            |  5              |  6.0252e-05            |  3.6933e-05           |
|  [LightningModule]GCBC.on_train_batch_start                                                                                                                                                                   |  2.9907e-06           |  20             |  5.9814e-05            |  3.6664e-05           |
|  [LightningModule]GCBC.on_after_backward                                                                                                                                                                      |  2.929e-06            |  20             |  5.8581e-05            |  3.5908e-05           |
|  [Strategy]DDPStrategy.validation_step_end                                                                                                                                                                    |  2.6555e-06           |  22             |  5.842e-05             |  3.581e-05            |
|  [Callback]KLConstantSchedule.on_epoch_start                                                                                                                                                                  |  1.0158e-05           |  5              |  5.0792e-05            |  3.1134e-05           |
|  [LightningModule]GCBC.on_before_optimizer_step                                                                                                                                                               |  2.5365e-06           |  20             |  5.0731e-05            |  3.1096e-05           |
|  [Callback]ModelSummary.on_train_epoch_start                                                                                                                                                                  |  2.3921e-05           |  2              |  4.7842e-05            |  2.9326e-05           |
|  [Strategy]DDPStrategy.on_train_batch_start                                                                                                                                                                   |  2.369e-06            |  20             |  4.738e-05             |  2.9043e-05           |
|  [Callback]SignalCallback.on_epoch_end                                                                                                                                                                        |  8.676e-06            |  5              |  4.338e-05             |  2.6591e-05           |
|  [Callback]ModelCheckpoint{'monitor': None, 'mode': 'min', 'every_n_train_steps': 0, 'every_n_epochs': 1, 'train_time_interval': None, 'save_on_train_epoch_end': None}.setup                                 |  4.2241e-05           |  1              |  4.2241e-05            |  2.5893e-05           |
|  [Callback]LearningRateMonitor.on_epoch_end                                                                                                                                                                   |  8.1663e-06           |  5              |  4.0831e-05            |  2.5028e-05           |
|  [Callback]SignalCallback.on_epoch_start                                                                                                                                                                      |  8.1402e-06           |  5              |  4.0701e-05            |  2.4949e-05           |
|  [Callback]ModelCheckpoint{'monitor': None, 'mode': 'min', 'every_n_train_steps': 0, 'every_n_epochs': 1, 'train_time_interval': None, 'save_on_train_epoch_end': True}.on_epoch_end                          |  7.3404e-06           |  5              |  3.6702e-05            |  2.2497e-05           |
|  [Callback]ModelCheckpoint{'monitor': None, 'mode': 'min', 'every_n_train_steps': 0, 'every_n_epochs': 1, 'train_time_interval': None, 'save_on_train_epoch_end': True}.on_epoch_start                        |  7.3322e-06           |  5              |  3.6661e-05            |  2.2472e-05           |
|  [Strategy]DDPStrategy.training_step_end                                                                                                                                                                      |  1.768e-06            |  20             |  3.5359e-05            |  2.1674e-05           |
|  [Callback]GradientAccumulationScheduler.on_epoch_start                                                                                                                                                       |  6.8702e-06           |  5              |  3.4351e-05            |  2.1056e-05           |
|  [Callback]LearningRateMonitor.on_epoch_start                                                                                                                                                                 |  6.086e-06            |  5              |  3.043e-05             |  1.8653e-05           |
|  [Callback]KLConstantSchedule.on_train_epoch_start                                                                                                                                                            |  1.419e-05            |  2              |  2.838e-05             |  1.7396e-05           |
|  [Callback]ModelSummary.on_validation_start                                                                                                                                                                   |  8.5006e-06           |  3              |  2.5502e-05            |  1.5632e-05           |
|  [Callback]KLConstantSchedule.on_validation_epoch_end                                                                                                                                                         |  8.3103e-06           |  3              |  2.4931e-05            |  1.5282e-05           |
|  [Callback]GradientAccumulationScheduler.on_epoch_end                                                                                                                                                         |  4.9142e-06           |  5              |  2.4571e-05            |  1.5061e-05           |
|  [Callback]KLConstantSchedule.on_validation_start                                                                                                                                                             |  8.1844e-06           |  3              |  2.4553e-05            |  1.505e-05            |
|  [Callback]ModelSummary.on_epoch_end                                                                                                                                                                          |  4.814e-06            |  5              |  2.407e-05             |  1.4754e-05           |
|  [Callback]ModelSummary.on_epoch_start                                                                                                                                                                        |  4.724e-06            |  5              |  2.362e-05             |  1.4478e-05           |
|  [Callback]TQDMProgressBar.on_epoch_end                                                                                                                                                                       |  4.5182e-06           |  5              |  2.2591e-05            |  1.3848e-05           |
|  [Callback]LearningRateMonitor.on_train_epoch_start                                                                                                                                                           |  1.1021e-05           |  2              |  2.2042e-05            |  1.3511e-05           |
|  [LightningModule]GCBC.validation_epoch_end                                                                                                                                                                   |  7.2837e-06           |  3              |  2.1851e-05            |  1.3394e-05           |
|  [Callback]TQDMProgressBar.on_epoch_start                                                                                                                                                                     |  4.346e-06            |  5              |  2.173e-05             |  1.332e-05            |
|  [Callback]GradientAccumulationScheduler.on_validation_end                                                                                                                                                    |  6.2104e-06           |  3              |  1.8631e-05            |  1.142e-05            |
|  [Callback]SignalCallback.on_validation_start                                                                                                                                                                 |  6.0137e-06           |  3              |  1.8041e-05            |  1.1059e-05           |
|  [Callback]KLConstantSchedule.on_save_checkpoint                                                                                                                                                              |  8.8461e-06           |  2              |  1.7692e-05            |  1.0845e-05           |
|  [Callback]GradientAccumulationScheduler.on_train_epoch_start                                                                                                                                                 |  8.605e-06            |  2              |  1.721e-05             |  1.0549e-05           |
|  [Callback]KLConstantSchedule.on_validation_epoch_start                                                                                                                                                       |  5.347e-06            |  3              |  1.6041e-05            |  9.8326e-06           |
|  [Callback]SignalCallback.on_validation_end                                                                                                                                                                   |  5.2633e-06           |  3              |  1.579e-05             |  9.6787e-06           |
|  [Callback]ModelSummary.on_train_end                                                                                                                                                                          |  1.5701e-05           |  1              |  1.5701e-05            |  9.6242e-06           |
|  [Callback]GradientAccumulationScheduler.on_validation_start                                                                                                                                                  |  5.1167e-06           |  3              |  1.535e-05             |  9.4091e-06           |
|  [Callback]ModelCheckpoint{'monitor': None, 'mode': 'min', 'every_n_train_steps': 0, 'every_n_epochs': 1, 'train_time_interval': None, 'save_on_train_epoch_end': True}.on_validation_start                   |  5.0467e-06           |  3              |  1.514e-05             |  9.2804e-06           |
|  [Callback]TQDMProgressBar.on_sanity_check_end                                                                                                                                                                |  1.5061e-05           |  1              |  1.5061e-05            |  9.2319e-06           |
|  [Callback]LearningRateMonitor.on_validation_start                                                                                                                                                            |  4.9601e-06           |  3              |  1.488e-05             |  9.1211e-06           |
|  [Callback]ModelCheckpoint{'monitor': None, 'mode': 'min', 'every_n_train_steps': 0, 'every_n_epochs': 1, 'train_time_interval': None, 'save_on_train_epoch_end': True}.on_validation_epoch_end               |  4.7633e-06           |  3              |  1.429e-05             |  8.7593e-06           |
|  [Callback]SignalCallback.on_validation_epoch_end                                                                                                                                                             |  4.7004e-06           |  3              |  1.4101e-05            |  8.6436e-06           |
|  [Callback]ModelCheckpoint{'monitor': None, 'mode': 'min', 'every_n_train_steps': 0, 'every_n_epochs': 1, 'train_time_interval': None, 'save_on_train_epoch_end': True}.on_validation_epoch_start             |  4.6899e-06           |  3              |  1.407e-05             |  8.6243e-06           |
|  [Callback]LearningRateMonitor.on_validation_epoch_end                                                                                                                                                        |  4.6834e-06           |  3              |  1.405e-05             |  8.6123e-06           |
|  [Callback]TQDMProgressBar.on_validation_epoch_end                                                                                                                                                            |  4.6736e-06           |  3              |  1.4021e-05            |  8.5944e-06           |
|  [Callback]ModelSummary.on_validation_epoch_start                                                                                                                                                             |  4.637e-06            |  3              |  1.3911e-05            |  8.527e-06            |
|  [Callback]SignalCallback.on_validation_epoch_start                                                                                                                                                           |  4.5234e-06           |  3              |  1.357e-05             |  8.3181e-06           |
|  [Callback]KLConstantSchedule.on_train_epoch_end                                                                                                                                                              |  6.76e-06             |  2              |  1.352e-05             |  8.2874e-06           |
|  [LightningModule]GCBC.on_epoch_end                                                                                                                                                                           |  2.6761e-06           |  5              |  1.338e-05             |  8.2017e-06           |
|  [Callback]GradientAccumulationScheduler.on_validation_epoch_start                                                                                                                                            |  4.3937e-06           |  3              |  1.3181e-05            |  8.0796e-06           |
|  [Callback]ModelSummary.on_validation_epoch_end                                                                                                                                                               |  4.3337e-06           |  3              |  1.3001e-05            |  7.9692e-06           |
|  [Callback]TQDMProgressBar.on_validation_epoch_start                                                                                                                                                          |  4.27e-06             |  3              |  1.281e-05             |  7.8522e-06           |
|  [Callback]LearningRateMonitor.on_validation_epoch_start                                                                                                                                                      |  4.2599e-06           |  3              |  1.278e-05             |  7.8337e-06           |
|  [Callback]GradientAccumulationScheduler.on_validation_epoch_end                                                                                                                                              |  4.2304e-06           |  3              |  1.2691e-05            |  7.7793e-06           |
|  [Callback]TQDMProgressBar.setup                                                                                                                                                                              |  1.263e-05            |  1              |  1.263e-05             |  7.7419e-06           |
|  [Callback]GradientAccumulationScheduler.on_train_epoch_end                                                                                                                                                   |  6.3105e-06           |  2              |  1.2621e-05            |  7.7363e-06           |
|  [Callback]LearningRateMonitor.on_validation_end                                                                                                                                                              |  4.1804e-06           |  3              |  1.2541e-05            |  7.6874e-06           |
|  [LightningModule]GCBC.on_epoch_start                                                                                                                                                                         |  2.2219e-06           |  5              |  1.111e-05             |  6.8099e-06           |
|  [Callback]ModelCheckpoint{'monitor': None, 'mode': 'min', 'every_n_train_steps': 0, 'every_n_epochs': 1, 'train_time_interval': None, 'save_on_train_epoch_end': True}.on_train_start                        |  1.103e-05            |  1              |  1.103e-05             |  6.7611e-06           |
|  [Callback]ModelCheckpoint{'monitor': None, 'mode': 'min', 'every_n_train_steps': 0, 'every_n_epochs': 1, 'train_time_interval': None, 'save_on_train_epoch_end': True}.on_train_epoch_start                  |  5.2999e-06           |  2              |  1.06e-05              |  6.4974e-06           |
|  [Callback]SignalCallback.on_train_epoch_start                                                                                                                                                                |  5.0949e-06           |  2              |  1.019e-05             |  6.2461e-06           |
|  [Callback]SignalCallback.on_save_checkpoint                                                                                                                                                                  |  4.94e-06             |  2              |  9.8799e-06            |  6.0561e-06           |
|  [LightningModule]GCBC.configure_callbacks                                                                                                                                                                    |  9.8401e-06           |  1              |  9.8401e-06            |  6.0317e-06           |
|  [Callback]ModelSummary.on_save_checkpoint                                                                                                                                                                    |  4.6155e-06           |  2              |  9.231e-06             |  5.6584e-06           |
|  [Callback]GradientAccumulationScheduler.on_fit_start                                                                                                                                                         |  9.0799e-06           |  1              |  9.0799e-06            |  5.5657e-06           |
|  [Strategy]DDPStrategy.on_validation_end                                                                                                                                                                      |  2.9767e-06           |  3              |  8.9302e-06            |  5.474e-06            |
|  [Callback]ModelSummary.on_sanity_check_start                                                                                                                                                                 |  8.8899e-06           |  1              |  8.8899e-06            |  5.4493e-06           |
|  [Callback]ModelSummary.on_train_start                                                                                                                                                                        |  8.76e-06             |  1              |  8.76e-06              |  5.3696e-06           |
|  [Callback]SignalCallback.on_train_epoch_end                                                                                                                                                                  |  4.3655e-06           |  2              |  8.7309e-06            |  5.3518e-06           |
|  [Callback]TQDMProgressBar.on_save_checkpoint                                                                                                                                                                 |  4.3405e-06           |  2              |  8.6811e-06            |  5.3213e-06           |
|  [Callback]LearningRateMonitor.on_save_checkpoint                                                                                                                                                             |  4.095e-06            |  2              |  8.1901e-06            |  5.0203e-06           |
|  [Callback]KLConstantSchedule.on_train_end                                                                                                                                                                    |  8.0599e-06           |  1              |  8.0599e-06            |  4.9405e-06           |
|  [Callback]GradientAccumulationScheduler.on_train_start                                                                                                                                                       |  7.72e-06             |  1              |  7.72e-06              |  4.7321e-06           |
|  [LightningModule]GCBC.on_validation_end                                                                                                                                                                      |  2.55e-06             |  3              |  7.6501e-06            |  4.6893e-06           |
|  [Callback]LearningRateMonitor.on_train_epoch_end                                                                                                                                                             |  3.74e-06             |  2              |  7.4799e-06            |  4.585e-06            |
|  [Callback]ModelCheckpoint{'monitor': None, 'mode': 'min', 'every_n_train_steps': 0, 'every_n_epochs': 1, 'train_time_interval': None, 'save_on_train_epoch_end': True}.on_save_checkpoint                    |  3.7251e-06           |  2              |  7.4501e-06            |  4.5667e-06           |
|  [Callback]GradientAccumulationScheduler.on_save_checkpoint                                                                                                                                                   |  3.5899e-06           |  2              |  7.1798e-06            |  4.401e-06            |
|  [Callback]KLConstantSchedule.on_pretrain_routine_start                                                                                                                                                       |  7.041e-06            |  1              |  7.041e-06             |  4.3159e-06           |
|  [Callback]KLConstantSchedule.on_train_start                                                                                                                                                                  |  6.73e-06             |  1              |  6.73e-06              |  4.1253e-06           |
|  [Callback]KLConstantSchedule.on_sanity_check_end                                                                                                                                                             |  6.1609e-06           |  1              |  6.1609e-06            |  3.7765e-06           |
|  [Callback]GradientAccumulationScheduler.on_train_end                                                                                                                                                         |  6.15e-06             |  1              |  6.15e-06              |  3.7698e-06           |
|  [LightningModule]GCBC.prepare_data                                                                                                                                                                           |  6.1211e-06           |  1              |  6.1211e-06            |  3.7521e-06           |
|  [Callback]LearningRateMonitor.on_fit_start                                                                                                                                                                   |  6.011e-06            |  1              |  6.011e-06             |  3.6846e-06           |
|  [Callback]KLConstantSchedule.on_fit_end                                                                                                                                                                      |  5.98e-06             |  1              |  5.98e-06              |  3.6656e-06           |
|  [Callback]SignalCallback.on_train_end                                                                                                                                                                        |  5.9502e-06           |  1              |  5.9502e-06            |  3.6473e-06           |
|  [LightningModule]GCBC.on_train_start                                                                                                                                                                         |  5.8399e-06           |  1              |  5.8399e-06            |  3.5797e-06           |
|  [LightningModule]GCBC.on_validation_start                                                                                                                                                                    |  1.9204e-06           |  3              |  5.7612e-06            |  3.5314e-06           |
|  [Callback]ModelCheckpoint{'monitor': None, 'mode': 'min', 'every_n_train_steps': 0, 'every_n_epochs': 1, 'train_time_interval': None, 'save_on_train_epoch_end': True}.on_train_end                          |  5.76e-06             |  1              |  5.76e-06              |  3.5307e-06           |
|  [Callback]KLConstantSchedule.on_fit_start                                                                                                                                                                    |  5.6999e-06           |  1              |  5.6999e-06            |  3.4939e-06           |
|  [Strategy]DDPStrategy.on_validation_start                                                                                                                                                                    |  1.867e-06            |  3              |  5.601e-06             |  3.4332e-06           |
|  [Callback]SignalCallback.on_pretrain_routine_start                                                                                                                                                           |  5.2601e-06           |  1              |  5.2601e-06            |  3.2243e-06           |
|  [Callback]SignalCallback.on_train_start                                                                                                                                                                      |  5.15e-06             |  1              |  5.15e-06              |  3.1568e-06           |
|  [Callback]KLConstantSchedule.on_sanity_check_start                                                                                                                                                           |  5e-06                |  1              |  5e-06                 |  3.0649e-06           |
|  [Callback]ModelSummary.on_sanity_check_end                                                                                                                                                                   |  4.9002e-06           |  1              |  4.9002e-06            |  3.0036e-06           |
|  [Callback]KLConstantSchedule.setup                                                                                                                                                                           |  4.8e-06              |  1              |  4.8e-06               |  2.9423e-06           |
|  [Callback]LearningRateMonitor.on_train_end                                                                                                                                                                   |  4.76e-06             |  1              |  4.76e-06              |  2.9177e-06           |
|  [Callback]GradientAccumulationScheduler.on_sanity_check_start                                                                                                                                                |  4.6499e-06           |  1              |  4.6499e-06            |  2.8502e-06           |
|  [Callback]ModelCheckpoint{'monitor': None, 'mode': 'min', 'every_n_train_steps': 0, 'every_n_epochs': 1, 'train_time_interval': None, 'save_on_train_epoch_end': True}.on_fit_start                          |  4.58e-06             |  1              |  4.58e-06              |  2.8074e-06           |
|  [Callback]ModelCheckpoint{'monitor': None, 'mode': 'min', 'every_n_train_steps': 0, 'every_n_epochs': 1, 'train_time_interval': None, 'save_on_train_epoch_end': True}.on_sanity_check_start                 |  4.2499e-06           |  1              |  4.2499e-06            |  2.605e-06            |
|  [Callback]LearningRateMonitor.on_pretrain_routine_start                                                                                                                                                      |  4.1202e-06           |  1              |  4.1202e-06            |  2.5255e-06           |
|  [Callback]SignalCallback.on_sanity_check_start                                                                                                                                                               |  4.0801e-06           |  1              |  4.0801e-06            |  2.501e-06            |
|  [Callback]KLConstantSchedule.on_pretrain_routine_end                                                                                                                                                         |  3.96e-06             |  1              |  3.96e-06              |  2.4274e-06           |
|  [Callback]GradientAccumulationScheduler.on_sanity_check_end                                                                                                                                                  |  3.95e-06             |  1              |  3.95e-06              |  2.4212e-06           |
|  [Callback]ModelCheckpoint{'monitor': None, 'mode': 'min', 'every_n_train_steps': 0, 'every_n_epochs': 1, 'train_time_interval': None, 'save_on_train_epoch_end': True}.on_sanity_check_end                   |  3.95e-06             |  1              |  3.95e-06              |  2.4212e-06           |
|  [LightningModule]GCBC.on_save_checkpoint                                                                                                                                                                     |  1.975e-06            |  2              |  3.95e-06              |  2.4212e-06           |
|  [Callback]SignalCallback.on_sanity_check_end                                                                                                                                                                 |  3.9199e-06           |  1              |  3.9199e-06            |  2.4028e-06           |
|  [Callback]ModelCheckpoint{'monitor': None, 'mode': 'min', 'every_n_train_steps': 0, 'every_n_epochs': 1, 'train_time_interval': None, 'save_on_train_epoch_end': True}.on_pretrain_routine_start             |  3.8701e-06           |  1              |  3.8701e-06            |  2.3723e-06           |
|  [Callback]TQDMProgressBar.on_fit_start                                                                                                                                                                       |  3.8098e-06           |  1              |  3.8098e-06            |  2.3353e-06           |
|  [Callback]LearningRateMonitor.on_sanity_check_end                                                                                                                                                            |  3.74e-06             |  1              |  3.74e-06              |  2.2925e-06           |
|  [Callback]GradientAccumulationScheduler.on_pretrain_routine_start                                                                                                                                            |  3.6701e-06           |  1              |  3.6701e-06            |  2.2497e-06           |
|  [Callback]SignalCallback.on_pretrain_routine_end                                                                                                                                                             |  3.57e-06             |  1              |  3.57e-06              |  2.1883e-06           |
|  [Callback]TQDMProgressBar.on_pretrain_routine_start                                                                                                                                                          |  3.54e-06             |  1              |  3.54e-06              |  2.1699e-06           |
|  [Callback]ModelCheckpoint{'monitor': None, 'mode': 'min', 'every_n_train_steps': 0, 'every_n_epochs': 1, 'train_time_interval': None, 'save_on_train_epoch_end': True}.on_pretrain_routine_end               |  3.5202e-06           |  1              |  3.5202e-06            |  2.1578e-06           |
|  [Callback]LearningRateMonitor.on_sanity_check_start                                                                                                                                                          |  3.5199e-06           |  1              |  3.5199e-06            |  2.1576e-06           |
|  [Callback]ModelSummary.on_pretrain_routine_start                                                                                                                                                             |  3.441e-06            |  1              |  3.441e-06             |  2.1092e-06           |
|  [Callback]KLConstantSchedule.on_before_accelerator_backend_setup                                                                                                                                             |  3.4301e-06           |  1              |  3.4301e-06            |  2.1025e-06           |
|  [Callback]LearningRateMonitor.on_pretrain_routine_end                                                                                                                                                        |  3.421e-06            |  1              |  3.421e-06             |  2.097e-06            |
|  [Callback]TQDMProgressBar.on_pretrain_routine_end                                                                                                                                                            |  3.41e-06             |  1              |  3.41e-06              |  2.0903e-06           |
|  [Callback]ModelSummary.on_pretrain_routine_end                                                                                                                                                               |  3.35e-06             |  1              |  3.35e-06              |  2.0534e-06           |
|  [Callback]GradientAccumulationScheduler.on_pretrain_routine_end                                                                                                                                              |  3.34e-06             |  1              |  3.34e-06              |  2.0473e-06           |
|  [Callback]KLConstantSchedule.on_configure_sharded_model                                                                                                                                                      |  2.79e-06             |  1              |  2.79e-06              |  1.7102e-06           |
|  [Callback]KLConstantSchedule.teardown                                                                                                                                                                        |  2.5302e-06           |  1              |  2.5302e-06            |  1.5509e-06           |
|  [Callback]SignalCallback.setup                                                                                                                                                                               |  2.4801e-06           |  1              |  2.4801e-06            |  1.5202e-06           |
|  [Callback]TQDMProgressBar.on_before_accelerator_backend_setup                                                                                                                                                |  2.42e-06             |  1              |  2.42e-06              |  1.4834e-06           |
|  [LightningModule]GCBC.on_train_end                                                                                                                                                                           |  2.0701e-06           |  1              |  2.0701e-06            |  1.2689e-06           |
|  [LightningModule]GCBC.on_train_dataloader                                                                                                                                                                    |  2.0599e-06           |  1              |  2.0599e-06            |  1.2626e-06           |
|  [Callback]SignalCallback.on_configure_sharded_model                                                                                                                                                          |  2.0498e-06           |  1              |  2.0498e-06            |  1.2565e-06           |
|  [Callback]ModelSummary.setup                                                                                                                                                                                 |  1.98e-06             |  1              |  1.98e-06              |  1.2137e-06           |
|  [Callback]SignalCallback.on_before_accelerator_backend_setup                                                                                                                                                 |  1.97e-06             |  1              |  1.97e-06              |  1.2075e-06           |
|  [Callback]LearningRateMonitor.setup                                                                                                                                                                          |  1.9399e-06           |  1              |  1.9399e-06            |  1.1891e-06           |
|  [Callback]ModelCheckpoint{'monitor': None, 'mode': 'min', 'every_n_train_steps': 0, 'every_n_epochs': 1, 'train_time_interval': None, 'save_on_train_epoch_end': None}.on_before_accelerator_backend_setup   |  1.9108e-06           |  1              |  1.9108e-06            |  1.1713e-06           |
|  [Callback]SignalCallback.on_fit_end                                                                                                                                                                          |  1.8999e-06           |  1              |  1.8999e-06            |  1.1646e-06           |
|  [Callback]LearningRateMonitor.on_configure_sharded_model                                                                                                                                                     |  1.8501e-06           |  1              |  1.8501e-06            |  1.134e-06            |
|  [LightningModule]GCBC.setup                                                                                                                                                                                  |  1.76e-06             |  1              |  1.76e-06              |  1.0788e-06           |
|  [Callback]ModelCheckpoint{'monitor': None, 'mode': 'min', 'every_n_train_steps': 0, 'every_n_epochs': 1, 'train_time_interval': None, 'save_on_train_epoch_end': True}.on_fit_end                            |  1.7299e-06           |  1              |  1.7299e-06            |  1.0604e-06           |
|  [Callback]ModelSummary.on_configure_sharded_model                                                                                                                                                            |  1.7202e-06           |  1              |  1.7202e-06            |  1.0544e-06           |
|  [Callback]TQDMProgressBar.on_configure_sharded_model                                                                                                                                                         |  1.6999e-06           |  1              |  1.6999e-06            |  1.042e-06            |
|  [Callback]ModelCheckpoint{'monitor': None, 'mode': 'min', 'every_n_train_steps': 0, 'every_n_epochs': 1, 'train_time_interval': None, 'save_on_train_epoch_end': True}.on_configure_sharded_model            |  1.6901e-06           |  1              |  1.6901e-06            |  1.036e-06            |
|  [Callback]LearningRateMonitor.on_fit_end                                                                                                                                                                     |  1.6699e-06           |  1              |  1.6699e-06            |  1.0236e-06           |
|  [Callback]GradientAccumulationScheduler.on_configure_sharded_model                                                                                                                                           |  1.63e-06             |  1              |  1.63e-06              |  9.9917e-07           |
|  [Callback]ModelSummary.on_fit_end                                                                                                                                                                            |  1.6289e-06           |  1              |  1.6289e-06            |  9.9846e-07           |
|  [Callback]LearningRateMonitor.on_before_accelerator_backend_setup                                                                                                                                            |  1.62e-06             |  1              |  1.62e-06              |  9.9303e-07           |
|  [Callback]SignalCallback.teardown                                                                                                                                                                            |  1.62e-06             |  1              |  1.62e-06              |  9.9303e-07           |
|  [Callback]GradientAccumulationScheduler.setup                                                                                                                                                                |  1.5891e-06           |  1              |  1.5891e-06            |  9.7405e-07           |
|  [Callback]TQDMProgressBar.on_fit_end                                                                                                                                                                         |  1.5299e-06           |  1              |  1.5299e-06            |  9.378e-07            |
|  [Callback]ModelCheckpoint{'monitor': None, 'mode': 'min', 'every_n_train_steps': 0, 'every_n_epochs': 1, 'train_time_interval': None, 'save_on_train_epoch_end': True}.teardown                              |  1.5199e-06           |  1              |  1.5199e-06            |  9.3167e-07           |
|  [Callback]GradientAccumulationScheduler.on_before_accelerator_backend_setup                                                                                                                                  |  1.5099e-06           |  1              |  1.5099e-06            |  9.2553e-07           |
|  [Callback]LearningRateMonitor.teardown                                                                                                                                                                       |  1.5099e-06           |  1              |  1.5099e-06            |  9.2553e-07           |
|  [Callback]ModelSummary.teardown                                                                                                                                                                              |  1.5099e-06           |  1              |  1.5099e-06            |  9.2553e-07           |
|  [LightningModule]GCBC.on_fit_end                                                                                                                                                                             |  1.4901e-06           |  1              |  1.4901e-06            |  9.134e-07            |
|  [LightningModule]GCBC.configure_sharded_model                                                                                                                                                                |  1.4701e-06           |  1              |  1.4701e-06            |  9.0112e-07           |
|  [Callback]GradientAccumulationScheduler.on_fit_end                                                                                                                                                           |  1.4699e-06           |  1              |  1.4699e-06            |  9.0098e-07           |
|  [Callback]ModelSummary.on_before_accelerator_backend_setup                                                                                                                                                   |  1.4601e-06           |  1              |  1.4601e-06            |  8.9499e-07           |
|  [Callback]GradientAccumulationScheduler.teardown                                                                                                                                                             |  1.4e-06              |  1              |  1.4e-06               |  8.5817e-07           |
|  [Callback]TQDMProgressBar.teardown                                                                                                                                                                           |  1.39e-06             |  1              |  1.39e-06              |  8.5203e-07           |
|  [Strategy]DDPStrategy.on_train_end                                                                                                                                                                           |  1.211e-06            |  1              |  1.211e-06             |  7.4228e-07           |
|  [LightningModule]GCBC.on_val_dataloader                                                                                                                                                                      |  1.2e-06              |  1              |  1.2e-06               |  7.3557e-07           |
|  [Strategy]DDPStrategy.on_train_start                                                                                                                                                                         |  1.16e-06             |  1              |  1.16e-06              |  7.1102e-07           |
|  [LightningModule]GCBC.on_pretrain_routine_start                                                                                                                                                              |  1.15e-06             |  1              |  1.15e-06              |  7.0489e-07           |
|  [LightningModule]GCBC.on_pretrain_routine_end                                                                                                                                                                |  8.1002e-07           |  1              |  8.1002e-07            |  4.9652e-07           |
|  [LightningModule]GCBC.teardown                                                                                                                                                                               |  7.4995e-07           |  1              |  7.4995e-07            |  4.597e-07            |
lukashermann commented 1 year ago

Could you provide the exact command you used to run this? And is this on a SLURM cluster with 4x3090? I should have mentioned that you should disable the rollout callbacks for profiling, since they add computational overhead in the first epochs. You can do that by appending ~callbacks/rollout ~callbacks/rollout_lh ~callbacks/tsne_plot to your run command.

mbreuss commented 1 year ago

Hi, I did disable the rollout callbacks for this run. The training ran on the cluster with 4x3090 but was started without slurm. The command: /home/temp_store/miniconda3/envs/c_env/bin/python /home/temp_store/code/hulc/hulc/ +trainer.limit_train_batches=10 +trainer.limit_val_batches=10

The config I used:

    _target_: pytorch_lightning.callbacks.ModelCheckpoint
    save_top_k: -1
    verbose: true
    dirpath: saved_models
    filename: '{epoch}'
    _target_: hulc.utils.kl_callbacks.KLConstantSchedule
    _target_: hulc.datasets.utils.shared_memory_utils.SignalCallback
      _target_: hulc.datasets.disk_dataset.DiskDataset
      key: vis
      save_format: npz
      batch_size: 32
      min_window_size: 20
      max_window_size: 32
      proprio_state: ${datamodule.proprioception_dims}
      obs_space: ${datamodule.observation_space}
      pad: true
      lang_folder: lang_paraphrase-MiniLM-L3-v2
      num_workers: 2
      _target_: hulc.datasets.disk_dataset.DiskDataset
      key: lang
      save_format: npz
      batch_size: 32
      min_window_size: 20
      max_window_size: 32
      proprio_state: ${datamodule.proprioception_dims}
      obs_space: ${datamodule.observation_space}
      skip_frames: 1
      pad: true
      lang_folder: lang_paraphrase-MiniLM-L3-v2
      aux_lang_loss_window: 8
      num_workers: 2
      - _target_: torchvision.transforms.Resize
        size: 200
      - _target_: hulc.utils.transforms.RandomShiftsAug
        pad: 10
      - _target_: hulc.utils.transforms.ScaleImageTensor
      - _target_: torchvision.transforms.Normalize
        - 0.5
        - 0.5
      - _target_: torchvision.transforms.Resize
        size: 84
      - _target_: hulc.utils.transforms.RandomShiftsAug
        pad: 4
      - _target_: hulc.utils.transforms.ScaleImageTensor
      - _target_: torchvision.transforms.Normalize
        - 0.5
        - 0.5
      - _target_: torchvision.transforms.Resize
        size: 200
      - _target_: hulc.utils.transforms.AddDepthNoise
        - 1000.0
        - 1000.0
      - _target_: torchvision.transforms.Resize
        size: 84
      - _target_: hulc.utils.transforms.AddGaussianNoise
        - 0.0
        - 0.01
      - _target_: torchvision.transforms.Resize
        size: 70
      - _target_: torchvision.transforms.RandomCrop
        size: 64
      - _target_: hulc.utils.transforms.ScaleImageTensor
      - _target_: torchvision.transforms.Normalize
        - 0.5
        - 0.5
      - _target_: torchvision.transforms.Resize
        size: 64
      - _target_: torchvision.transforms.Normalize
        - 0.1
        - 0.2
      - _target_: hulc.utils.transforms.NormalizeVector
      - _target_: hulc.utils.transforms.NormalizeVector
      - _target_: torchvision.transforms.Resize
        size: 200
      - _target_: hulc.utils.transforms.ScaleImageTensor
      - _target_: torchvision.transforms.Normalize
        - 0.5
        - 0.5
      - _target_: torchvision.transforms.Resize
        size: 84
      - _target_: hulc.utils.transforms.ScaleImageTensor
      - _target_: torchvision.transforms.Normalize
        - 0.5
        - 0.5
      - _target_: torchvision.transforms.Resize
        size: 200
      - _target_: torchvision.transforms.Resize
        size: 84
      - _target_: torchvision.transforms.Resize
        size: 70
      - _target_: torchvision.transforms.RandomCrop
        size: 64
      - _target_: hulc.utils.transforms.ScaleImageTensor
      - _target_: torchvision.transforms.Normalize
        - 0.5
        - 0.5
      - _target_: torchvision.transforms.Resize
        size: 64
      - _target_: torchvision.transforms.Normalize
        - 0.1
        - 0.2
      - _target_: hulc.utils.transforms.NormalizeVector
      - _target_: hulc.utils.transforms.NormalizeVector
    n_state_obs: 8
    - - 0
      - 7
    - - 14
      - 15
    - 3
    - 6
    normalize: true
    normalize_robot_orientation: true
    - rgb_static
    - rgb_gripper
    depth_obs: []
    - robot_obs
    - rel_actions
    - language
  _target_: hulc.datasets.hulc_data_module.HulcDataModule
  _recursive_: false
  root_data_dir: /home/temp_store/calvin_data/task_D_D
  action_space: 7
  - 1.0
  - 1.0
  - 1.0
  - 1.0
  - 1.0
  - 1.0
  - 1.0
  - -1.0
  - -1.0
  - -1.0
  - -1.0
  - -1.0
  - -1.0
  - -1
  shuffle_val: false
      _target_: hulc.models.perceptual_encoders.vision_network.VisionNetwork
      input_width: 200
      input_height: 200
      activation_function: ReLU
      dropout_vis_fc: 0.0
      l2_normalize_output: false
      visual_features: 64
      num_c: 3
      use_sinusoid: false
      spatial_softmax_temp: 1.0
      _target_: hulc.models.perceptual_encoders.vision_network_gripper.VisionNetwork
      input_width: 84
      input_height: 84
      activation_function: ReLU
      dropout_vis_fc: 0.0
      l2_normalize_output: false
      visual_features: 64
      conv_encoder: nature_cnn
      num_c: 3
    depth_static: {}
    depth_gripper: {}
    proprio: {}
    tactile: {}
    _target_: hulc.models.perceptual_encoders.concat_encoders.ConcatEncoders
    _recursive_: false
    _target_: hulc.models.plan_encoders.plan_proposal_net.PlanProposalNetwork
    perceptual_features: ???
    latent_goal_features: ${model.visual_goal.latent_goal_features}
    plan_features: ???
    activation_function: ReLU
    hidden_size: 2048
    _target_: hulc.models.plan_encoders.plan_recognition_net.PlanRecognitionTransformersNetwork
    num_heads: 8
    num_layers: 2
    encoder_hidden_size: 2048
    fc_hidden_size: 4096
    in_features: ??
    plan_features: ???
    action_space: ${datamodule.action_space}
    dropout_p: 0.1
    encoder_normalize: false
    positional_normalize: false
    position_embedding: true
    max_position_embeddings: ${datamodule.datasets.lang_dataset.max_window_size}
    _target_: hulc.utils.distributions.Distribution
    dist: discrete
    category_size: 32
    class_size: 32
    _target_: hulc.models.encoders.goal_encoders.VisualGoalEncoder
    in_features: ???
    hidden_size: 2048
    latent_goal_features: 32
    l2_normalize_goal_embeddings: false
    activation_function: ReLU
    _target_: hulc.models.encoders.goal_encoders.LanguageGoalEncoder
    in_features: 384
    hidden_size: 2048
    latent_goal_features: 32
    l2_normalize_goal_embeddings: false
    activation_function: ReLU
    word_dropout_p: 0.0
    _target_: hulc.models.decoders.logistic_decoder_rnn.LogisticDecoderRNN
    n_mixtures: 10
    hidden_size: 2048
    out_features: ${datamodule.action_space}
    log_scale_min: -7.0
    act_max_bound: ${datamodule.action_max}
    act_min_bound: ${datamodule.action_min}
    dataset_dir: ${datamodule.root_data_dir}
    load_action_bounds: false
    num_classes: 10
    latent_goal_features: ${model.visual_goal.latent_goal_features}
    plan_features: ???
    perceptual_features: ???
    gripper_alpha: 1.0
    - 64
    - 128
    policy_rnn_dropout_p: 0.0
    num_layers: 2
    rnn_model: rnn_decoder
    gripper_control: true
    discrete_gripper: true
    _target_: torch.optim.Adam
    lr: ${}
    _target_: transformers.get_constant_schedule
  bc_z_lang_decoder: {}
  mia_lang_discriminator: {}
    _target_: hulc.models.auxiliary_loss_networks.proj_vis_lang.ProjVisLang
    im_dim: ${model.plan_recognition.fc_hidden_size}
    lang_dim: ${model.language_goal.latent_goal_features}
    output_dim: ${model.language_goal.latent_goal_features}
    proj_lang: true
    - take the red block and rotate it to the right
    - take the red block and rotate it to the left
    - take the blue block and rotate it to the right
    - take the blue block and rotate it to the left
    - take the pink block and rotate it to the right
    - take the pink block and rotate it to the left
    - go push the red block right
    - go push the red block left
    - go push the blue block right
    - go push the blue block left
    - go push the pink block right
    - go push the pink block left
    - push the sliding door to the left side
    - push the sliding door to the right side
    - pull the handle to open the drawer
    - push the handle to close the drawer
    - grasp and lift the red block
    - grasp and lift the blue block
    - grasp and lift the pink block
    - lift the red block from the sliding cabinet
    - lift the blue block from the sliding cabinet
    - lift the pink block from the sliding cabinet
    - Take the red block from the drawer
    - Take the blue block from the drawer
    - Take the pink block from the drawer
    - store the grasped block in the sliding cabinet
    - store the grasped block in the drawer
    - slide the block that it falls into the drawer
    - stack the grasped block
    - remove the stacked block
    - use the switch to turn on the light bulb
    - use the switch to turn off the light bulb
    - press the button to turn on the led light
    - press the button to turn off the led light
  _target_: hulc.models.gcbc.GCBC
  _recursive_: false
  kl_beta: ${loss.kl_beta}
  kl_balancing_mix: ${loss.kl_balancing_mix}
  state_recons: false
  state_recon_beta: ${loss.state_recon_beta}
  use_bc_z_auxiliary_loss: false
  bc_z_auxiliary_loss_beta: ${loss.bc_z_auxiliary_loss_beta}
  use_mia_auxiliary_loss: false
  mia_auxiliary_loss_beta: ${loss.mia_auxiliary_loss_beta}
  replan_freq: 30
  use_clip_auxiliary_loss: true
  clip_auxiliary_loss_beta: ${loss.clip_auxiliary_loss_beta}
  kl_beta: 0.01
  state_recon_beta: 0.5
  kl_balancing_mix: 0.8
  bc_z_auxiliary_loss_beta: 1.0
  mia_auxiliary_loss_beta: 1.0
  clip_auxiliary_loss_beta: 3.0
  lr: 0.0002
  gpus: 4
  precision: 16
  val_check_interval: 1.0
  max_epochs: 2
  sync_batchnorm: false
  limit_train_batches: 10
  limit_val_batches: 10
  _target_: pytorch_lightning.loggers.WandbLogger
  save_dir: .
  name: play_lmp
  group: play_lmp
  log_model: false
  project: calvin_vision
  id: ???
seed: 42
log_dir: ../
slurm: false
lukashermann commented 1 year ago

Just to make sure this is not the issue: how do you start your interactive slurm session? Did you reserve enough CPU cores? What does nproc print in the terminal?
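
For a quick check from inside Python, here is a minimal sketch (not part of hulc) that compares the machine's total core count with the cores the current process is actually allowed to use; inside a scheduler allocation the second number can be smaller than the first. `os.sched_getaffinity` only exists on Linux, so the sketch falls back to `os.cpu_count()` elsewhere. The function name `available_cpu_cores` is made up for illustration.

```python
import os


def available_cpu_cores() -> int:
    """Return the number of CPU cores this process is allowed to run on."""
    if hasattr(os, "sched_getaffinity"):
        # Linux: honours the CPU affinity mask (e.g. set by taskset or a scheduler).
        return len(os.sched_getaffinity(0))
    # Other platforms: fall back to the total core count.
    return os.cpu_count() or 1


if __name__ == "__main__":
    print("total cores: ", os.cpu_count())
    print("usable cores:", available_cpu_cores())
```

If the usable count is much lower than the total, the dataloader workers are probably competing for a handful of cores.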

mbreuss commented 1 year ago

Right now, I am using the cluster like a normal PC with 4 GPUs and running the training script directly with Python, without slurm, since I am the only one using it. I have also started jobs with the slurm variant, but I had the same problems. nproc returns 48. When I tried to increase the number of workers, the training crashed. I think the limiting factor here could be the 128 GB of RAM. During training, all GPUs are detected and in use; however, their average load is usually around 10% at most, and the percentage of used memory is similar.

lukashermann commented 1 year ago

How fast does it run if you only use 1 or 2 GPUs for example?

mbreuss commented 1 year ago

Similar speed. I am also testing some models using the tacorl dataset and code, and there the average training time for an epoch is around 1h. I know that dataset is a bit smaller and has no language annotations, but it seems to work fine there.

hk-zh commented 1 year ago

I also get the KeyError. I have already deleted train_shm_lookup.npy and val_shm_lookup.npy, as well as /dev/shm/train_* and /dev/shm/val_*, but I still get the error.
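
For anyone repeating this cleanup, a minimal sketch of it in Python follows. It is not part of hulc: the function name `remove_stale_shm` is made up, and the default `lookup_dir="/tmp"` is an assumption, so point it at whichever tmp or slurm_temp directory your run actually uses. Only the file names are taken from this thread.

```python
from pathlib import Path


def remove_stale_shm(shm_dir: str = "/dev/shm", lookup_dir: str = "/tmp") -> list:
    """Delete leftover shared-memory files and lookup tables; return the removed paths."""
    removed = []
    # Shared-memory segments named train_* and val_* (names as given in this thread).
    for pattern in ("train_*", "val_*"):
        for path in Path(shm_dir).glob(pattern):
            if path.is_file():
                path.unlink()
                removed.append(str(path))
    # Lookup tables; lookup_dir is an assumption, adjust it to your setup.
    for name in ("train_shm_lookup.npy", "val_shm_lookup.npy"):
        path = Path(lookup_dir) / name
        if path.is_file():
            path.unlink()
            removed.append(str(path))
    return removed
```

Calling `remove_stale_shm()` with no training job running returns the list of deleted files, which makes it easy to see whether anything stale was left over.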

P.S. I was using the ABC dataset for training.

Update: I found the problem. Since the ABC dataset is very large, the mounted /dev/shm is not big enough to hold all episodes. One solution is to enlarge /dev/shm with a remount command, provided the server has enough memory (at least 200 GB).
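
A rough way to spot this before training is to compare the dataset size with the free space in /dev/shm. The sketch below is only an approximation and not part of hulc: both helper names are hypothetical, and the on-disk size is just a proxy, since compressed .npz files can take more room once loaded.

```python
import shutil
from pathlib import Path


def dataset_size_bytes(root: str) -> int:
    """Sum the sizes of all regular files below root."""
    return sum(p.stat().st_size for p in Path(root).rglob("*") if p.is_file())


def shm_has_room(required_bytes: int, shm_dir: str = "/dev/shm") -> bool:
    """Return True if the filesystem at shm_dir has at least required_bytes free."""
    return shutil.disk_usage(shm_dir).free >= required_bytes
```

For example, `shm_has_room(dataset_size_bytes(root_data_dir))` returning False suggests that /dev/shm needs to be enlarged first. On Linux this is typically done as root with something like `mount -o remount,size=200G /dev/shm`, with the size adjusted to the RAM actually available.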

mbreuss commented 1 year ago

Thanks for pointing that out. My memory is also small, so I will check whether I have the same problem.