**Closed** · amituofo1996 closed this issue 1 year ago
Hi, when I reproduce `repr_table6_h36m` I face the following problem. My machine has 55 GB of memory, and I disabled vertex prediction because it uses too much memory.
```
InstaVariety number of dataset objects 130431
MPII3D Dataset overlap ratio: 0
Loaded mpii3d dataset from data/preprocessed_data/mpii3d_train_scale1_db.pt
is_train: True
mpii3d - number of dataset objects 59934
Human36M Dataset overlap ratio: 0
Loaded h36m dataset from data/preprocessed_data/h36m_train_25fps_tight_db.pt
is_train: True
h36m - number of dataset objects 48456
Human36M Dataset overlap ratio: 0.9375
Loaded h36m dataset from data/preprocessed_data/h36m_test_front_25fps_tight_db.pt
is_train: False
h36m - number of dataset objects 68416
=> loaded pretrained model from 'data/base_data/spin_model_checkpoint.pth.tar'
=> no checkpoint found at ''
Epoch 1/45
(500/500) | Total: 0:03:51 | ETA: 0:00:01 | loss: 9.57 | 2d: 3.23 | 3d: 4.53 | loss_kp_2d: 1.615 | loss_kp_3d: 0.990 | loss_shape: 0.022 | loss_pose: 0.682 | data: 1.17 | forward: 0.03 | loss: 0.01 | backward: 0.07 | batch: 1.28
(2138/2138) | batch: 1.201e+03ms | Total: 0:02:00 | ETA: 0:00:01
Evaluating on 68416 number of poses...
Learning rate 5e-05
Learning rate 0.0001
Epoch 0, MPJPE: 84.9076, PA-MPJPE: 57.8166, ACCEL: 2.5355, ACCEL_ERR: 3.3904,
Epoch 1 performance: 57.8166
Best performance achived, saving it!
Epoch 2/45
(500/500) | Total: 0:03:59 | ETA: 0:00:01 | loss: 3.98 | 2d: 2.00 | 3d: 1.38 | loss_kp_2d: 2.273 | loss_kp_3d: 1.975 | loss_shape: 0.020 | loss_pose: 0.572 | data: 0.56 | forward: 0.03 | loss: 0.01 | backward: 0.08 | batch: 0.68
(2138/2138) | batch: 1.197e+03ms | Total: 0:01:59 | ETA: 0:00:01
Evaluating on 68416 number of poses...
Epoch 1, MPJPE: 77.5908, PA-MPJPE: 49.9525, ACCEL: 2.5912, ACCEL_ERR: 3.3052,
Epoch 2 performance: 49.9525
Learning rate 5e-05
Learning rate 0.0001
Best performance achived, saving it!
Epoch 3/45
(500/500) | Total: 0:03:56 | ETA: 0:00:01 | loss: 3.34 | 2d: 1.70 | 3d: 1.18 | loss_kp_2d: 3.205 | loss_kp_3d: 1.627 | loss_shape: 0.018 | loss_pose: 0.582 | data: 0.01 | forward: 0.03 | loss: 0.01 | backward: 0.08 | batch: 0.12
(2138/2138) | batch: 1.238e+03ms | Total: 0:02:03 | ETA: 0:00:01
Evaluating on 68416 number of poses...
Epoch 2, MPJPE: 72.0071, PA-MPJPE: 48.5966, ACCEL: 2.8021, ACCEL_ERR: 3.3457,
Epoch 3 performance: 48.5966
Learning rate 5e-05
Learning rate 0.0001
Best performance achived, saving it!
Epoch 4/45
(500/500) | Total: 0:03:48 | ETA: 0:00:01 | loss: 3.12 | 2d: 1.69 | 3d: 1.06 | loss_kp_2d: 1.572 | loss_kp_3d: 2.059 | loss_shape: 0.017 | loss_pose: 0.360 | data: 0.01 | forward: 0.03 | loss: 0.01 | backward: 0.07 | batch: 0.12
(2138/2138) | batch: 1.199e+03ms | Total: 0:01:59 | ETA: 0:00:01
Evaluating on 68416 number of poses...
Epoch 3, MPJPE: 70.6277, PA-MPJPE: 47.0223, ACCEL: 2.6875, ACCEL_ERR: 3.2729,
Epoch 4 performance: 47.0223
Learning rate 5e-05
Learning rate 0.0001
Best performance achived, saving it!
Epoch 5/45
(500/500) | Total: 0:03:51 | ETA: 0:00:01 | loss: 2.92 | 2d: 1.61 | 3d: 0.98 | loss_kp_2d: 1.649 | loss_kp_3d: 1.081 | loss_shape: 0.012 | loss_pose: 0.321 | data: 0.50 | forward: 0.03 | loss: 0.01 | backward: 0.07 | batch: 0.61
Traceback (most recent call last):
  File "/media/xf/F/code/TCMR_RELEASE-master/train.py", line 141, in <module>
    main(cfg)
  File "/media/xf/F/code/TCMR_RELEASE-master/train.py", line 131, in main
    debug_freq=cfg.DEBUG_FREQ,
  File "/media/xf/F/code/TCMR_RELEASE-master/lib/core/trainer.py", line 343, in fit
    self.validate()
  File "/media/xf/F/code/TCMR_RELEASE-master/lib/core/trainer.py", line 291, in validate
    for i, target in enumerate(self.valid_loader):
  File "/home/xf/miniconda3/envs/tcmr/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 279, in __iter__
    return _MultiProcessingDataLoaderIter(self)
  File "/home/xf/miniconda3/envs/tcmr/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 719, in __init__
    w.start()
  File "/home/xf/miniconda3/envs/tcmr/lib/python3.7/multiprocessing/process.py", line 112, in start
    self._popen = self._Popen(self)
  File "/home/xf/miniconda3/envs/tcmr/lib/python3.7/multiprocessing/context.py", line 223, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/home/xf/miniconda3/envs/tcmr/lib/python3.7/multiprocessing/context.py", line 277, in _Popen
    return Popen(process_obj)
  File "/home/xf/miniconda3/envs/tcmr/lib/python3.7/multiprocessing/popen_fork.py", line 20, in __init__
    self._launch(process_obj)
  File "/home/xf/miniconda3/envs/tcmr/lib/python3.7/multiprocessing/popen_fork.py", line 70, in _launch
    self.pid = os.fork()
OSError: [Errno 12] Cannot allocate memory
```
Could you give me some suggestions? Thanks!
Uhm.. Did you try discarding some data as I suggested here? https://github.com/hongsukchoi/TCMR_RELEASE/issues/29
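For anyone hitting the same wall: "discarding some data" can be done by subsampling the preprocessed db before the dataset is built. This is a minimal sketch, not TCMR code — `subsample_db` is a hypothetical helper, and it assumes the db is a dict whose values are equal-length per-frame arrays (as in the `*_db.pt` files):

```python
import numpy as np

def subsample_db(db, keep_every=2):
    """Keep only every `keep_every`-th frame of each array in a db dict.

    Assumes all values are arrays with the same length along axis 0,
    as in TCMR's preprocessed *_db.pt dictionaries.
    """
    n = len(next(iter(db.values())))          # number of frames
    idx = np.arange(0, n, keep_every)          # indices of frames to keep
    return {k: np.asarray(v)[idx] for k, v in db.items()}

# Toy db standing in for a loaded *_db.pt dict (shapes are illustrative).
db = {
    "joints3D": np.zeros((100, 49, 3)),
    "features": np.zeros((100, 2048)),
}
small = subsample_db(db, keep_every=2)  # halves the in-memory footprint
```

The saved db would then be reloaded (or patched in the dataset class) with the smaller arrays, trading some training data for RAM.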
I reduced num_workers from 16 to 4 and got a similar result.
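The traceback fails in `os.fork()`: each DataLoader worker is a forked copy of the parent, so with large datasets held in memory even 4 workers can push the process over the limit. Setting `num_workers=0` avoids forking entirely. A minimal sketch, using a stand-in `TensorDataset` rather than the real TCMR validation dataset:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in for the validation dataset; the real one is built in
# lib/core/trainer.py and holds the preprocessed db in memory.
dataset = TensorDataset(torch.randn(64, 3))

valid_loader = DataLoader(
    dataset,
    batch_size=8,
    num_workers=0,    # 0 = load in the main process: no fork, no extra RAM copies
    pin_memory=False,
)

for (batch,) in valid_loader:  # iterates 64/8 = 8 batches
    pass
```

Loading in the main process is slower per batch, but for validation-only iteration it is usually an acceptable trade against running out of memory.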