open-mmlab / mmskeleton

An OpenMMLab toolbox for human pose estimation, skeleton-based action recognition, and action synthesis.
Apache License 2.0

getting memory limit exceeded #108

Open neerajbattan opened 5 years ago

neerajbattan commented 5 years ago

I'm getting a memory-limit-exceeded error even after allocating 125 GB for training on ntu_xview.

Getting this error

[11.14.18|20:57:54] Parameters: {'print_log': True, 'log_interval': 100, 'step': [10, 50], 'save_result': False, 'test_feeder_args': {'data_path': '/scratch/neeraj.b/data/NTU-RGB-D/xview/val_data.npy', 'label_path': '/scratch/neeraj.b/data/NTU-RGB-D/xview/val_label.pkl'}, 'optimizer': 'SGD', 'start_epoch': 0, 'batch_size': 64, 'phase': 'train', 'base_lr': 0.1, 'num_worker': 4, 'debug': False, 'weights': None, 'save_interval': 10, 'pavi_log': False, 'ignore_weights': [], 'save_log': True, 'num_epoch': 80, 'use_gpu': True, 'weight_decay': 0.0001, 'test_batch_size': 64, 'device': [0, 1, 2, 3], 'nesterov': True, 'feeder': 'feeder.feeder.Feeder', 'train_feeder_args': {'data_path': '/scratch/neeraj.b/data/NTU-RGB-D/xview/train_data.npy', 'label_path': '/scratch/neeraj.b/data/NTU-RGB-D/xview/train_label.pkl', 'debug': False}, 'eval_interval': 5, 'config': 'config/st_gcn/ntu-xview/train.yaml', 'work_dir': '/scratch/neeraj.b/data/work_dir', 'model_args': {'num_class': 60, 'graph_args': {'strategy': 'spatial', 'layout': 'ntu-rgb+d'}, 'edge_importance_weighting': True, 'dropout': 0.5, 'in_channels': 3}, 'show_topk': [1, 5], 'model': 'net.st_gcn.Model'}

[11.14.18|20:57:54] Training epoch: 0
slurmstepd: Job 157430 exceeded memory limit (131272228 > 131072000), being killed
slurmstepd: Exceeded job memory limit

yysijie commented 5 years ago

Hi, did you use the latest version? It only needs 6 GB of memory.
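
For reference, one common way to keep host memory that low is to memory-map the large NTU `.npy` files instead of loading them fully into RAM. Below is a minimal sketch of such a feeder; the `Feeder` class name and the `data_path`/`label_path` arguments mirror the `feeder.feeder.Feeder` entries in the logged config above, but the body is illustrative, not the project's exact implementation.

```python
# Minimal sketch of a memory-mapped skeleton feeder (illustrative only).
import pickle
import numpy as np
import torch

class Feeder(torch.utils.data.Dataset):
    def __init__(self, data_path, label_path):
        # mmap_mode='r' keeps the (N, C, T, V, M) array on disk and reads
        # only the samples that are actually indexed, instead of pulling
        # the whole train_data.npy into RAM.
        self.data = np.load(data_path, mmap_mode='r')
        # Assumed label format: a pickled (sample_name, label) pair.
        with open(label_path, 'rb') as f:
            self.sample_name, self.label = pickle.load(f)

    def __len__(self):
        return len(self.label)

    def __getitem__(self, index):
        # Copy one sample into regular memory before handing it to PyTorch.
        data = np.array(self.data[index])
        return data, self.label[index]
```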

Sureshthommandru commented 5 years ago

(base) F:\Suresh\st-gcn>python main1.py recognition -c config/st_gcn/ntu-xsub/train.yaml --device 0 --work_dir ./work_dir
C:\Users\cudalab10\Anaconda3\lib\site-packages\torch\cuda\__init__.py:117: UserWarning: Found GPU0 TITAN Xp which is of cuda capability 1.1. PyTorch no longer supports this GPU because it is too old.

warnings.warn(old_gpu_warn % (d, name, major, capability[1]))
[05.22.19|12:02:41] Parameters: {'base_lr': 0.1, 'ignore_weights': [], 'model': 'net.st_gcn.Model', 'eval_interval': 5, 'weight_decay': 0.0001, 'work_dir': './work_dir', 'save_interval': 10, 'model_args': {'in_channels': 3, 'dropout': 0.5, 'num_class': 60, 'edge_importance_weighting': True, 'graph_args': {'strategy': 'spatial', 'layout': 'ntu-rgb+d'}}, 'debug': False, 'pavi_log': False, 'save_result': False, 'config': 'config/st_gcn/ntu-xsub/train.yaml', 'optimizer': 'SGD', 'weights': None, 'num_epoch': 80, 'batch_size': 64, 'show_topk': [1, 5], 'test_batch_size': 64, 'step': [10, 50], 'use_gpu': True, 'phase': 'train', 'print_log': True, 'log_interval': 100, 'feeder': 'feeder.feeder.Feeder', 'start_epoch': 0, 'nesterov': True, 'device': [0], 'save_log': True, 'test_feeder_args': {'data_path': './data/NTU-RGB-D/xsub/val_data.npy', 'label_path': './data/NTU-RGB-D/xsub/val_label.pkl'}, 'train_feeder_args': {'data_path': './data/NTU-RGB-D/xsub/train_data.npy', 'debug': False, 'label_path': './data/NTU-RGB-D/xsub/train_label.pkl'}, 'num_worker': 4}

[05.22.19|12:02:41] Training epoch: 0
Traceback (most recent call last):
  File "main1.py", line 31, in <module>
    p.start()
  File "F:\Suresh\st-gcn\processor\processor.py", line 113, in start
    self.train()
  File "F:\Suresh\st-gcn\processor\recognition.py", line 91, in train
    output = self.model(data)
  File "C:\Users\cudalab10\Anaconda3\lib\site-packages\torch\nn\modules\module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "F:\Suresh\st-gcn\net\st_gcn.py", line 82, in forward
    x, _ = gcn(x, self.A * importance)
  File "C:\Users\cudalab10\Anaconda3\lib\site-packages\torch\nn\modules\module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "F:\Suresh\st-gcn\net\st_gcn.py", line 194, in forward
    x, A = self.gcn(x, A)
  File "C:\Users\cudalab10\Anaconda3\lib\site-packages\torch\nn\modules\module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "F:\Suresh\st-gcn\net\utils\tgcn.py", line 60, in forward
    x = self.conv(x)
  File "C:\Users\cudalab10\Anaconda3\lib\site-packages\torch\nn\modules\module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "C:\Users\cudalab10\Anaconda3\lib\site-packages\torch\nn\modules\conv.py", line 320, in forward
    self.padding, self.dilation, self.groups)
RuntimeError: CUDA out of memory. Tried to allocate 1.37 GiB (GPU 0; 12.00 GiB total capacity; 8.28 GiB already allocated; 652.75 MiB free; 664.38 MiB cached)

Here is the Task Manager screenshot: [image]
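
This second error is GPU memory, not host memory: a 12 GiB TITAN Xp runs out with batch_size 64, so the usual workaround is to lower batch_size and test_batch_size in config/st_gcn/ntu-xsub/train.yaml until a forward/backward pass fits. Below is a rough sketch for measuring peak GPU memory at a few candidate batch sizes before launching a full run; it assumes the st-gcn layout (net.st_gcn.Model, constructed with the model_args logged above), the standard NTU input shape (3 channels, 300 frames, 25 joints, 2 bodies), and a PyTorch recent enough to have torch.cuda.reset_peak_memory_stats.

```python
# Rough smoke test (illustrative): peak GPU memory of one training step
# at different batch sizes, to pick one that fits a 12 GiB card.
import torch
from net.st_gcn import Model  # assumed import path from the st-gcn repo

def peak_memory_gib(batch_size, device='cuda:0'):
    model = Model(in_channels=3, num_class=60, dropout=0.5,
                  edge_importance_weighting=True,
                  graph_args={'layout': 'ntu-rgb+d', 'strategy': 'spatial'}).to(device)
    # Dummy NTU-shaped batch: (N, C=3, T=300, V=25, M=2).
    x = torch.randn(batch_size, 3, 300, 25, 2, device=device)
    torch.cuda.reset_peak_memory_stats(device)
    out = model(x)
    out.sum().backward()  # include gradients, as during training
    return torch.cuda.max_memory_allocated(device) / 1024 ** 3

for bs in (8, 16, 32, 64):
    print(bs, round(peak_memory_gib(bs), 2), 'GiB')
```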