microsoft / human-pose-estimation.pytorch

This project is an official implementation of our ECCV 2018 paper "Simple Baselines for Human Pose Estimation and Tracking" (https://arxiv.org/abs/1804.06208).
MIT License

out of memory when train and valid #107

Closed YuZijian closed 5 years ago

YuZijian commented 5 years ago

I hit an out-of-memory (OOM) error during both training and validation. I decreased the batch size, but it still happens; the problem seems to occur when loading the pretrained model. The log from validating the MPII dataset with the ResNet-50 model is below. I can't find where the problem is.

=> creating output/mpii/pose_resnet_50/256x256_d256x3_adam_lr1e-3
=> creating log/mpii/pose_resnet_50/256x256_d256x3_adam_lr1e-3_2019-04-11-14-26
Namespace(cfg='experiments/mpii/resnet50/256x256_d256x3_adam_lr1e-3.yaml', coco_bbox_file=None, flip_test=True, frequent=100, gpus=None, model_file='models/pytorch/pose_mpii/pose_resnet_50_256x256.pth.tar', post_process=False, shift_heatmap=False, use_detect_bbox=False, workers=None)
{'CUDNN': {'BENCHMARK': True, 'DETERMINISTIC': False, 'ENABLED': True},
 'DATASET': {'DATASET': 'mpii', 'DATA_FORMAT': 'jpg', 'FLIP': True, 'HYBRID_JOINTS_TYPE': '', 'ROOT': 'data/mpii/', 'ROT_FACTOR': 30, 'SCALE_FACTOR': 0.25, 'SELECT_DATA': False, 'TEST_SET': 'valid', 'TRAIN_SET': 'train'},
 'DATA_DIR': '',
 'DEBUG': {'DEBUG': False, 'SAVE_BATCH_IMAGES_GT': True, 'SAVE_BATCH_IMAGES_PRED': True, 'SAVE_HEATMAPS_GT': True, 'SAVE_HEATMAPS_PRED': True},
 'GPUS': '8',
 'LOG_DIR': 'log',
 'LOSS': {'USE_TARGET_WEIGHT': True},
 'MODEL': {'EXTRA': {'DECONV_WITH_BIAS': False, 'FINAL_CONV_KERNEL': 1, 'HEATMAP_SIZE': array([64, 64]), 'NUM_DECONV_FILTERS': [256, 256, 256], 'NUM_DECONV_KERNELS': [4, 4, 4], 'NUM_DECONV_LAYERS': 3, 'NUM_LAYERS': 50, 'SIGMA': 2, 'TARGET_TYPE': 'gaussian'}, 'IMAGE_SIZE': array([256, 256]), 'INIT_WEIGHTS': True, 'NAME': 'pose_resnet', 'NUM_JOINTS': 16, 'PRETRAINED': 'models/pytorch/imagenet/resnet50-19c8e357.pth', 'STYLE': 'pytorch'},
 'OUTPUT_DIR': 'output',
 'PRINT_FREQ': 100,
 'TEST': {'BATCH_SIZE': 32, 'BBOX_THRE': 1.0, 'COCO_BBOX_FILE': '', 'FLIP_TEST': True, 'IMAGE_THRE': 0.0, 'IN_VIS_THRE': 0.0, 'MODEL_FILE': 'models/pytorch/pose_mpii/pose_resnet_50_256x256.pth.tar', 'NMS_THRE': 1.0, 'OKS_THRE': 0.5, 'POST_PROCESS': True, 'SHIFT_HEATMAP': True, 'USE_GT_BBOX': False},
 'TRAIN': {'BATCH_SIZE': 32, 'BEGIN_EPOCH': 0, 'CHECKPOINT': '', 'END_EPOCH': 140, 'GAMMA1': 0.99, 'GAMMA2': 0.0, 'LR': 0.001, 'LR_FACTOR': 0.1, 'LR_STEP': [90, 120], 'MOMENTUM': 0.9, 'NESTEROV': False, 'OPTIMIZER': 'adam', 'RESUME': False, 'SHUFFLE': True, 'WD': 0.0001},
 'WORKERS': 4}
=> loading model from models/pytorch/pose_mpii/pose_resnet_50_256x256.pth.tar
Traceback (most recent call last):
  File "pose_estimation/valid.py", line 165, in <module>
    main()
  File "pose_estimation/valid.py", line 123, in main
    model.load_state_dict(torch.load(config.TEST.MODEL_FILE))
  File "/home1/yuzijian19/voice/lib/python3.6/site-packages/torch/serialization.py", line 367, in load
    return _load(f, map_location, pickle_module)
  File "/home1/yuzijian19/voice/lib/python3.6/site-packages/torch/serialization.py", line 538, in _load
    result = unpickler.load()
  File "/home1/yuzijian19/voice/lib/python3.6/site-packages/torch/serialization.py", line 504, in persistent_load
    data_type(size), location)
  File "/home1/yuzijian19/voice/lib/python3.6/site-packages/torch/serialization.py", line 113, in default_restore_location
    result = fn(storage, location)
  File "/home1/yuzijian19/voice/lib/python3.6/site-packages/torch/serialization.py", line 95, in _cuda_deserialize
    return obj.cuda(device)
  File "/home1/yuzijian19/voice/lib/python3.6/site-packages/torch/_utils.py", line 76, in _cuda
    return new_type(self.size()).copy_(self, non_blocking)
  File "/home1/yuzijian19/voice/lib/python3.6/site-packages/torch/cuda/__init__.py", line 496, in _lazy_new
    return super(_CudaBase, cls).__new__(cls, *args, **kwargs)
RuntimeError: CUDA error: out of memory
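For reference, the traceback shows the OOM is raised inside torch.load while the checkpoint's CUDA tensors are being restored onto the GPU they were saved from. A minimal, hedged sketch of a common workaround (not the fix this thread ends up using) is to map the checkpoint to CPU first; `model` and `config.TEST.MODEL_FILE` are the names from valid.py as shown in the log above:

```python
import torch

# Sketch of a generic workaround, assuming `model` and `config` are already built
# as in valid.py: map the saved CUDA tensors to CPU so torch.load does not try to
# allocate them on a GPU that may already be full or busy.
state_dict = torch.load(config.TEST.MODEL_FILE, map_location='cpu')
model.load_state_dict(state_dict)
model = model.cuda()  # move the model to the GPU afterwards if evaluation runs there
```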

YuZijian commented 5 years ago

My PyTorch version was 1.0, so I had skipped step 2 of the installation instructions: disable cudnn for batch_norm. After switching to PyTorch 0.4.0 and applying that step, the problem was solved. Please close this issue.
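For context, the repo's "disable cudnn for batch_norm" install step patches the installed PyTorch 0.4.x source so that only batch_norm bypasses cuDNN. The snippet below is a hedged, coarser runtime alternative using only public API; it disables cuDNN for everything, not just batch_norm, so expect slower convolutions:

```python
import torch

# Hedged sketch: turn cuDNN off globally at runtime. This avoids the cuDNN
# batch-norm code path that the repo's install step targets, at the cost of
# running all cuDNN-accelerated ops (e.g. convolutions) without cuDNN.
torch.backends.cudnn.enabled = False
```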