microsoft / multiview-human-pose-estimation-pytorch

This is an official Pytorch implementation of "Cross View Fusion for 3D Human Pose Estimation, ICCV 2019".
MIT License

Process killed during 2D validation #31

Closed rhljajodia closed 4 years ago

rhljajodia commented 4 years ago

Hello,

I am using the pretrained 320_fused model from #14 to validate S9 and S11 data, generated using the H36M-Toolbox.

During validation, the program runs fine until the last iteration, where it is abruptly killed. Here is the output:

```
python run/pose2d/valid.py --cfg experiments-local/mixed/resnet50/320_fusion.yaml
/home/rjajodia/crossview/run/pose2d/../../lib/core/config.py:197: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
  exp_config = edict(yaml.load(f))
=> creating output/mixed/multiview_pose_resnet_50/320_fusion
=> creating log/mixed/multiview_pose_resnet_50/320_fusion2020-10-04-22-57
Namespace(cfg='experiments-local/mixed/resnet50/320_fusion.yaml', dataDir='', data_format='', flip_test=False, frequent=100, gpus=None, logDir='', modelDir='', model_file=None, post_process=False, shift_heatmap=False, state='best', workers=None)
{'BACKBONE_MODEL': 'pose_resnet',
 'CUDNN': {'BENCHMARK': True, 'DETERMINISTIC': False, 'ENABLED': True},
 'DATASET': {'BBOX': 2000, 'CROP': True, 'DATA_FORMAT': 'zip', 'ROOT': 'data/', 'ROOTIDX': 0, 'ROT_FACTOR': 0, 'SCALE_FACTOR': 0, 'TEST_DATASET': 'multiview_h36m', 'TEST_SUBSET': 'validation', 'TRAIN_DATASET': 'mixed', 'TRAIN_SUBSET': 'train'},
 'DATA_DIR': '',
 'DEBUG': {'DEBUG': True, 'SAVE_BATCH_IMAGES_GT': True, 'SAVE_BATCH_IMAGES_PRED': True, 'SAVE_HEATMAPS_GT': True, 'SAVE_HEATMAPS_PRED': True},
 'GPUS': '0',
 'LOG_DIR': 'log',
 'LOSS': {'USE_TARGET_WEIGHT': True},
 'MODEL': 'multiview_pose_resnet',
 'MODEL_EXTRA': {'FINAL_CONV_KERNEL': 1, 'PRETRAINED_LAYERS': ['conv1', 'bn1', 'conv2', 'bn2', 'layer1', 'transition1', 'stage2', 'transition2', 'stage3', 'transition3', 'stage4'], 'STAGE2': {'BLOCK': 'BASIC', 'FUSE_METHOD': 'SUM', 'NUM_BLOCKS': [4, 4], 'NUM_BRANCHES': 2, 'NUM_CHANNELS': [48, 96], 'NUM_MODULES': 1}, 'STAGE3': {'BLOCK': 'BASIC', 'FUSE_METHOD': 'SUM', 'NUM_BLOCKS': [4, 4, 4], 'NUM_BRANCHES': 3, 'NUM_CHANNELS': [48, 96, 192], 'NUM_MODULES': 4}, 'STAGE4': {'BLOCK': 'BASIC', 'FUSE_METHOD': 'SUM', 'NUM_BLOCKS': [4, 4, 4, 4], 'NUM_BRANCHES': 4, 'NUM_CHANNELS': [48, 96, 192, 384], 'NUM_MODULES': 3}},
 'NETWORK': {'AGGRE': True, 'HEATMAP_SIZE': array([80, 80]), 'IMAGE_SIZE': array([320, 320]), 'NUM_JOINTS': 20, 'PRETRAINED': 'models/pytorch/imagenet/resnet50-19c8e357.pth', 'SIGMA': 3, 'TARGET_TYPE': 'gaussian'},
 'OUTPUT_DIR': 'output',
 'PICT_STRUCT': {'DEBUG': False, 'FIRST_NBINS': 16, 'GRID_SIZE': 2000, 'LIMB_LENGTH_TOLERANCE': 150, 'PAIRWISE_FILE': 'data/pict/pairwise.pkl', 'RECUR_DEPTH': 10, 'RECUR_NBINS': 2, 'SHOW_CROPIMG': False, 'SHOW_HEATIMG': False, 'SHOW_ORIIMG': False, 'TEST_PAIRWISE': False},
 'POSE_RESNET': {'DECONV_WITH_BIAS': False, 'FINAL_CONV_KERNEL': 1, 'NUM_DECONV_FILTERS': [256, 256, 256], 'NUM_DECONV_KERNELS': [4, 4, 4], 'NUM_DECONV_LAYERS': 3, 'NUM_LAYERS': 50},
 'PRINT_FREQ': 100,
 'TEST': {'BATCH_SIZE': 2, 'BBOX_FILE': '', 'BBOX_THRE': 1.0, 'DETECTOR': 'fpn_dcn', 'DETECTOR_DIR': '', 'HEATMAP_LOCATION_FILE': 'predicted_heatmaps.h5', 'IMAGE_THRE': 0.1, 'IN_VIS_THRE': 0.0, 'MATCH_IOU_THRE': 0.3, 'MODEL_FILE': '', 'NMS_THRE': 0.6, 'OKS_THRE': 0.5, 'POST_PROCESS': False, 'SHIFT_HEATMAP': False, 'STATE': 'best', 'USE_GT_BBOX': True},
 'TRAIN': {'BATCH_SIZE': 2, 'BEGIN_EPOCH': 0, 'END_EPOCH': 30, 'GAMMA1': 0.99, 'GAMMA2': 0.0, 'LR': 0.001, 'LR_FACTOR': 0.1, 'LR_STEP': [20, 25], 'MOMENTUM': 0.9, 'NESTEROV': False, 'OPTIMIZER': 'adam', 'RESUME': True, 'SHUFFLE': True, 'WD': 0.0001},
 'WORKERS': 4}
=> loading model from output/mixed/multiview_pose_resnet_50/320_fusion/model_best.pth.tar
/home/rjajodia/.conda/envs/crossview/lib/python3.8/site-packages/torch/nn/_reduction.py:44: UserWarning: size_average and reduce args will be deprecated, please use reduction='mean' instead.
  warnings.warn(warning.format(ret))
Test: [0/1076]    Time 7.371 (7.371)  Loss 0.0431 (0.0431)  Accuracy 1.000 (1.000)
Test: [100/1076]  Time 0.090 (0.164)  Loss 0.0753 (0.0546)  Accuracy 1.000 (0.994)
Test: [200/1076]  Time 0.091 (0.128)  Loss 0.7771 (0.0897)  Accuracy 0.007 (0.941)
Test: [300/1076]  Time 0.092 (0.117)  Loss 0.0597 (0.0773)  Accuracy 0.993 (0.958)
Test: [400/1076]  Time 0.093 (0.111)  Loss 0.0414 (0.0977)  Accuracy 1.000 (0.935)
Test: [500/1076]  Time 0.092 (0.108)  Loss 0.7098 (0.1056)  Accuracy 0.000 (0.924)
Test: [600/1076]  Time 0.092 (0.105)  Loss 0.0442 (0.0975)  Accuracy 1.000 (0.934)
Test: [700/1076]  Time 0.094 (0.104)  Loss 0.0347 (0.0883)  Accuracy 1.000 (0.943)
Test: [800/1076]  Time 0.093 (0.103)  Loss 0.0726 (0.0822)  Accuracy 0.993 (0.950)
Test: [900/1076]  Time 0.093 (0.102)  Loss 0.0845 (0.0785)  Accuracy 0.978 (0.955)
Test: [1000/1076] Time 0.094 (0.102)  Loss 0.0362 (0.0752)  Accuracy 1.000 (0.959)
Killed
```

I tried to look at the logs, but because the process gets killed, no logs are written. Any ideas?

Thank you.

rhljajodia commented 4 years ago

Hello. This issue is fixed. The problem was that the all_heatmaps variable in lib/core/function.py grew too large for my RAM (16 GB), so the h5py write operation (line 228) was failing. I changed the code to write the heatmaps to disk in chunks as each batch is processed, instead of all at once at the end, which let me drop the all_heatmaps variable entirely. Note that the all_preds variable is used by the logger, which is why I did not remove it; it could also be deleted to save more RAM if needed. After the fix, full validation uses about 14 GB of RAM.
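For anyone hitting the same OOM kill, here is a minimal sketch of the chunked-write idea using a resizable HDF5 dataset. This is not the repo's actual code: the function name `write_heatmaps_in_chunks`, the dataset name `"heatmaps"`, and the per-view heatmap shape (20 joints, 80x80, per the config above) are assumptions for illustration; in practice you would call the append step inside the validation loop in lib/core/function.py instead of accumulating into all_heatmaps.

```python
import h5py
import numpy as np

def write_heatmaps_in_chunks(batches, out_file, heatmap_shape=(20, 80, 80)):
    """Append each batch of predicted heatmaps to an HDF5 file as it arrives,
    so only one batch ever lives in RAM (hypothetical helper, not repo code)."""
    with h5py.File(out_file, "w") as f:
        dset = f.create_dataset(
            "heatmaps",
            shape=(0, *heatmap_shape),
            maxshape=(None, *heatmap_shape),  # resizable along the sample axis
            dtype="float32",
            chunks=(1, *heatmap_shape),       # one sample per HDF5 chunk
        )
        n = 0
        for batch in batches:                 # batch: (B, 20, 80, 80) ndarray
            b = batch.shape[0]
            dset.resize(n + b, axis=0)        # grow the dataset in place
            dset[n:n + b] = batch
            n += b
```

In the validation loop, `batches` would be the per-iteration model outputs (moved to CPU and converted to NumPy) rather than a pre-built list, so peak memory stays at one batch plus whatever the logger needs.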

poincarelee commented 1 year ago

Hi, I ran into the same error. Can you tell me how you changed the code to write in chunks?