dkobayas-cyber commented 1 year ago

Bug description

When I try to predict 2D keypoints with HRFormer using the following code:


from mmpose.apis import inference_topdown
from mmpose.apis import init_model 

# Build pose estimator 
pose_estimator = init_model(
    pose_config,
    pose_checkpoint,
    device='cuda:0',
    cfg_options=dict(model=dict(test_cfg=dict(output_heatmaps=False)))
)

pose_results = inference_topdown(pose_estimator, img_path, bbox, 'xywh')

I get the following error message (full traceback below):

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument running_mean in method wrapper_CUDA__cudnn_batch_norm)


Cell In[2], line 20, in test_single_image(img_path, bbox_path, pose_estimator)
     18 bbox = load_bbox_and_convert_to_xywh(bbox_path, img_path)
     19 # Predict keypoints 
---> 20 pose_results = inference_topdown(pose_estimator, img_path, bbox, 'xywh')
     21 # Save the result with various metadata
     22 data_sample = merge_data_samples(pose_results)

File ~/Pose-Estimation/mmpose/mmpose/apis/inference.py:191, in inference_topdown(model, img, bboxes, bbox_format)
    189     batch = pseudo_collate(data_list)
    190     with torch.no_grad():
--> 191         results = model.test_step(batch)
    192 else:
    193     results = []

File /depot/cfrueh/apps/env_mmpose_nospyder/lib/python3.8/site-packages/mmengine/model/base_model/base_model.py:145, in BaseModel.test_step(self, data)
    136 """``BaseModel`` implements ``test_step`` the same as ``val_step``.
    137 
    138 Args:
   (...)
    142     list: The predictions of given data.
    143 """
    144 data = self.data_preprocessor(data, False)
--> 145 return self._run_forward(data, mode='predict')

File /depot/cfrueh/apps/env_mmpose_nospyder/lib/python3.8/site-packages/mmengine/model/base_model/base_model.py:326, in BaseModel._run_forward(self, data, mode)
    316 """Unpacks data for :meth:`forward`
    317 
    318 Args:
   (...)
    323     dict or list: Results of training or testing mode.
    324 """
    325 if isinstance(data, dict):
--> 326     results = self(**data, mode=mode)
    327 elif isinstance(data, (list, tuple)):
    328     results = self(*data, mode=mode)

File /depot/cfrueh/apps/env_mmpose_nospyder/lib/python3.8/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
   1496 # If we don't have any hooks, we want to skip the rest of the logic in
   1497 # this function, and just call forward.
   1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1499         or _global_backward_pre_hooks or _global_backward_hooks
   1500         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501     return forward_call(*args, **kwargs)
   1502 # Do not call functions when jit is used
   1503 full_backward_hooks, non_full_backward_hooks = [], []

File ~/Pose-Estimation/mmpose/mmpose/models/pose_estimators/base.py:140, in BasePoseEstimator.forward(self, inputs, data_samples, mode)
    138         for data_sample in data_samples:
    139             data_sample.set_metainfo(self.metainfo)
--> 140     return self.predict(inputs, data_samples)
    141 elif mode == 'tensor':
    142     return self._forward(inputs)

File ~/Pose-Estimation/mmpose/mmpose/models/pose_estimators/topdown.py:103, in TopdownPoseEstimator.predict(self, inputs, data_samples)
     99 assert self.with_head, (
    100     'The model must have head to perform prediction.')
    102 if self.test_cfg.get('flip_test', False):
--> 103     _feats = self.extract_feat(inputs)
    104     _feats_flip = self.extract_feat(inputs.flip(-1))
    105     feats = [_feats, _feats_flip]

File ~/Pose-Estimation/mmpose/mmpose/models/pose_estimators/base.py:186, in BasePoseEstimator.extract_feat(self, inputs)
    176 def extract_feat(self, inputs: Tensor) -> Tuple[Tensor]:
    177     """Extract features.
    178 
    179     Args:
   (...)
    184         resolutions.
    185     """
--> 186     x = self.backbone(inputs)
    187     if self.with_neck:
    188         x = self.neck(x)

File /depot/cfrueh/apps/env_mmpose_nospyder/lib/python3.8/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
   1496 # If we don't have any hooks, we want to skip the rest of the logic in
   1497 # this function, and just call forward.
   1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1499         or _global_backward_pre_hooks or _global_backward_hooks
   1500         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501     return forward_call(*args, **kwargs)
   1502 # Do not call functions when jit is used
   1503 full_backward_hooks, non_full_backward_hooks = [], []

File ~/Pose-Estimation/mmpose/mmpose/models/backbones/hrnet.py:583, in HRNet.forward(self, x)
    581     else:
    582         x_list.append(x)
--> 583 y_list = self.stage2(x_list)
    585 x_list = []
    586 for i in range(self.stage3_cfg['num_branches']):

File /depot/cfrueh/apps/env_mmpose_nospyder/lib/python3.8/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
   1496 # If we don't have any hooks, we want to skip the rest of the logic in
   1497 # this function, and just call forward.
   1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1499         or _global_backward_pre_hooks or _global_backward_hooks
   1500         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501     return forward_call(*args, **kwargs)
   1502 # Do not call functions when jit is used
   1503 full_backward_hooks, non_full_backward_hooks = [], []

File /depot/cfrueh/apps/env_mmpose_nospyder/lib/python3.8/site-packages/torch/nn/modules/container.py:217, in Sequential.forward(self, input)
    215 def forward(self, input):
    216     for module in self:
--> 217         input = module(input)
    218     return input

File /depot/cfrueh/apps/env_mmpose_nospyder/lib/python3.8/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
   1496 # If we don't have any hooks, we want to skip the rest of the logic in
   1497 # this function, and just call forward.
   1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1499         or _global_backward_pre_hooks or _global_backward_hooks
   1500         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501     return forward_call(*args, **kwargs)
   1502 # Do not call functions when jit is used
   1503 full_backward_hooks, non_full_backward_hooks = [], []

File ~/Pose-Estimation/mmpose/mmpose/models/backbones/hrnet.py:200, in HRModule.forward(self, x)
    197     return [self.branches[0](x[0])]
    199 for i in range(self.num_branches):
--> 200     x[i] = self.branches[i](x[i])
    202 x_fuse = []
    203 for i in range(len(self.fuse_layers)):

File /depot/cfrueh/apps/env_mmpose_nospyder/lib/python3.8/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
   1496 # If we don't have any hooks, we want to skip the rest of the logic in
   1497 # this function, and just call forward.
   1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1499         or _global_backward_pre_hooks or _global_backward_hooks
   1500         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501     return forward_call(*args, **kwargs)
   1502 # Do not call functions when jit is used
   1503 full_backward_hooks, non_full_backward_hooks = [], []

File /depot/cfrueh/apps/env_mmpose_nospyder/lib/python3.8/site-packages/torch/nn/modules/container.py:217, in Sequential.forward(self, input)
    215 def forward(self, input):
    216     for module in self:
--> 217         input = module(input)
    218     return input

File /depot/cfrueh/apps/env_mmpose_nospyder/lib/python3.8/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
   1496 # If we don't have any hooks, we want to skip the rest of the logic in
   1497 # this function, and just call forward.
   1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1499         or _global_backward_pre_hooks or _global_backward_hooks
   1500         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501     return forward_call(*args, **kwargs)
   1502 # Do not call functions when jit is used
   1503 full_backward_hooks, non_full_backward_hooks = [], []

File ~/Pose-Estimation/mmpose/mmpose/models/backbones/hrformer.py:387, in HRFormerBlock.forward(self, x)
    385 x = x + self.drop_path(self.attn(self.norm1(x), H, W))
    386 # FFN
--> 387 x = x + self.drop_path(self.ffn(self.norm2(x), H, W))
    388 x = x.permute(0, 2, 1).view(B, C, H, W)
    389 return x

File /depot/cfrueh/apps/env_mmpose_nospyder/lib/python3.8/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
   1496 # If we don't have any hooks, we want to skip the rest of the logic in
   1497 # this function, and just call forward.
   1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1499         or _global_backward_pre_hooks or _global_backward_hooks
   1500         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501     return forward_call(*args, **kwargs)
   1502 # Do not call functions when jit is used
   1503 full_backward_hooks, non_full_backward_hooks = [], []

File ~/Pose-Estimation/mmpose/mmpose/models/backbones/hrformer.py:313, in CrossFFN.forward(self, x, H, W)
    311 assert(x.is_cuda)
    312 for layer in self.layers:
--> 313     x = layer(x)
    315 x = nchw_to_nlc(x)
    316 return x

File /depot/cfrueh/apps/env_mmpose_nospyder/lib/python3.8/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
   1496 # If we don't have any hooks, we want to skip the rest of the logic in
   1497 # this function, and just call forward.
   1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1499         or _global_backward_pre_hooks or _global_backward_hooks
   1500         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501     return forward_call(*args, **kwargs)
   1502 # Do not call functions when jit is used
   1503 full_backward_hooks, non_full_backward_hooks = [], []

File /depot/cfrueh/apps/env_mmpose_nospyder/lib/python3.8/site-packages/torch/nn/modules/batchnorm.py:741, in SyncBatchNorm.forward(self, input)
    739 # fallback to framework BN when synchronization is not necessary
    740 if not need_sync:
--> 741     return F.batch_norm(
    742         input,
    743         running_mean,
    744         running_var,
    745         self.weight,
    746         self.bias,
    747         bn_training,
    748         exponential_average_factor,
    749         self.eps,
    750     )
    751 else:
    752     assert bn_training

File /depot/cfrueh/apps/env_mmpose_nospyder/lib/python3.8/site-packages/torch/nn/functional.py:2450, in batch_norm(input, running_mean, running_var, weight, bias, training, momentum, eps)
   2447 if training:
   2448     _verify_batch_size(input.size())
-> 2450 return torch.batch_norm(
   2451     input, weight, bias, running_mean, running_var, training, momentum, eps, torch.backends.cudnn.enabled
   2452 )

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument running_mean in method wrapper_CUDA__cudnn_batch_norm)

Notes

Dataset: I used my own custom dataset, but that should be fine. Training and testing work perfectly with other models (HRNet and ResNet) on the same data.
Device: I trained and tested the model on an NVIDIA A10 (CUDA version: 11.7).

Config file: I used configs/body_2d_keypoint/topdown_heatmap/coco/td-hm_hrformer-base_8xb32-210e_coco-256x192.py almost unchanged, except that it points to my custom dataset.

The configuration file is shown below:


# Original file: configs/body_2d_keypoint/topdown_heatmap/coco/td-hm_hrformer-base_8xb32-210e_coco-256x192.py

_base_ = ['../_base_/default_runtime.py']

# runtime
train_cfg = dict(max_epochs=210, val_interval=10)

# optimizer
optim_wrapper = dict(
    optimizer=dict(
        type='AdamW',
        lr=5e-4,
        betas=(0.9, 0.999),
        weight_decay=0.01,
    ),
    paramwise_cfg=dict(
        custom_keys={'relative_position_bias_table': dict(decay_mult=0.)}))

# learning policy
param_scheduler = [
    dict(
        type='LinearLR', begin=0, end=500, start_factor=0.001,
        by_epoch=False),  # warm-up
    dict(
        type='MultiStepLR',
        begin=0,
        end=210,
        milestones=[170, 200],
        gamma=0.1,
        by_epoch=True)
]

# automatically scaling LR based on the actual training batch size
auto_scale_lr = dict(base_batch_size=256)

# hooks
default_hooks = dict(checkpoint=dict(save_best='coco/AP', rule='greater'))

# codec settings
codec = dict(
    type='MSRAHeatmap', input_size=(192, 256), heatmap_size=(48, 64), sigma=2)

# model settings
norm_cfg = dict(type='SyncBN', requires_grad=True)
model = dict(
    type='TopdownPoseEstimator',
    data_preprocessor=dict(
        type='PoseDataPreprocessor',
        mean=[123.675, 116.28, 103.53],
        std=[58.395, 57.12, 57.375],
        bgr_to_rgb=True),
    backbone=dict(
        type='HRFormer',
        in_channels=3,
        norm_cfg=norm_cfg,
        extra=dict(
            drop_path_rate=0.2,
            with_rpe=True,
            stage1=dict(
                num_modules=1,
                num_branches=1,
                block='BOTTLENECK',
                num_blocks=(2, ),
                num_channels=(64, ),
                num_heads=[2],
                mlp_ratios=[4]),
            stage2=dict(
                num_modules=1,
                num_branches=2,
                block='HRFORMERBLOCK',
                num_blocks=(2, 2),
                num_channels=(78, 156),
                num_heads=[2, 4],
                mlp_ratios=[4, 4],
                window_sizes=[7, 7]),
            stage3=dict(
                num_modules=4,
                num_branches=3,
                block='HRFORMERBLOCK',
                num_blocks=(2, 2, 2),
                num_channels=(78, 156, 312),
                num_heads=[2, 4, 8],
                mlp_ratios=[4, 4, 4],
                window_sizes=[7, 7, 7]),
            stage4=dict(
                num_modules=2,
                num_branches=4,
                block='HRFORMERBLOCK',
                num_blocks=(2, 2, 2, 2),
                num_channels=(78, 156, 312, 624),
                num_heads=[2, 4, 8, 16],
                mlp_ratios=[4, 4, 4, 4],
                window_sizes=[7, 7, 7, 7])),
        init_cfg=dict(
            type='Pretrained',
            checkpoint='https://download.openmmlab.com/mmpose/'
            'pretrain_models/hrformer_base-32815020_20220226.pth'),
    ),
    head=dict(
        type='HeatmapHead',
        in_channels=78,
        out_channels=38,
        deconv_out_channels=None,
        loss=dict(type='KeypointMSELoss', use_target_weight=True),
        decoder=codec),
    test_cfg=dict(
        flip_test=True,
        flip_mode='heatmap',
        shift_heatmap=True,
    ))

# base dataset settings
dataset_type = 'TESS5KDataset'
data_mode = 'topdown'
data_root = 'data/tess5k/'

# pipelines
train_pipeline = [
    dict(type='LoadImage'),
    dict(type='GetBBoxCenterScale'),
    dict(type='RandomFlip', direction='horizontal'),
    dict(type='RandomHalfBody'),
    dict(type='RandomBBoxTransform'),
    dict(type='TopdownAffine', input_size=codec['input_size']),
    dict(type='GenerateTarget', encoder=codec),
    dict(type='PackPoseInputs')
]

val_pipeline = [
    dict(type='LoadImage'),
    dict(type='GetBBoxCenterScale'),
    dict(type='TopdownAffine', input_size=codec['input_size']),
    dict(type='PackPoseInputs')
]

# data loaders
train_dataloader = dict(
    batch_size=32,
    num_workers=2,
    persistent_workers=True,
    sampler=dict(type='DefaultSampler', shuffle=True),
    dataset=dict(
        type=dataset_type,
        data_root=data_root,
        data_mode=data_mode,
        ann_file='annotations/train.json',
        data_prefix=dict(img='data/'),
        pipeline=train_pipeline,
    ))
val_dataloader = dict(
    batch_size=32,
    num_workers=2,
    persistent_workers=True,
    drop_last=False,
    sampler=dict(type='DefaultSampler', shuffle=False, round_up=False),
    dataset=dict(
        type=dataset_type,
        data_root=data_root,
        data_mode=data_mode,
        ann_file='annotations/val.json',
        data_prefix=dict(img='data/'),
        test_mode=True,
        pipeline=val_pipeline,
    ))
test_dataloader = dict(
    batch_size=32,
    num_workers=2,
    persistent_workers=True,
    drop_last=False,
    sampler=dict(type='DefaultSampler', shuffle=False, round_up=False),
    dataset=dict(
        type=dataset_type,
        data_root=data_root,
        data_mode=data_mode,
        ann_file='annotations/test.json',
        data_prefix=dict(img='data/'),
        test_mode=True,
        pipeline=val_pipeline,
    ))

# evaluators
val_evaluator = dict(
    type='CocoMetric',
    ann_file=data_root + 'annotations/val.json')
test_evaluator = dict(
    type='CocoMetric',
    ann_file=data_root + 'annotations/test.json')

# fp16 settings
fp16 = dict(loss_scale='dynamic')
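
Since the config above sets norm_cfg to SyncBN and the traceback ends in SyncBatchNorm.forward, one way to narrow the problem down might be to load the same checkpoint with plain BN for single-GPU inference. The override below is only a sketch of that experiment, written in the same cfg_options style as the first snippet. It assumes that the backbone's norm_cfg can be overridden through cfg_options, and it is not a confirmed fix.

```python
# Assumption: swapping SyncBN for BN at load time is an isolation experiment,
# not a verified solution to the error reported in this thread.
cfg_options = dict(
    model=dict(
        # same option as in the original snippet
        test_cfg=dict(output_heatmaps=False),
        # override the norm layer type used by the HRFormer backbone
        backbone=dict(norm_cfg=dict(type='BN', requires_grad=True)),
    )
)

# Hypothetical usage, with pose_config and pose_checkpoint defined as before:
# pose_estimator = init_model(
#     pose_config, pose_checkpoint, device='cuda:0', cfg_options=cfg_options)
```

If the error disappears with this override, that would point at the SyncBN layers rather than at the dataset or the input pipeline.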

dkobayas-cyber commented 1 year ago

Some updates on this issue.

I got the same error message when testing the HRFormer on the COCO dataset using the default configuration file, configs/body_2d_keypoint/topdown_heatmap/coco/td-hm_hrformer-base_8xb32-210e_coco-256x192.py.

CLI command I used for testing


python demo/image_demo.py tests/data/coco/000000000785.jpg configs/body_2d_keypoint/topdown_heatmap/coco/hrformer_example.py work_dirs/hrformer_example/best_coco_AP_epoch_1.pth --out-file work_dirs/hrformer_example/vis_results.png

dkobayas-cyber commented 1 year ago

The problem has been resolved, but I'm still not sure which part of the model is at fault.

I found that the running_mean and running_var tensors passed to torch.nn.functional.batch_norm are sometimes on the CPU. After moving them to the GPU inside the batch_norm function, the error went away. The related post on the PyTorch forum is here.

I don't know why these tensors are on the CPU for some layers. Do you have any idea why? I would prefer to find the underlying cause rather than rely on this workaround.
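
To see which layers are affected without editing batch_norm itself, it may help to list every parameter and buffer that is not on the expected device. The helper below is a minimal sketch: tensors_off_device is a hypothetical name, and it relies only on named_parameters() and named_buffers(), which torch.nn.Module provides. BatchNorm's running_mean and running_var are buffers, so they would be reported with kind "buffer".

```python
from typing import List, Tuple


def tensors_off_device(model, expected_type: str = "cuda") -> List[Tuple[str, str, str]]:
    """Return (kind, name, device) for tensors whose device type differs.

    `model` is any object exposing `named_parameters()` and
    `named_buffers()`, as `torch.nn.Module` does.
    """
    off = []
    sources = (
        ("parameter", model.named_parameters()),
        ("buffer", model.named_buffers()),
    )
    for kind, named_tensors in sources:
        for name, tensor in named_tensors:
            # compare only the device type ("cuda" or "cpu"), not the index
            if tensor.device.type != expected_type:
                off.append((kind, name, str(tensor.device)))
    return off


# Hypothetical usage with the estimator from the first post:
# for kind, name, device in tensors_off_device(pose_estimator, "cuda"):
#     print(kind, name, device)
```

The names printed this way are the dotted module paths, so they should indicate which HRFormer blocks hold the CPU tensors.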

MotiBaadror commented 1 year ago

@dkobayas-cyber I ran into the same error in mmtrack, and in my case it was the image that was on the CPU.
In mmpose/apis/inference.py, take a look at line 185:

if data_list:
    # collate data list into a batch, which is a dict with following keys:
    # batch['inputs']: a list of input images
    # batch['data_samples']: a list of :obj:`PoseDataSample`
    batch = pseudo_collate(data_list)

If you check all the keys of this batch, you will likely find that the input images are on the CPU.
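
The check suggested above could be sketched as follows. describe_devices is a hypothetical helper that walks the nested batch and reports the device of every item that exposes a .device attribute; anything without that attribute is skipped. Note that the traceback shows test_step passing the batch through self.data_preprocessor before the forward call, and in mmengine that step is normally what moves inputs to the model's device, so CPU inputs directly after pseudo_collate may be expected.

```python
def describe_devices(obj, prefix="batch"):
    """Recursively collect (path, device) for items that have a `.device`."""
    found = []
    if isinstance(obj, dict):
        for key, value in obj.items():
            found += describe_devices(value, f"{prefix}[{key!r}]")
    elif isinstance(obj, (list, tuple)):
        for index, value in enumerate(obj):
            found += describe_devices(value, f"{prefix}[{index}]")
    elif hasattr(obj, "device"):
        # tensors expose `.device`; other objects are ignored
        found.append((prefix, str(obj.device)))
    return found


# Hypothetical usage right after `batch = pseudo_collate(data_list)`:
# for path, device in describe_devices(batch):
#     print(path, device)
```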

open-mmlab / mmpose

2D keypoint prediction by HRFormer gives RuntimeError #2236
