hongsukchoi / 3DCrowdNet_RELEASE

Official Pytorch implementation of "Learning to Estimate Robust 3D Human Mesh from In-the-Wild Crowded Scenes", CVPR 2022

Some questions about processing 2D datasets #12

Open · yuchen-ji opened this issue 2 years ago

yuchen-ji commented 2 years ago

I found that you crop the closest person for 3D datasets when there are multiple people in one image, but for 2D datasets you seem to crop every person in the image. Why do you process these datasets differently? In addition, the cropped image may contain another person; won't this bring ambiguity to the network? Thank you very much!

hongsukchoi commented 2 years ago

> I found that you crop the closest person for 3D datasets when there are multiple people in one image, but for 2D datasets you seem to crop every person in the image. Why do you process these datasets differently?

The cropping process is the same regardless of the dataset. Even if you crop the closest person for 3D datasets, there can still be other people in the cropped image. Also, MuCo, which you are referring to, does not originally have multiple people in one image; it synthesizes multiple single-person real images into one composite image using their depths.
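
Roughly, the closest-person selection looks like this (the annotation fields, function name, and crop size below are simplified for illustration, not the repository's actual data-loading code):

```python
import cv2

def crop_closest_person(img, annotations, out_size=256):
    """Pick the person closest to the camera and crop around their bbox.

    `annotations` is a simplified list of dicts with a root-joint depth in
    camera coordinates and a 2D bbox; the real annotation format differs.
    """
    # smallest root depth = closest to the camera
    target = min(annotations, key=lambda ann: ann['root_depth'])
    x, y, w, h = [int(v) for v in target['bbox']]
    # note: the crop can still contain parts of other, more distant people
    crop = img[max(y, 0):y + h, max(x, 0):x + w]
    return cv2.resize(crop, (out_size, out_size)), target
```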

> In addition, the cropped image may contain another person; won't this bring ambiguity to the network?

That is the challenge of crowded scenes, which 3DCrowdNet resolves. Please see the paper.

yuchen-ji commented 2 years ago

Thanks for your reply! I found that many top-down methods use single-person datasets for training, which prevents ambiguity during training because the cropped image does not contain another person. At inference time the cropped image often contains other people, yet these methods still regress the right person's SMPL parameters. Does this mean that if other people are included in the cropped image during training, it brings ambiguity to the network and makes training difficult? In 3DCrowdNet the cropped image contains other people even during training, but the robust 2D pose heatmap is added to resolve the ambiguity. Is my understanding correct?
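
To make my understanding concrete, I imagine the 2D pose conditioning working roughly like this (the tensor shapes, joint count, and names are only my guesses, not the actual code):

```python
import torch

def render_joint_heatmaps(joints_2d, height, width, sigma=2.0):
    """Render one Gaussian heatmap per 2D joint of the *target* person.

    joints_2d: (J, 2) pixel coordinates. Shapes and sigma are illustrative.
    """
    ys = torch.arange(height).float().view(1, height, 1)
    xs = torch.arange(width).float().view(1, 1, width)
    x = joints_2d[:, 0].view(-1, 1, 1)
    y = joints_2d[:, 1].view(-1, 1, 1)
    return torch.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))  # (J, H, W)

# The heatmaps are concatenated with image features, so even if the crop
# contains other people, the network is told which joints belong to the target.
img_feat = torch.randn(1, 64, 64, 64)                                  # dummy backbone features
heatmaps = render_joint_heatmaps(torch.rand(30, 2) * 64, 64, 64).unsqueeze(0)
conditioned = torch.cat([img_feat, heatmaps], dim=1)                   # (1, 64 + J, 64, 64)
```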

hongsukchoi commented 2 years ago

> Does this mean that if other people are included in the cropped image during training, it brings ambiguity to the network and makes training difficult?

No. I think you are confused about how deep learning works. Given accurate ground truth, a neural network becomes robust to that ambiguity during training. Then the network performs better on such ambiguous inputs at test time.
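
As a minimal illustration of that point (the dictionary keys and loss terms below are simplified assumptions, not the actual training code): only the target person's ground truth supervises the prediction, so across many crowded crops the network learns to treat the other people as context rather than as the target.

```python
import torch
import torch.nn.functional as F

def target_supervised_loss(pred, gt):
    """Supervise predictions with the *target* person's ground truth only.

    `pred`/`gt` are illustrative dicts of tensors; keys and weights are
    assumptions. Other people visible in the crop receive no supervision,
    so the network is pushed to resolve them as background, not the target.
    """
    loss_pose = F.l1_loss(pred['smpl_pose'], gt['smpl_pose'])
    loss_shape = F.l1_loss(pred['smpl_shape'], gt['smpl_shape'])
    loss_joint = F.l1_loss(pred['joints_3d'], gt['joints_3d'])
    return loss_pose + loss_shape + loss_joint
```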