microsoft / voxelpose-pytorch

Official implementation of "VoxelPose: Towards Multi-Camera 3D Human Pose Estimation in Wild Environment"
MIT License

Questions regarding your work #7

Closed hwied closed 3 years ago

hwied commented 3 years ago

Hi guys,

First of all, this is really great work; thanks a lot! I have a few questions about it, and I'm looking forward to your reply.

All the best!

Synthetic Heatmaps

I'm looking for an explanation of how you generated the synthetic heatmaps. Even though the code is well written and mostly easy to follow, comments would have been very helpful at some points. It would also be a great help if you could go into more detail about how and why you generated the synthetic heatmaps. Thank you! :)

Discussion of generalization capabilities

In your paper I'm missing a discussion of the following question: in the Panoptic dataset all cameras are at roughly equal distances from the scene, and even though you chose random cameras for training and testing, the cameras stay in a similar configuration (distance and direction) relative to the scene. Would it be possible to test a network that has been trained on the Panoptic dataset on the Campus data? That would demonstrate real generalization capabilities.

Decoupling meta data information

From your code it is not entirely clear to me whether the metadata, especially the number of persons in the image, is completely decoupled from the forward call of the model. Perhaps it would be good to set a maximum number of persons the network has to check for; currently it uses the metadata for this, if I understood correctly. I'd be happy if you could explain the metadata in more detail: what is it, and what is it used for?

Thanks in advance! :)

CHUNYUWANG commented 3 years ago

Synthetic heatmaps

We used synthetic heatmaps for the Shelf and Campus datasets. You can find the details in the corresponding dataset file, for example, https://github.com/microsoft/voxelpose-pytorch/blob/main/lib/dataset/shelf_synthetic.py
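For a quick intuition, here is a minimal sketch of the idea (not the repo's exact code; the function name, jitter magnitude, and sigma below are illustrative): given annotated 2D joint locations, a synthetic heatmap is just a Gaussian peak rendered at each joint, optionally jittered to imitate the localization noise of a real 2D pose backbone.

```python
import numpy as np

def synthetic_heatmaps(joints_2d, heatmap_size, sigma=3.0, jitter_px=2.0):
    """Render one Gaussian heatmap per annotated 2D joint.

    joints_2d:    (num_joints, 2) array of (x, y) pixel coordinates.
    heatmap_size: (width, height) of the output heatmaps.
    sigma:        std-dev of the Gaussian peak, in pixels.
    jitter_px:    random offset added to each joint to mimic the
                  localization noise of a real 2D backbone.
    """
    w, h = heatmap_size
    xs = np.arange(w)[None, :]  # (1, w)
    ys = np.arange(h)[:, None]  # (h, 1)
    heatmaps = np.zeros((len(joints_2d), h, w), dtype=np.float32)
    for j, (x, y) in enumerate(joints_2d):
        # Perturb the ground-truth location so the 3D network does not
        # overfit to perfectly clean peaks.
        x += np.random.uniform(-jitter_px, jitter_px)
        y += np.random.uniform(-jitter_px, jitter_px)
        heatmaps[j] = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
    return heatmaps
```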

Generalization

The sampled camera configurations can differ even though the cameras are uniformly distributed over the dome. But it would be interesting to see how the method performs when the camera configurations are completely different.

Meta data

The metadata is used for computing losses and saving debug images during training. The number of people in the metadata is only used during training. As you suggested, during testing we use the parameter MAX_PEOPLE_NUM, which can be seen in the configuration file: https://github.com/microsoft/voxelpose-pytorch/blob/main/configs/panoptic/resnet50/prn32_cpn48x48x12_960x512_cam5.yaml
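Conceptually, the cap works roughly like this (an illustrative sketch, not the repo's code; the function name, tensor names, and threshold value are assumptions): the proposal network scores candidate root locations over the 3D volume, and at test time it simply keeps at most MAX_PEOPLE_NUM proposals above a confidence threshold, so no ground-truth person count is needed.

```python
import torch

def select_proposals(scores, centers, max_people_num=10, threshold=0.3):
    """Keep at most `max_people_num` candidate roots, independent of metadata.

    scores:  (N,) confidence of each candidate 3D root location.
    centers: (N, 3) candidate root positions in world coordinates.
    """
    k = min(max_people_num, scores.numel())
    topk_scores, idx = torch.topk(scores, k=k)
    keep = topk_scores > threshold  # discard low-confidence proposals
    return centers[idx[keep]], topk_scores[keep]
```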

hwied commented 3 years ago

@CHUNYUWANG Thanks for the answers!

Did I understand the code (train_3d) correctly that for Campus and Shelf you don't train a backbone net to generate the heatmaps, but instead generate the heatmaps directly from the annotations? From the paper I understood that you generate this synthetic data to train a backbone... For the Panoptic dataset you have a pretrained backbone that generates the heatmaps but is not trained in your setup. Correct?

Where was the backbone trained, and is there a publication for it? If there is no publication, could you please describe the conditions under which it was trained?

Regarding the camera configuration again: the net gets just the image and no information about each camera's position, correct? How is the 3D coordinate system fixed in space? If there is no information on the exact camera positions, I'd assume that for Campus and Shelf, with a single camera set, the coordinate system is "learned", but for the varying camera sets in Panoptic I'd guess that the origin of the coordinate system might jump around and be subject to noise, wouldn't it? Or did you make sure that the center of each camera view is the same and take that as the origin?

Thanks in advance!