google / mannequinchallenge

Inference code and trained models for "Learning the Depths of Moving People by Watching Frozen People."
https://google.github.io/mannequinchallenge
Apache License 2.0
490 stars 104 forks

Inference of the multi-view model - guidance required #9

Open Tetsujinfr opened 4 years ago

Tetsujinfr commented 4 years ago

Ok, so thanks for all of this, but if I want to run inference with your advanced model on something other than the preprocessed TUM dataset you provide, using the two views, masks, optical flow and keypoints, does that mean I have to generate all of those inputs in the appropriate formats and cast them into the expected HDF5 file structure, all without any specification?

For the RGB images I can manage, but for the flow input, what is the expected format? For the optional keypoints, what is the expected data format? Can your model support more/fewer keypoints than the ones currently in your dictionary lists? Reverse-engineering your code is not fun.

In the TUM HDF5 files there are more than 5 data subsets (I expected 2 RGB images, one binary mask image, one RGB flow image, and one vector of keypoint pairs), but there are also some strange low-resolution matrices in the file. I do not know whether those are needed only for training or for inference as well.

Any chance you can give some further guidance on the required inputs and their formats? Thanks

fcole commented 4 years ago

Hi, yes, the code could be easier to understand, sorry about that. To run the full model, you basically need to fill in the dictionary of values specified here:

https://github.com/google/mannequinchallenge/blob/3448d9d49dc130db7ed18053b70f66bc157d238f/loaders/image_folder.py#L180

These correspond to the various buffers mentioned in the loss definitions in the paper. You don't need to create your own HDF5s as long as you can create a dictionary including those buffers. Hope that helps.
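To make that concrete, a minimal sketch of such a dictionary might look like the following. This is untested; the field names follow this thread's discussion of load_tum_hdf5, and the shapes, dtypes, and zero-stubs are assumptions to be checked against loaders/image_folder.py rather than something guaranteed by the repo:

```python
import numpy as np

def build_input_dict(img_1, flow, human_mask):
    """Untested sketch: assemble the per-frame dictionary that the loader
    would otherwise read from a TUM HDF5 file. Field names follow this
    thread; shapes, dtypes and the zero-stubs are assumptions."""
    h, w = img_1.shape[:2]
    return {
        'img_1': img_1.astype(np.float32),            # H x W x 3 RGB, values in [0, 1]
        'gt_depth': np.zeros((h, w), np.float32),     # not needed for inference
        'lr_error': np.zeros((h, w), np.float32),     # left-right consistency error
        'human_mask': human_mask.astype(np.float32),  # 1 = human, 0 = background
        'angle_prior': np.zeros((h, w), np.float32),  # C_pa term (see supp. material)
        'pp_depth': np.zeros((h, w), np.float32),     # parallax (P+P) depth
        'flow': flow.astype(np.float32),              # assumed raw H x W x 2 flow
        'T_1_G': np.eye(4, dtype=np.float32),         # global -> reference pose
        'T_2_G': np.eye(4, dtype=np.float32),         # global -> source pose
        'intrinsic': np.eye(3, dtype=np.float32),     # 3 x 3 camera intrinsics
        'keypoints_img': np.zeros((h, w), np.float32),
    }
```

Fields that are only used by the training losses (e.g. gt_depth) are stubbed with zeros here, on the assumption that they are ignored at inference time, as discussed further down in this thread.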

rmbashirov commented 4 years ago

How do I get gt_depth, human_mask, flow, and keypoints_img in your format for my own data, so that I can run inference with your full model?

rmbashirov commented 4 years ago

Ok, I realise that providing a full pipeline for running the full model on arbitrary data is probably not feasible for you.

Could you instead provide the inference results of your full model on the MC dataset?

fcole commented 4 years ago

Unfortunately, we don't have permission to share image-like results (e.g., depth buffers) from the MC dataset. Sorry about that.

For inference, you shouldn't need the gt_depth, and the model with keypoints input performs only marginally better than the model without, so the only things you really need are flow and the human mask.
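For later readers, here is a rough, illustrative way to produce those two buffers. The paper's pipeline uses FlowNet2 and Mask R-CNN; OpenCV's Farneback flow and a precomputed person mask are only stand-ins here, and the exact resolution and normalization expected by the loader are assumptions to verify:

```python
import cv2
import numpy as np

def prepare_flow_and_mask(ref_path, src_path, mask_path):
    """Illustrative only: build the two buffers needed for inference.
    Farneback flow is a stand-in for FlowNet2; the mask is assumed to be
    a person segmentation (e.g. from Mask R-CNN) saved as a grayscale image."""
    img_1 = cv2.imread(ref_path)          # reference frame (BGR)
    img_2 = cv2.imread(src_path)          # source frame (BGR)

    gray_1 = cv2.cvtColor(img_1, cv2.COLOR_BGR2GRAY)
    gray_2 = cv2.cvtColor(img_2, cv2.COLOR_BGR2GRAY)

    # Dense H x W x 2 flow from reference to source (stand-in for FlowNet2).
    flow = cv2.calcOpticalFlowFarneback(
        gray_1, gray_2, None, 0.5, 3, 15, 3, 5, 1.2, 0)

    # Binary human mask: 1 = person, 0 = background.
    human_mask = (cv2.imread(mask_path, cv2.IMREAD_GRAYSCALE) > 127).astype(np.float32)

    return flow.astype(np.float32), human_mask
```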

Tetsujinfr commented 4 years ago

Thanks for the pointers above. I looked at the load_tum_hdf5 function, and I have a few questions.

A) the code reads 11 objects:

 - img_1: is it just a 24-bit RGB NumPy array of the image we are trying to infer on?

 - gt_depth: I can ignore it since I just want to run inference, but can I simply comment out this piece of code and all downstream references to this object, or do I need to supply a dummy input?

 - lr_error: what is that? Can I ignore it the same way as gt_depth? It looks like it is used to compute the confidence map, which seems to be a key input to your model, no?

 - human_mask: I assume this is a binary mask of the same size as img_1, right? What format is expected: 0.0 = transparent and 1.0 = opaque, i.e. the mask shape? (A black/white RGB image, I assume?)

 - angle_prior: what is that? Is it the second image? It looks like it is used to compute the confidence map, which seems to be a key input to your model, no?

 - pp_depth: what is that? It looks like it is used to compute the confidence map, which seems to be a key input to your model, no?

 - flow: the output of FlowNet2, I assume, but is it a 24-bit RGB image or the raw flow data from FlowNet's .flo files? Does it need to have exactly the same height × width as img_1?

 - T_1_G: what is that? It looks like it is used to compute the confidence map, which seems to be a key input to your model, no?

 - T_2_G: same as for T_1_G

 - intrinsic: same as for T_1_G

 - keypoints_img: can I just input a keypoint image from OpenPose, for instance? Do the points need to be single pixels? Is there a particular colouring scheme for each point that needs to be followed, or can I just use OpenPose's colouring?

Thanks a lot for your guidance on this.

zhengqili commented 4 years ago

Hi, I am the first author of this paper.

 - img_1: should be an RGB image with values between 0 and 1.
 - lr_error: the left-right consistency error, corresponding to C_lr in Eq. 5 of the supplementary material: http://www.cs.cornell.edu/~zl548/images/mannequin_depth_cvpr2019_supp_doc.pdf
 - human_mask: the binary mask, where 1 indicates human and 0 indicates background.
 - angle_prior: C_pa in Eq. 5 of the supplementary material.
 - pp_depth: depth from motion parallax using the P+P representation in Eq. 4 of the supplementary material.
 - T_1_G: the 4×4 homogeneous transformation matrix from global to the reference image, as described in the paper.
 - T_2_G: the 4×4 homogeneous transformation matrix from global to the source image, as described in the paper.
 - intrinsic: the 3×3 intrinsic matrix.
 - keypoints_img: you can use any keypoint detection algorithm you want, but you have to normalize the keypoint indices based on MaskRCNN. In particular, in https://github.com/roytseng-tw/Detectron.pytorch/blob/master/lib/utils/vis.py, lines 198-199 give i1 = kp_lines[l][0] and i2 = kp_lines[l][1]; you need to normalize these as final_i1_value = (i1 + 1.0) / 18.0 and final_i2_value = (i2 + 1.0) / 18.0.
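A small sketch of that normalization, for anyone assembling keypoints_img by hand. The (i + 1) / 18 rule is exactly what is described above; the example indices and the way the normalized values are written into the image are assumptions and should be checked against the provided TUM HDF5 files:

```python
import numpy as np

def normalize_kp_index(i):
    # Maps a MaskRCNN/COCO keypoint index (0-16) into (0, 1),
    # per the (i + 1) / 18 rule described above.
    return (i + 1.0) / 18.0

def rasterize_keypoints(height, width, keypoints):
    """Assumed construction of keypoints_img: an H x W map whose pixels
    carry the normalized index of the keypoint detected there.
    `keypoints` is a list of (coco_index, x, y) tuples from any detector."""
    kp_img = np.zeros((height, width), dtype=np.float32)
    for idx, x, y in keypoints:
        xi, yi = int(round(x)), int(round(y))
        if 0 <= xi < width and 0 <= yi < height:
            kp_img[yi, xi] = normalize_kp_index(idx)
    return kp_img

# Example: left shoulder (index 5) and left elbow (index 7) in COCO ordering.
print(normalize_kp_index(5), normalize_kp_index(7))  # 0.333..., 0.444...
```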

Please send me an email (zl548@cornell.edu) with any further questions, since I seldom reply on Github.