DensePose relies on the Caffe2 and Detectron frameworks. I installed the latest version of DensePose on my machine:
- System: Ubuntu 18.04
- Graphics card: NVIDIA GeForce GTX 1050 Ti
- Graphics card memory: 4096 MB
- CUDA: 10.0
- cuDNN: 7.3.1
- Caffe2: built from source (this version)
Following the official instructions, I couldn't install DensePose successfully, but this very useful blog provided solutions to all the problems I encountered (specifically 2.1, 2.2, 2.7 and 2.9). After applying them, I managed to complete the installation.
DensePose maps all pixels of an RGB image belonging to humans to the 3D surface of the human body. It relies on DensePose-RCNN to obtain dense part indexes and coordinates within each of the selected parts (IUV representation).
Note: at the current stage, all the provided tools are in Python.
I followed this tutorial to run inference on a dataset acquired using the RealSense. In the example, they use a ResNet-101-FPN, which I couldn't run on my machine due to memory problems (apparently 4 GB is not enough for this network). However, I could use the ResNet-50-FPN through infer_simple.py:
```bash
python2 tools/infer_simple.py \
    --cfg configs/DensePose_ResNet50_FPN_s1x-e2e.yaml \
    --output-dir '/home/vvasco/dev/datasets/r1/latency-dataset/densepose-infer/' \
    --image-ext ppm \
    --wts https://dl.fbaipublicfiles.com/densepose/DensePose_ResNet50_FPN_s1x-e2e.pkl \
    '/home/vvasco/dev/datasets/r1/latency-dataset/img/'
```
This tool outputs, for each image of the input dataset:

- `*.pdf`: image containing bounding boxes around people;
- `*_IUV.png`: image containing the part indexes I (24 surface patches) and their U and V coordinates;
- `*_INDS.png`: image containing the segmented parts.
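For reference, here is a minimal Python sketch of how these outputs can be read back; the file names are placeholders. In the DensePose notebooks the images are loaded with cv2, where channel 0 of `*_IUV.png` holds the part index I; with other loaders the channel order may differ.

```python
import cv2
import numpy as np

IUV = cv2.imread('image_IUV.png')        # H x W x 3, hypothetical file name
INDS = cv2.imread('image_INDS.png', 0)   # H x W person labels, 0 = background

I = IUV[:, :, 0]                         # 0 = background, 1..24 = surface patch
U, V = IUV[:, :, 1], IUV[:, :, 2]        # patch coordinates, scaled to 0..255

print('patches present:', np.unique(I))
print('person labels:', np.unique(INDS))
```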
This is a comparison between `yarpOpenPose` and `DensePose`:

| yarpOpenPose | DensePose |
|---|---|
Qualitatively, DensePose seems to work well in terms of separating the human parts correctly, also when the person is moving. However, occluded parts completely disappear (for example, the arm and the hand when behind the chair).
Note: I had to change detectron/utils/vis.py according to this in order to get non-empty INDS images.
In addition to the IUV representation, it is also possible to map the predicted points onto the SMPL 3D human model.
This notebook shows how to map IUV values onto the 3D SMPL model, but it relies on a file demo_dp_single_ann.pkl, with no reference on how to construct it.
There is an open PR which does not rely on any such file and also speeds up the conversion from IUV to XYZ points on the SMPL model; I'm not sure why it has not been merged, though. I used this fork for the mapping (a sketch of the underlying conversion is given after the result below), and this is the result on a single image, with the 3D points in red and the model in black:
| Segmented image | 3D points mapped on SMPL |
|---|---|
The face is not fully mapped, as it is not fully visible in the image, but the points look correctly mapped onto the different patches of the model. We can also distinguish the frontal part of a person from the posterior part (with yarpOpenPose this is not possible, unless we associate a face with the skeleton). It looks promising!
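For reference, the conversion in those notebooks boils down to a per-pixel lookup. Below is a rough sketch, assuming the DensePoseMethods helper shipped with the DensePose notebooks (IUV2FBC and FBC2PointOnSurface, as used there) and a pre-loaded SMPL vertex array `Vertices`; the import path and exact signatures should be checked against your checkout.

```python
import numpy as np
# Assumption: densepose_methods.py ships alongside the DensePose notebooks;
# the import path below may differ in your checkout.
from densepose_methods import DensePoseMethods

DP = DensePoseMethods()

# INDS, IUV come from inference; pick_idx selects one person label;
# Vertices is the (6890 x 3) SMPL vertex array loaded separately.
ys, xs = np.where(INDS == pick_idx)
IUV_pick = IUV[ys, xs]                            # N x 3 rows of (I, U, V)

collected = []
for i, u, v in IUV_pick.astype(np.float64):
    # the png stores U and V scaled to 0..255, while the lookup expects 0..1
    face, b1, b2, b3 = DP.IUV2FBC(int(i), u / 255.0, v / 255.0)
    collected.append(DP.FBC2PointOnSurface(face, b1, b2, b3, Vertices))
collected = np.asarray(collected)                 # N x 3 points on the model surface
collected_x, collected_y, collected_z = collected.T
```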
However, there are several points that might be critical:
Awesome analysis @vvasco 🥇
Here below I've tried to summarize the essential traits you identified; please correct me if I'm wrong:

- DensePose is in Python only, as of now.
- In real-time contexts, it provides us with a richer set of body features in 2D compared with OpenPose.
- If we want to extract 3D info instead, DensePose seems to require 8 [s] per image.

> However, occluded parts completely disappear (for example the arm and the hand when behind the chair).

I think I didn't get this point. From the snapshots, it looks like this holds for both OpenPose and DensePose.
Let me expand on this point a bit. What I mean is that if there is an occlusion in DensePose, whole body parts get lost (even if they are not entirely occluded). In yarpOpenPose, when key points are missed due to occlusions in 2D, we can still reconstruct them in 3D by applying limb optimization. This might be more difficult when dealing with body parts and would require further investigation.
> Here below I've tried to summarize the essential traits you identified; please correct me if I'm wrong:
>
> - DensePose is in Python only, as of now.
> - In real-time contexts, it provides us with a richer set of body features in 2D compared with OpenPose.
> - If we want to extract 3D info instead, DensePose seems to require 8 [s] per image.
Exactly! Let me also stress that the examples I found only deal with a single person, and it might not be straightforward to handle multiple people.
That's great! Thanks @vvasco for this very exhaustive report. I think we now have a very clear picture of where DensePose stands with respect to our methodology.
Hey, I followed your steps, but in my case the number of final points on the SMPL is always 0 (picked_person = INDS==1); the output is 'num pts on picked person: (0,) (0, 3)'. The earlier IUV visualizations are all good, so I don't know where the problem is. Any help will be appreciated!
I found that the problem is that this notebook only works for single-person images. If I input an image with more than one person, the number of points on the picked person is 0.
Hi @wine3603, thanks for your interest in this issue! The problem is exactly the one you spotted: when you have an image with multiple people, while you can create the IUV representation, you cannot map the points onto the 3D SMPL model.
The notebook actually includes a pick index that, intuitively, should be used to select a person from an image with multiple people. Instead, in this case no points are found, whatever index you select. It only works on images with a single person.
Hey @vvasco, thanks for your reply. I am trying to find out how the INDS.png is generated. I am wondering: if INDS==0 marks the background masks, does INDS==1 indicate all the human masks or just the first human?
Hi @wine3603, I don't think there is a specific order in the INDS values. INDS=0 corresponds to the background, and then INDS can take different non-zero values (not necessarily 1) according to the number of people. For example, if you open this INDS image (if you have Matlab, you can use imread), you will see that INDS takes several values, all different from 1.
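A quick way to check this in Python (the file name here is a placeholder) is to list the labels actually present and pick, say, the largest segment instead of hard-coding an index:

```python
import cv2
import numpy as np

INDS = cv2.imread('image_INDS.png', 0)                 # person labels, 0 = background
labels, counts = np.unique(INDS[INDS > 0], return_counts=True)
print('person labels:', labels)                        # often not simply 1, 2, 3, ...
pick_idx = labels[np.argmax(counts)]                   # label of the largest segment
ys, xs = np.where(INDS == pick_idx)
```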
Hi @vvasco, thanks a lot, now I know where my misunderstanding was. I followed this notebook, and I found that in In[4]:
```python
pick_idx = 1  # PICK PERSON INDEX!
C = np.where(INDS == pick_idx)
```
I thought the people masks were labeled with different ID numbers, which is why this png is named "index". Now I understand it is not an integer ID: this "1" and the background's "0" are just used for boolean indexing... In the case of multi-human images, we have to find a way to generate INDS.png with human IDs. Do you have any ideas?
The IUV representation provides the detected part indexes and their pixel coordinates. So you might first transform this into a temporary representation where you assign 1 to all the detected parts. You could then extract the positions of the bounding boxes from the pdf image and use them to cluster the detected humans: all the 1s belonging to the same bounding box form a cluster, and the cluster ID would finally be the human ID. A rough sketch of this idea follows below.
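Here is a minimal sketch of that clustering, assuming the part-index channel of `*_IUV.png` as input and a hypothetical list of (x1, y1, x2, y2) person boxes recovered from the detection step:

```python
import numpy as np

def build_inds_with_ids(parts, boxes):
    # parts: H x W part-index channel of *_IUV.png, 0 = background
    # boxes: hypothetical list of (x1, y1, x2, y2) person detections
    inds = np.zeros(parts.shape, dtype=np.int32)
    ys, xs = np.nonzero(parts)                          # the "all 1s" pixels
    for person_id, (x1, y1, x2, y2) in enumerate(boxes, start=1):
        inside = (xs >= x1) & (xs < x2) & (ys >= y1) & (ys < y2)
        inds[ys[inside], xs[inside]] = person_id        # cluster ID = human ID
    return inds
```

Note that pixels falling inside two overlapping boxes would simply get the last matching ID; resolving those cases would need an extra rule (for example, preferring the smaller box or the closer box centre).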
Hi @vvasco, I am trying to map multi-view images of one person onto the SMPL model. May I ask which visualizer you are using for the "3D points mapped on SMPL" image? I want to try mapping 4 images from 4 corner cameras.
Hi @wine3603, I used this notebook to generate the image. I added the section below to the notebook to make the plot interactive, using the plotly library:
```python
import numpy as np
import plotly.graph_objs as go
from plotly.offline import iplot

# SMPL model vertices, drawn in black
trace1 = go.Scatter3d(
    x=Z, y=X, z=Y,
    mode='markers',
    marker=dict(
        color='black',
        size=0.5
    ),
)
# points collected on the person, coloured by pixel index
trace2 = go.Scatter3d(
    x=collected_z, y=collected_x, z=collected_y,
    mode='markers',
    marker=dict(
        color=np.arange(IUV_pick.shape[0]),
        size=1
    ),
)
data = [trace1, trace2]
layout = go.Layout(
    title='Points on the SMPL model',
    showlegend=False,
    scene=dict(
        xaxis=dict(
            range=[-1.0, 1.0],
            title='z'
        ),
        yaxis=dict(
            range=[-1.0, 1.0],
            title='x'
        ),
        zaxis=dict(
            range=[-1.4, 1.0],
            title='y'
        ),
    )
)
fig = dict(data=data, layout=layout)
iplot(fig)
```
with `X, Y, Z` identifying the model and `collected_x, collected_y, collected_z` being the points picked on the person.
@wine3603 IUV is Index, U coordinates, V coordinates. INDS has a total of 24 values, and each of the 24 values represents a different part of the body. INDS==1 is all the coordinates for the back, so if your picture is facing forward, there won't be any coordinates that correspond to the back. I checked all the parts of the body and found out which number represents which. So, if you want all the coordinates of the full body, try INDS >= 1; if you want specific body parts, use the numbers that represent each part (see the sketch below).
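A small sketch of that selection in Python; note that in the outputs described earlier in the thread the 24 part indexes live in the I channel of `*_IUV.png`, while `*_INDS.png` carries the person segmentation, so the indexing below uses the IUV channel (the file name and the example patch indices are assumptions):

```python
import cv2
import numpy as np

IUV = cv2.imread('image_IUV.png')
I = IUV[:, :, 0]                       # 0 = background, 1..24 = body patches

full_body = np.where(I >= 1)           # pixels belonging to any patch
single_patch = np.where(I == 1)        # one specific patch, e.g. index 1
head = np.where(np.isin(I, [23, 24]))  # example subset; verify the indices yourself
```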
Thanks for your replies! @vvasco @frankkim1108 I don't mean to select more specific body parts from the UV map; I want to combine 4 UV maps generated from 4 viewpoints around the target body. Is there a good way to fuse the different UV maps into one model and handle the overlapping parts?
DensePose maps human pixels of 2D RGB images to a 3D surface-based model of the body. Resource: https://research.fb.com/facebook-open-sources-densepose/