vchoutas / smplify-x

Expressive Body Capture: 3D Hands, Face, and Body from a Single Image
https://smpl-x.is.tue.mpg.de/

estimating camera and shape only once for the whole sequence? #187

Open matejhof opened 1 year ago

matejhof commented 1 year ago

The way we are running SMPLify, it re-estimates shape and camera position for every frame. In our datasets, we know that the person (and therefore the shape) is the same throughout and that the camera position is fixed. Could we estimate the shape and the camera from the whole session (e.g., a 5-minute video), fix them, and then run through the video again estimating only the poses?
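A minimal sketch of the first half of this two-pass idea, assuming the per-frame shape coefficients and camera translations from an ordinary first pass have already been saved. The function `aggregate_session` and the array names are hypothetical and not part of smplify-x; the median is just one plausible way to combine frames.

```python
import numpy as np

def aggregate_session(betas_per_frame, cam_t_per_frame):
    """Collapse per-frame estimates into one shape and one camera translation.

    Both inputs are hypothetical arrays saved from a first pass:
    betas_per_frame has shape (num_frames, num_betas),
    cam_t_per_frame has shape (num_frames, 3).
    """
    betas = np.asarray(betas_per_frame, dtype=float)
    cam_t = np.asarray(cam_t_per_frame, dtype=float)
    # The per-coordinate median is less sensitive to a few badly
    # fitted frames than the mean.
    return np.median(betas, axis=0), np.median(cam_t, axis=0)

# Toy data: 5 frames, 10 shape coefficients, one badly fitted frame.
rng = np.random.default_rng(0)
betas = 0.5 + 0.01 * rng.standard_normal((5, 10))
betas[2] = 8.0  # outlier frame
cam_t = np.tile([0.0, 0.0, 2.5], (5, 1))

fixed_betas, fixed_cam_t = aggregate_session(betas, cam_t)
```

The second pass would then keep `fixed_betas` and `fixed_cam_t` constant and optimise only the pose for each frame.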

Baalon commented 1 year ago

From what I can see, the camera optimisation is at lines 313 to 368 of fit_single_frame.py (https://github.com/vchoutas/smplify-x/blob/master/smplifyx/fit_single_frame.py). This function is called directly from main, at line 245, in the loop that processes frames individually.

To avoid re-estimating for every frame, I guess the easiest way is to run this only on the first frame, return the camera settings, and then reuse these settings for the subsequent frames. There is also some camera reset / initialisation at lines 266 to 303 that needs to be handled in a similar way.

As for shape, my guess is that it is estimated within the functions called between lines 389 and 439, which optimise the body model.
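The first-frame approach can be sketched as a control-flow pattern. This is only an illustration under assumptions: `fit_sequence`, `estimate_camera_and_shape`, and `fit_pose` are hypothetical names, not functions from smplify-x, and the stand-in callables below return placeholder values.

```python
def fit_sequence(frames, estimate_camera_and_shape, fit_pose):
    """Estimate camera and shape on the first frame only, then reuse them.

    estimate_camera_and_shape(frame) -> (camera, shape)  # hypothetical
    fit_pose(frame, camera, shape) -> pose               # hypothetical
    """
    camera = shape = None
    poses = []
    for index, frame in enumerate(frames):
        if index == 0:
            # The expensive estimation runs once for the whole sequence.
            camera, shape = estimate_camera_and_shape(frame)
        # Every frame, including the first, is fitted with the fixed values.
        poses.append(fit_pose(frame, camera, shape))
    return camera, shape, poses


# Stand-in callables that only exercise the control flow.
calls = {"camera_and_shape": 0}

def fake_estimate(frame):
    calls["camera_and_shape"] += 1
    return {"focal_length": 5000.0}, [0.0] * 10  # placeholder values

def fake_fit_pose(frame, camera, shape):
    return {"frame": frame, "focal_length": camera["focal_length"]}

camera, shape, poses = fit_sequence(["f0", "f1", "f2"], fake_estimate, fake_fit_pose)
```

In the real code the equivalent change would be to return the camera and shape from the first call and pass them back in for later frames, so that the corresponding optimisation stages are skipped.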

However, the drawback is that if the first frame's estimates are poor, they will degrade the fit for every subsequent frame.

Your suggestion to take estimates from the whole session comes with a significant extra processing cost, as well as a decision problem about which settings to use:

  • Re-project the 3D model onto the 2D image and select the settings of the frame with the highest overlap? A similar tool (https://github.com/nkolot/SPIN) seems to use this kind of metric during its optimisation process, but it will most likely increase the processing time significantly. As far as I know, detecting shapes such as the human body is not trivial either, and might be prone to errors as well.

lllllialois commented 1 year ago

It seems that current estimates of human body shape are not particularly accurate. In my experiments, I found that I could not estimate the shape for inputs that were too fat or too thin.
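The re-projection idea raised earlier in the thread can be sketched as follows. As an assumption, this uses confidence-weighted 2D keypoint re-projection error as a stand-in for the overlap metric, and a plain pinhole camera. All function names, parameters, and the toy data are hypothetical and are not taken from smplify-x or SPIN.

```python
import numpy as np

def project(joints_3d, focal_length, center, cam_t):
    """Pinhole projection of (J, 3) joints to (J, 2) pixel coordinates."""
    pts = joints_3d + cam_t
    return focal_length * pts[:, :2] / pts[:, 2:3] + center

def reprojection_error(joints_3d, keypoints_2d, conf, focal_length, center, cam_t):
    """Confidence-weighted mean pixel distance between projected and detected joints."""
    projected = project(joints_3d, focal_length, center, cam_t)
    dist = np.linalg.norm(projected - keypoints_2d, axis=1)
    return float(np.sum(conf * dist) / np.sum(conf))

def best_frame(candidates):
    """Return the index of the candidate with the lowest error, plus all errors."""
    errors = [reprojection_error(**c) for c in candidates]
    return int(np.argmin(errors)), errors

# Toy data: four joints, detections generated with the "true" translation.
joints = np.array([[0.0, 0.0, 0.0], [0.2, 0.5, 0.1],
                   [-0.2, 0.5, 0.0], [0.0, -0.6, 0.05]])
center = np.array([320.0, 240.0])
true_t = np.array([0.0, 0.0, 2.5])
keypoints = project(joints, 1000.0, center, true_t)
common = dict(joints_3d=joints, keypoints_2d=keypoints, conf=np.ones(4),
              focal_length=1000.0, center=center)

candidates = [
    dict(common, cam_t=np.array([0.1, 0.0, 2.5])),  # camera estimate that is off
    dict(common, cam_t=true_t),                     # camera estimate that matches
]
best_index, errors = best_frame(candidates)
```

The settings of the frame at `best_index` would then be the ones fixed for the rest of the sequence. Scoring every frame this way still requires a full fit per frame, which is the extra processing cost mentioned above.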