patrikhuber / superviseddescent

C++11 implementation of the supervised descent optimisation method
http://patrikhuber.github.io/superviseddescent/
Apache License 2.0

Implementation limitation #20

Open genawass opened 8 years ago

genawass commented 8 years ago

The original SDM demo was released in the Intraface project (Matlab and C++ versions). Compared to Intraface, your implementation is not as robust and stable. Of course the reasons may be (1) descriptor type (Intraface uses a modified SIFT), (2) training database (Intraface uses movies for tracking), (3) implementation details. Have you compared your SDM implementation with Intraface?

patrikhuber commented 8 years ago

Hi!

Yes, we did compare: our supervised descent model achieves a state-of-the-art average error of 0.071 on COFW, which is even a bit better than reported in our paper (http://www.patrikhuber.ch/files/RCRC_SPL_2015.pdf).

You're right, however, that the pre-trained LFPW model is not as good. But it's not really possible to compare on the "original LFPW", since there is no standard set of images (everybody uses a different subset, because the images have to be downloaded from the web). In fact, we didn't use around 200 or so of the images that they used for training.

Since the Intraface authors have never released their code (they say in their paper that they released code, but they only ever released a library in binary form), and they will not release it, it's not possible to compare implementations. Our benchmarks show it's working fine though; it's just a matter of choosing a perturbation/initialisation strategy, a large enough training database, and tuning the feature extraction windows a bit. I hope I'll have some time soon-ish to replace the model with one trained on a larger database, but you can actually do that quite easily yourself with rcr-train.

Regarding their SIFT: we tested the HOG descriptor quite extensively and vlhog is really good. I don't think Xiong's xxsift is better; I think it's just a bit faster.

genawass commented 8 years ago

Hi Patrik, thank you for your reply. I was referring to real-time tracking, and as I noticed, superviseddescent restarts much more frequently compared to Intraface. Actually, the landmark alignment from detection usually works fine; the problem is with tracking between frames, since there are many more possibilities for the model to learn the deltas and features. I think that in order to solve it, you need to train with movies, as was mentioned in the article. There is an online movie database: http://ibug.doc.ic.ac.uk/resources/300-VW/

patrikhuber commented 8 years ago

Hi, ah, I see. Yep, I actually haven't spent much time on tracking (yet). Our tracking is quite stupid currently, particularly the reinitialisation. I don't think we necessarily need to train on a movie database (though it would surely help too); just training on a few more images and with a different training strategy should go a long way. There's actually a great discussion in #19 with a lot of information!

I was actually working on 300-VW a few weeks ago and hope to be able to continue at some point in the next few months. In the meantime, I'm happy to help in any way I can.

genawass commented 8 years ago

Yeah, I've seen the discussion in #19 and I have to disagree. Actually, I think I've closely reverse-engineered the Intraface implementation. First of all, at initialisation from the previous frame, they resize and rotate the image in order to reproduce the image conditions that were used in training. As you know, SIFT is rotation invariant; however, an 8-bin SIFT means 180/8 degrees per bin, so you need to rotate the image to normalise the conditions if the face is rotated too much. Second, you have to initialise the shape with the mean shape (aligned on the image based on the previous frame's shape) and not with the previous shape, since the regressor won't be able to learn all the statistics of the different initialisations, and the shape will diverge very fast due to error accumulation (within a couple of frames). I have an implementation of SDM and it performs very similarly to Intraface, with the only difference being the descriptor. I'm using vl_sift and vl_hog, and they are too slow compared to xx_sift. It would be nice if the code of xx_sift were revealed.

patrikhuber commented 8 years ago

The rotation is actually something I saw in a presentation by Tim Cootes yesterday; I definitely want to try that. Your insight regarding the initialisation in tracking is very interesting. It's certainly a likely possibility that seems to be backed up by your experiments. Thank you for sharing that! Isn't xxsift just a kind of PCA-SIFT, that is, learning a PCA on the descriptors offline and then using it to reduce the dimensionality of the descriptor?

Thanh-Binh commented 8 years ago

Could you please explain to me how to calculate "the shape with mean shape (aligned on the image based on previous frame shape)"?

genawass commented 8 years ago

You can use Procrustes analysis to find the rotation, translation, and scaling that align the mean shape with the previous frame's shape.
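
For illustration, here is a minimal sketch of that alignment step, assuming OpenCV types and 2D landmarks (the function name and signature are illustrative, not part of superviseddescent's API). It uses the closed-form least-squares solution for a 2D similarity transform:

```cpp
#include <opencv2/core/core.hpp>
#include <vector>

// Similarity-transform Procrustes: finds scale s, rotation R and translation t
// such that s * R * mean_shape + t best aligns (least squares) with prev_shape.
std::vector<cv::Point2f> align_mean_shape(const std::vector<cv::Point2f>& mean_shape,
                                          const std::vector<cv::Point2f>& prev_shape)
{
    const std::size_t n = mean_shape.size();
    // Centroids of both shapes:
    cv::Point2f mu_m(0, 0), mu_p(0, 0);
    for (std::size_t i = 0; i < n; ++i) {
        mu_m += mean_shape[i];
        mu_p += prev_shape[i];
    }
    mu_m *= 1.0f / n;
    mu_p *= 1.0f / n;

    // Accumulate the terms of the closed-form 2D similarity solution:
    float a = 0.0f, b = 0.0f, norm_m = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        const cv::Point2f m = mean_shape[i] - mu_m;
        const cv::Point2f p = prev_shape[i] - mu_p;
        a += m.x * p.x + m.y * p.y;       // dot-product term
        b += m.x * p.y - m.y * p.x;       // cross-product term
        norm_m += m.x * m.x + m.y * m.y;
    }
    const float s_cos = a / norm_m; // = s * cos(theta)
    const float s_sin = b / norm_m; // = s * sin(theta)

    // Apply s*R plus the translation to every mean-shape point:
    std::vector<cv::Point2f> aligned(n);
    for (std::size_t i = 0; i < n; ++i) {
        const cv::Point2f m = mean_shape[i] - mu_m;
        aligned[i].x = s_cos * m.x - s_sin * m.y + mu_p.x;
        aligned[i].y = s_sin * m.x + s_cos * m.y + mu_p.y;
    }
    return aligned;
}
```

If you also want the explicit rotation angle and scale (e.g. for the image normalisation discussed below), they can be recovered as `theta = atan2(s_sin, s_cos)` and `s = sqrt(s_cos*s_cos + s_sin*s_sin)`.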

Thanh-Binh commented 8 years ago

Thanks. I think that is the way Patrik uses for his tracking? To Patrik: what do you think?

To you + Patrik: could you please explain to me how to create a tracking model that learns to track using video?

patrikhuber commented 8 years ago

@Thanh-Binh: Yea, more or less. As mentioned before, I didn't correct the rotation; I translate and scale the mean landmarks to best align them with the landmarks from the previous frame.

Regarding the training: Well, after what @genawass posted, I'm actually not sure anymore that training using a different initialisation/perturbation strategy for tracking (i.e. one that perturbs around the ground-truth) would be successful. Maybe initialising the current frame from the previous one using the mean landmarks is in fact the way to go. I think only a quantitative experiment, for example on 300-VW, would show which one works best!

What I mean by a different initialisation/perturbation strategy for tracking is the following: in L420, the x_0 would be the ground truth, and in L429-L432, we would create the x_0 by perturbing with a normal distribution around the ground-truth location, setting the sigma to the amount of movement we expect from frame to frame.
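
For illustration only (this is not the library's actual training code), such a perturbation could look like the sketch below, assuming the landmarks are stored as a 1 x 2n CV_32F row vector; `sigma_px` is a hypothetical parameter for the expected per-frame movement in pixels:

```cpp
#include <opencv2/core/core.hpp>
#include <random>

// Tracking-oriented perturbation sketch: instead of initialising x_0 from a
// mean shape aligned to a face box, jitter the ground-truth landmarks with
// Gaussian noise whose sigma matches the expected frame-to-frame motion.
cv::Mat perturb_around_ground_truth(const cv::Mat& ground_truth, float sigma_px,
                                    std::mt19937& engine)
{
    std::normal_distribution<float> noise(0.0f, sigma_px);
    cv::Mat x_0 = ground_truth.clone(); // assumed 1 x 2n, CV_32F
    for (int i = 0; i < x_0.cols; ++i) {
        x_0.at<float>(0, i) += noise(engine); // jitter each coordinate
    }
    return x_0;
}
```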

See also #19.

Thanh-Binh commented 8 years ago

To Genawass: I know Procrustes analysis, and I'm trying to understand how to calculate the initial shape for the next frame. As input, you have the mean shape and the shape of the previous frame. From these, you can calculate the translation and uniform scaling of the mean shape and of the previous frame's shape, and you can calculate the rotation between them. What do you do next? Use the rotation to calculate the initial shape from the mean shape? Could you please explain the details to me? Thanks.

genawass commented 8 years ago

The same as with still images, except that in the initialisation step you align the mean shape using a different strategy (based on the shape from the previous frame). The learning phase remains the same.

genawass commented 8 years ago

You need to use the transformation you've found via Procrustes to transform the mean shape as close as possible to the ground-truth/previous-frame shape. Then you need to crop, scale, and rotate the image (and the transformed mean shape accordingly) in order to normalise the input data as much as possible, since the descriptor is not invariant to large perturbations in scale/rotation.
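
A minimal sketch of that normalisation step, assuming OpenCV; `angle_deg` and `scale` would come from the Procrustes fit above, and all names are illustrative rather than part of any of the libraries discussed here:

```cpp
#include <opencv2/core/core.hpp>
#include <opencv2/imgproc/imgproc.hpp>
#include <vector>

// Rotate and scale the frame so the face is upright and at a canonical size
// before extracting descriptors; the shape is mapped with the same transform.
cv::Mat normalise_frame(const cv::Mat& frame, std::vector<cv::Point2f>& shape,
                        cv::Point2f face_centre, float angle_deg, float scale)
{
    // 2x3 affine matrix: rotate by angle_deg (counter-clockwise, in degrees)
    // about the face centre and rescale by 'scale':
    cv::Mat M = cv::getRotationMatrix2D(face_centre, angle_deg, scale);
    cv::Mat normalised;
    cv::warpAffine(frame, normalised, M, frame.size());
    // Map the (transformed mean) shape with the same affine matrix so that
    // landmarks and image stay consistent:
    cv::transform(shape, shape, M);
    return normalised;
}
```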

genawass commented 8 years ago

Patrik, it is not PCA-SIFT, since they use the default 128-dimensional descriptor per keypoint (8 orientation bins × 4×4 cells = 128). The trick is probably in the histogram calculation and the trilinear interpolation.
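
For reference, the trilinear interpolation in question is the soft-binning step of the standard SIFT descriptor: each gradient sample votes into the two nearest bins along each of the three histogram axes. A self-contained sketch (illustrative, not Intraface's actual code):

```cpp
#include <cmath>

// Trilinear soft-binning for a 4x4x8 SIFT-style histogram, flattened as
// hist[(row * 4 + col) * 8 + ori]. row/col/ori_bin are continuous bin
// coordinates (integer values sit at bin centres); each sample votes into the
// two nearest bins per axis, weighted by (1 - distance). The orientation axis
// wraps around; spatial bins outside 0..3 are dropped at the borders.
void trilinear_vote(float hist[128], float row, float col, float ori_bin,
                    float magnitude)
{
    const int r0 = static_cast<int>(std::floor(row));
    const int c0 = static_cast<int>(std::floor(col));
    const int o0 = static_cast<int>(std::floor(ori_bin));
    const float dr = row - r0, dc = col - c0, dor = ori_bin - o0;

    for (int i = 0; i < 2; ++i) {
        const int r = r0 + i;
        if (r < 0 || r > 3) continue;           // outside the 4x4 grid
        for (int j = 0; j < 2; ++j) {
            const int c = c0 + j;
            if (c < 0 || c > 3) continue;
            for (int k = 0; k < 2; ++k) {
                const int o = (o0 + k) % 8;     // orientation wraps
                const float w = (i ? dr : 1.0f - dr) * (j ? dc : 1.0f - dc)
                              * (k ? dor : 1.0f - dor);
                hist[(r * 4 + c) * 8 + o] += w * magnitude;
            }
        }
    }
}
```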

patrikhuber commented 8 years ago

@genawass: Hmm! I thought I had read something about it that suggested it was a kind of PCA-SIFT. It was probably just an assumption my colleagues and I made, then. Thanks for the hint!

genawass commented 8 years ago

Also, I don't think you can get all the extreme poses from still images that occur in video. A face detector can find the face for frontal head poses with slight rotations; for the tracker to deal with extreme rotations, you need to train it with extreme-pose data.