rizkiarm / LipNet

Keras implementation of 'LipNet: End-to-End Sentence-level Lipreading'
MIT License

Issue with the preprocessing #39

Open DinoMan opened 6 years ago

DinoMan commented 6 years ago

Hi, I attempted to use your code on the GRID dataset. In my case, I had processed the frames to be a tighter fit to the face, and I saw far worse performance from the method. I traced it back to the way you crop the mouth regions. It seems that you want to pad the mouth by 38% of its width (19% on each side). However, I believe the way you have done this is wrong: you take the x coordinates of the mouth edges, multiply them by 0.81 and 1.19 respectively, and then calculate the width in order to normalize. The result depends on where the mouth is located in the image. For example, if the mouth edges are at x_left = 100 and x_right = 200, the padded width comes out as 200 × 1.19 − 100 × 0.81 = 157. Now assume the same mouth sits at a different position, x_left = 1100 and x_right = 1200. Then we get 1200 × 1.19 − 1100 × 0.81 = 537, which is drastically different even though the mouth has the same size. What you actually want is to compute width = x_right − x_left, take 19% of it (i.e. 0.19 × width), and then subtract that from the left edge and add it to the right edge.
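To make the difference concrete, here is a minimal sketch of the two approaches. The function names and the `PAD` constant are hypothetical and chosen for illustration; this is not the repository's actual code, only the arithmetic described above.

```python
PAD = 0.19  # assumed padding fraction per side, as described above


def padded_width_scaled(x_left, x_right, pad=PAD):
    """Position-dependent version: scales the absolute edge coordinates."""
    return x_right * (1 + pad) - x_left * (1 - pad)


def padded_bounds_relative(x_left, x_right, pad=PAD):
    """Position-independent version: pads by a fraction of the mouth width."""
    width = x_right - x_left
    margin = pad * width
    return x_left - margin, x_right + margin


# Same 100 px wide mouth at two different positions in the image.
for x_left, x_right in [(100, 200), (1100, 1200)]:
    scaled = padded_width_scaled(x_left, x_right)
    left, right = padded_bounds_relative(x_left, x_right)
    print(x_left, x_right, round(scaled), round(right - left))
# prints:
# 100 200 157 138
# 1100 1200 537 138
```

With the width-relative padding, the padded width stays at 138 px wherever the mouth is, so the normalization no longer depends on the mouth's position.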

In your case the mouths are around x = 200 to 400, which leads to quite a lot of padding being taken. If you don't want this, you might have to retrain with the new way of cropping the mouth. Also, anyone who uses a different cropping of the GRID dataset will not be able to use your code out of the box.

rizkiarm commented 6 years ago

Hi, thanks for pointing out the flaw. I think I haven't run into any problems with the current implementation because the mouth location only changes slightly between frames (as it is relative to the face, and it is cropped deterministically).

It would be nice if you can make a pull request for this bug ;)