rizkiarm / LipNet

Keras implementation of 'LipNet: End-to-End Sentence-level Lipreading'
MIT License

ValueError: Dimension 0 in both shapes must be equal occurs when using predict method on various images #17

Open AshwinPathi opened 7 years ago

AshwinPathi commented 7 years ago

For some videos in the GRID dataset, running ./predict with any weight file gives me an error. One such file is "lrarzn.mpg", which is in the s1 directory of the GRID dataset; many more files trigger the same error.

ValueError: Dimension 0 in both shapes must be equal, but are 38016 and 1728 for 'Assign_18' (op: 'Assign') with input shapes: [38016,768], [1728,768].

The "Weights" folder is a folder i created in the LipNet root directory for the sake of convenience.

[screenshot from 2017-08-23 19-29-25]

I also encountered this error when processing files in bulk. The "custom_evaluation" method is a wrapper I wrote around the method in predict.py to make bulk evaluation easier; it should not affect the actual mechanics of the code in any way. [screenshot from 2017-08-23 20-19-42]
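One way I keep a bulk run going despite these failures is to catch the exception and record the offending file. A minimal sketch, assuming predict.py exposes a predict(weight_path, video_path) function (I'm guessing at the exact signature) and using hypothetical weight/dataset paths:

```python
import glob
import traceback

from predict import predict  # assuming predict.py exposes predict(weight_path, video_path)

WEIGHTS = 'Weights/overlapped-weights368.h5'  # hypothetical weight file

def bulk_evaluate(video_glob):
    """Run predict on every matching video, skipping files that fail to load."""
    failed = []
    for path in sorted(glob.glob(video_glob)):
        try:
            print(path, '->', predict(WEIGHTS, path))
        except ValueError:
            # Shape mismatches like the one above usually mean the video
            # decoded to unexpected dimensions (empty or corrupt file).
            failed.append(path)
            traceback.print_exc()
    print('Skipped %d bad videos: %s' % (len(failed), failed))

bulk_evaluate('GRID/s1/*.mpg')  # hypothetical dataset layout
```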

michiyosony commented 7 years ago

If you watch the video lrarzn.mpg, it starts with a grey screen. In the LipNet paper, they write "The videos for speaker 21 are missing, and a few others are empty or corrupt, leaving 32746 usable videos." I would guess that you found one of the corrupt videos and that this implementation of predict.py doesn't handle them gracefully.

(When I was running extract_mouth_batch.py, the script was unable to process videos with these grey screens as well as videos where the speaker covered their mouth. I wound up removing those videos from the data set.)
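If it helps, a rough way to flag these clips before they reach predict.py is to look for sampled frames with almost no pixel variance, since grey or black screens are nearly uniform. A sketch, assuming OpenCV; the variance threshold and sampling stride are hand-picked guesses:

```python
import cv2
import numpy as np

def looks_blank(video_path, var_threshold=50.0, sample_every=5):
    """Flag a video if any sampled frame is near-uniform (grey/black screen)."""
    cap = cv2.VideoCapture(video_path)
    variances = []
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % sample_every == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            variances.append(float(np.var(gray)))
        idx += 1
    cap.release()
    if not variances:
        return True  # could not decode a single frame
    return min(variances) < var_threshold

print(looks_blank('GRID/s1/lrarzn.mpg'))  # should flag the grey clip
```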

AshwinPathi commented 6 years ago

I am also planning to retrain the model on a different dataset (unseen split) to test some new parameters. Does training run into this issue as well, or do I have to manually eliminate those videos first? Also, could you reiterate the process for training the unseen split? The one in the readme does not make much sense to me. For instance, what format should the "align" folder be in? Should every single align go in one folder, or should they be split into labeled s1, s2, s3, etc. subfolders?

rizkiarm commented 6 years ago

Some of the videos in GRID are corrupted. Feeding one of those corrupt videos will make ./predict fail, as it doesn't do any video integrity or compatibility checking.

Training is not affected by this because the generator enumerates and checks the videos beforehand.
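That pre-check can be replicated before prediction too: enumerate the videos, fully decode each one, and keep only those with the expected frame count (GRID clips are 3 seconds at 25 fps, i.e. 75 frames). A minimal sketch, assuming OpenCV and a hypothetical dataset path:

```python
import cv2
import glob

EXPECTED_FRAMES = 75  # GRID clips: 3 seconds at 25 fps

def usable_videos(pattern):
    """Yield only videos that decode cleanly to the expected frame count,
    mimicking the integrity check the training generator performs."""
    for path in sorted(glob.glob(pattern)):
        cap = cv2.VideoCapture(path)
        frames = 0
        while cap.read()[0]:  # decode every frame; a cheap integrity check
            frames += 1
        cap.release()
        if frames == EXPECTED_FRAMES:
            yield path
        else:
            print('skipping %s (%d frames)' % (path, frames))

for video in usable_videos('GRID/s1/*.mpg'):  # hypothetical layout
    pass  # these should be safe to feed to ./predict
```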

As for the aligns, just drop every align file into that folder, without any per-speaker subfolders.
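In other words, something like this (hypothetical file names):

```
align/
    bbaf2n.align
    bbal6n.align
    swwp2s.align
    ...
```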

bengfarrell commented 6 years ago

I was hoping to add to this issue with a general question about this project. I'm new to ML, but I have an idea for a project and lip reading would be awesome for it. I struggled for a bit to get this installed but finally did, and the sample GRID videos work great. The problem comes when I try to use my own videos: predict fails with the error listed in this thread, and the extraction methods extract the entire frame (they DO isolate the mouth correctly for the GRID videos).

To take extraction out of the equation, I used https://github.com/astorfi/lip-reading-deeplearning to do the mouth extraction and then fed its output folders (series of PNGs) to LipNet. As before, this works great on a GRID video, but on my own videos I get the above error.

This leads me to believe that while this implementation is a pretty awesome demo, it isn't going to work out of the box on arbitrary video. Is that a fair statement, or am I missing something? I haven't touched the training part of this project, and as a newcomer I doubt I can train a better model than you, or even know what it would mean to pre-train against a video or a series of videos. Again, if I'm missing something, please let me know; otherwise, great work, even if it's not quite of general use to folks like myself!
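In case it's useful to anyone hitting the same wall: as far as I can tell the mouth extraction here is landmark-based, and on non-GRID footage the face detector can simply fail to fire, which would explain the full-frame output. A standalone sketch of a landmark-based mouth crop, assuming dlib and its standard 68-point model (shape_predictor_68_face_landmarks.dat, downloaded separately); the 100x50 output size and the padding factor are my guesses at what this repo expects:

```python
import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor('shape_predictor_68_face_landmarks.dat')

def crop_mouth(frame, out_size=(100, 50), pad=0.25):
    """Crop the mouth using the 68-point landmarks (points 48-67 are the lips)."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = detector(gray)
    if not faces:
        return None  # this is where generic videos often fail silently
    shape = predictor(gray, faces[0])
    pts = np.array([(shape.part(i).x, shape.part(i).y) for i in range(48, 68)],
                   dtype=np.int32)
    x, y, w, h = cv2.boundingRect(pts)
    dx, dy = int(w * pad), int(h * pad)
    crop = frame[max(y - dy, 0):y + h + dy, max(x - dx, 0):x + w + dx]
    return cv2.resize(crop, out_size)
```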

chahatagarwal commented 4 years ago

> This leads me to believe that while this implementation is a pretty awesome demo, it isn't going to work out of the box on arbitrary video.

@bengfarrell Did you ever achieve this outcome for a generic video?