karanvivekbhargava / obamanet

ObamaNet : Photo-realistic lip-sync from audio (Unofficial port)
MIT License

After training, audio-to-kp LSTM predicting identical "open-mouth" keypoints #10

Open soubhikbarari opened 5 years ago

soubhikbarari commented 5 years ago

Hi @karanvivekbhargava -- thanks for a really great implementation of the ObamaNet framework; this has been a real joy to work with. I'm wondering if you or any others have run into any snafus when training the audio-to-keypoint LSTM you've implemented in train.py. After training it for about 50 epochs, the LSTM is only predicting the same "open mouth" keypoint vector for every audio timestep, like so:

[Screenshot, 2019-01-22: predicted keypoints rendering the same "open mouth" pose at every timestep]

Some more details if they're useful:

TL;DR: Does this seem like it might just be an issue of not training for enough epochs? Or might it be a bug, e.g. the audio timesteps not being broken up correctly in predict.py, or the PCA upsampling not working well for predicted keypoints?

I'd appreciate any quick insights or hunches on this, or if folks could simply verify that they've gotten replicable results using the exact code in this repo. Thanks so much!
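(One quick way to confirm the collapse described above, independent of the rendering step, is to check the variance of the predicted keypoint vectors across timesteps. A minimal numpy sketch; the arrays and the `is_collapsed` helper here are stand-ins, not code from this repo:)

```python
import numpy as np

def is_collapsed(preds, tol=1e-6):
    """Return True if the model predicts (nearly) the same
    keypoint vector at every timestep."""
    preds = np.asarray(preds)
    # Per-dimension variance across timesteps; all ~0 means a frozen pose.
    return bool(np.all(preds.var(axis=0) < tol))

# A healthy model's output varies over time...
moving = np.stack([np.sin(np.linspace(0, 3, 8) + t) for t in range(50)])
# ...while a collapsed one repeats one "open mouth" vector.
frozen = np.tile(np.ones(8), (50, 1))

print(is_collapsed(moving))  # False
print(is_collapsed(frozen))  # True
```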

karanvivekbhargava commented 5 years ago

Hey @soubhikbarari, apologies for getting back to you late. I'm glad you liked the work. As for your question, I initially ran into some errors with the LSTMs too. I went through a rough patch figuring out the problems you're facing; it turned out that, in my case, the LSTMs were not connected temporally. This isn't very well documented in Keras either. This particular architecture, a time-delayed LSTM, was a bit difficult to figure out. After trying the various Keras LSTM options for a couple of weeks, I finally got it to work decently. I know this isn't a satisfactory explanation, but that's all I've got.
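(For readers hitting the same wall: the "time delay" in the ObamaNet setup means the keypoint targets are shifted a fixed number of frames after the audio, so the network can use future audio context. A minimal numpy sketch of that alignment; the delay value and array shapes are hypothetical:)

```python
import numpy as np

def delay_targets(features, targets, delay):
    """Align audio features at time t with keypoints at time t + delay."""
    if delay == 0:
        return features, targets
    return features[:-delay], targets[delay:]

audio = np.arange(20).reshape(10, 2)  # 10 frames of (toy) audio features
kps = np.arange(10).reshape(10, 1)    # 10 frames of (toy) keypoints
X, y = delay_targets(audio, kps, delay=2)
print(X.shape, y.shape)               # (8, 2) (8, 1)
print(y[0])                           # [2] -> keypoints from 2 frames later
```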

I'll wait on your reply before closing this. Cheers!

soubhikbarari commented 5 years ago

That is great to hear! Could you share what your model architecture ended up being? Mine looks like:

from keras.models import Sequential
from keras.layers import LSTM, Dense, Dropout

model = Sequential()
model.add(LSTM(25, input_shape=(look_back, 26)))  # look_back audio frames, 26 features each
model.add(Dropout(0.25))
model.add(Dense(8))  # 8-dim (PCA-reduced) keypoint output
model.compile(loss='mean_squared_error', optimizer='adam')
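(For context, the `(look_back, 26)` input shape implies sliding windows over per-frame audio features. A minimal numpy sketch of how such windows might be built; the shapes and the `make_windows` helper are illustrative, not code from this repo:)

```python
import numpy as np

def make_windows(features, targets, look_back):
    """Pair each keypoint target with the preceding look_back
    audio feature frames (a simple time-delay windowing)."""
    X, y = [], []
    for t in range(look_back, len(features)):
        X.append(features[t - look_back:t])  # (look_back, 26)
        y.append(targets[t])                 # (8,)
    return np.array(X), np.array(y)

audio = np.random.rand(100, 26)  # 100 frames of 26-dim audio features
kps = np.random.rand(100, 8)     # matching 8-dim PCA keypoint targets
X, y = make_windows(audio, kps, look_back=25)
print(X.shape, y.shape)          # (75, 25, 26) (75, 8)
```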
karanvivekbhargava commented 5 years ago

The model you've mentioned is the exact, unaltered model in the train.py file (lines 97 to 102) of this repository. This was the final model I trained.

soubhikbarari commented 5 years ago

Ok, then it is likely not the model architecture itself (or look-back issues) that is causing my problem. This is helpful. Thank you!

makai281 commented 5 years ago

> Ok, then it is likely not the model architecture itself (or look-back issues) that is causing my problem. This is helpful. Thank you!

I have the same problem as you! The LSTM is predicting the open mouth for every frame, even on the training set. Have you found a solution yet?

karanvivekbhargava commented 5 years ago

@makai281 Was this the same model as what @soubhikbarari mentioned?

makai281 commented 5 years ago

> @makai281 Was this the same model as what @soubhikbarari mentioned?

Yes, and the preprocessing is the same as what you shared. But the default pre-trained LSTM model works well, so I think the data preprocessing is unlikely to be the problem.

wanshun123 commented 5 years ago

I have this issue too, though somehow if I train with only one video (n_videos = 1 in train.py) the result is a model that predicts different mouth keypoints; the same goes for n_videos = 5, though with less variation than with only one video. Any idea what's going on here? Since training also uses trimmed videos only a few seconds in length, I'm also a bit confused about how to use the full dataset: if I set n_videos to 1970 (the total number of trimmed videos in my dataset, taken from 50 full weekly-address videos), my computer freezes completely. With what I've trained so far, I've only been able to use a tiny fraction of the processed dataset.
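(One generic way around the memory blow-up with a large n_videos is to stream batches one video at a time with a Python generator, instead of materialising the full dataset. This is a sketch, not code from this repo; `load_features` stands in for whatever the preprocessing produces per trimmed video:)

```python
import numpy as np

def batch_generator(video_ids, load_features, batch_size=32):
    """Yield (X, y) batches one video at a time, so peak memory
    depends on a single video's length, not on n_videos."""
    while True:
        for vid in video_ids:
            audio, kps = load_features(vid)
            for i in range(0, len(audio), batch_size):
                yield audio[i:i + batch_size], kps[i:i + batch_size]

# Toy loader standing in for the repo's preprocessing output.
def fake_load(vid):
    n = 100
    return np.random.rand(n, 26), np.random.rand(n, 8)

gen = batch_generator(range(3), fake_load, batch_size=32)
X, y = next(gen)
print(X.shape, y.shape)  # (32, 26) (32, 8)
```

With Keras, such a generator can be handed to the model's fit routine with an appropriate steps-per-epoch count, so the full 1970-video dataset never has to sit in RAM at once.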

karanvivekbhargava commented 5 years ago

@wanshun123 I'll be working on splitting the different parts of this project into their own modules and adding them as submodules. Each will get its own training data, and I hope things will be much clearer. As for the training: I won't lie, the model was difficult to train. I tried a lot of different variations, and this particular model seemed to work best (relatively). I'm open to other model suggestions to add to this repository.

liuyingbin123 commented 5 years ago

I get the same issue. Every frame gets the static mouth shape. How should I train the time-delay LSTM? Thank you very much.

avilash commented 4 years ago

Hi @soubhikbarari - Any resolution on this?

getmlcode commented 3 years ago

I resolved the issue discussed here in August 2020 for a freelancing project I had undertaken. I can't share the exact solution due to an agreement I signed, but the hint is to think carefully about the bias-variance tradeoff.
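(Reading that hint generically: a collapsed "mean pose" output is the classic symptom of a high-bias fit, since the network can minimise MSE by predicting roughly the average mouth. One quick check is to compare the model's error against a baseline that always predicts the per-dimension mean. A numpy sketch with stand-in arrays, not results from this repo:)

```python
import numpy as np

np.random.seed(0)

def mse(a, b):
    return float(np.mean((a - b) ** 2))

# Stand-ins for training keypoints and a model's predictions.
targets = np.random.rand(200, 8)
preds = np.random.rand(200, 8)

# Baseline: always predict the per-dimension mean pose.
baseline = np.tile(targets.mean(axis=0), (len(targets), 1))

model_err = mse(preds, targets)
mean_err = mse(baseline, targets)

# If the trained model's MSE is no better than the mean-pose baseline,
# it has effectively learned the average mouth: an underfit (high-bias)
# model, suggesting less regularisation or more capacity, not more epochs.
print(model_err > mean_err)  # True for these random "predictions"
```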

Hujiazeng commented 1 year ago

> I resolved the issue discussed here in August 2020 for a freelancing project I had undertaken. I can't share the exact solution due to an agreement I signed, but the hint is to think carefully about bias and variance.

Have you left that job yet? Talk is cheap, show me your code.