mpc001 / end-to-end-lipreading

Pytorch code for End-to-End Audiovisual Speech Recognition

training audiovisual net with and without pretrained models #27

Open msanchez-fi opened 4 years ago

msanchez-fi commented 4 years ago

Hello, I have some doubts about the process of training the audiovisual model. Currently, I am following the steps indicated in the README, going from temporalConv to backendGRU and later finetuneGRU to train the whole network. I have some questions about the process:

Training with pre-trained models:

  1. When training with the pre-trained models, the net gets stuck at this line: `inputs = torch.cat((audio_outputs, video_outputs), dim=2)`

    It requires 3D tensors to concatenate them, but the "video_outputs" and "audio_outputs" I get are 2D tensors such as [B, 500]. How should these 3D tensors for audio and video look? Is there code missing, or does something need to transform them into the shape the net requires?

    These are the tensor shapes I get through the conv backend. The shapes after backend conv1 are not the same for audio and video; should they be equal? (See the shape check after this list.)

    - initial audio tensor: [B, 19456]
    - audio after backend conv1: [B, 2048, 1]
    - audio after backend conv2 (final): [B, 500]
    - initial video tensor: [B, 1, 29, 96, 96]
    - video after backend conv1: [B, 1024, 1]
    - video after backend conv2 (final): [B, 500]

  2. The concat pre-trained model is actually 3 files, named _a.pt, _b.pt, and _v.pt. Should all three be merged and used as an input, along with the audio and video models, to start the training with the temporal convolutional backend? Or which one should I consider first?
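
For reference, here is a minimal shape check that reproduces the error from question 1; the sizes are placeholders, not the network's actual dimensions, and it only shows why `dim=2` fails on 2D outputs:

```python
import torch

B, T, D = 8, 29, 512  # placeholder batch size, time steps, feature dim

# If both streams return per-timestep features, the tensors are 3D
# ([B, T, D]) and the concatenation along dim=2 works:
audio_outputs = torch.randn(B, T, D)
video_outputs = torch.randn(B, T, D)
inputs = torch.cat((audio_outputs, video_outputs), dim=2)  # [B, T, 2*D]

# With already-pooled 2D outputs like the [B, 500] I get, dim=2 does not
# exist; the cat would have to be along dim=1 (or a time dimension would
# have to be added back with unsqueeze):
audio_pooled = torch.randn(B, 500)
video_pooled = torch.randn(B, 500)
fused_pooled = torch.cat((audio_pooled, video_pooled), dim=1)  # [B, 1000]
```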

Training from scratch:

  1. I started training from scratch without the pre-trained models, but only the audiovisual net, because this is the part I am interested in. Is that the correct approach, or should I also train the audio-only and video-only models from scratch first?

  2. If I also train the audio-only and video-only models, should I use the .pt files from the last phase ("finetuneGRU") of the audio and video nets as inputs for training the audiovisual net? Besides that, how could I get concat_model.pt? And how should the temporalConv training for the audiovisual net look with these inputs (audio_model.pt, video_model.pt & concat_model.pt)? (A rough loading/freezing sketch follows this list.)

  3. The README's step ii mentions "Throw away the temporal convolutional backend, freeze the parameters of the frontend and the ResNet and train the LSTM backend". Is this not already specified somewhere in the code?
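
To make questions 2 and 3 concrete, this is roughly how I imagine loading the single-stream checkpoints and doing the "freeze the frontend and the ResNet" step; the class, attribute names, file names, and learning rate below are placeholders, not the repo's actual ones:

```python
import torch
import torch.nn as nn

# Placeholder stand-in for the audiovisual net; the real module and its
# attribute names in this repo will differ.
class DummyAudioVisual(nn.Module):
    def __init__(self):
        super().__init__()
        self.frontend = nn.Conv1d(1, 64, kernel_size=80, stride=4)  # stands in for the frontend
        self.resnet = nn.Linear(64, 256)                            # stands in for the ResNet trunk
        self.gru_backend = nn.GRU(512, 1024, num_layers=2,
                                  batch_first=True, bidirectional=True)

model = DummyAudioVisual()

# Loading the weights saved after the finetuneGRU phase of the
# single-stream training would look roughly like this (the paths are
# placeholders, so the lines are left commented out):
# model.load_state_dict(torch.load('audio_model.pt'), strict=False)
# model.load_state_dict(torch.load('video_model.pt'), strict=False)

# Step ii of the README: freeze the frontend and the ResNet, and train
# only the recurrent backend on top.
for name, param in model.named_parameters():
    if name.startswith(('frontend', 'resnet')):
        param.requires_grad = False

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=3e-4)  # example learning rate only
```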

mpc001 commented 4 years ago

Hi, thanks for your interest,

We trained the audiovisual models in two steps: 1) loading the pre-trained audio and visual modules, freezing both and training the top few layers; 2) fine-tuning all layers with a smaller learning rate of 1e-4. You can set the mode to backendGRU and then finetuneGRU to run both steps.

The bottom layers are initialized with the weights from the audio-only and visual-only training. If the audiovisual model is trained from scratch, the result is lower.
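
A minimal sketch of that two-step schedule, assuming placeholder module names and sizes (only the 1e-4 fine-tuning rate comes from the description above):

```python
import torch
import torch.nn as nn

# Toy stand-in for the audiovisual net: two pre-trained streams plus a
# trainable fusion head. Names and sizes are placeholders.
model = nn.ModuleDict({
    'audio_stream': nn.Linear(256, 512),
    'video_stream': nn.Linear(256, 512),
    'fusion_gru':   nn.GRU(1024, 1024, num_layers=2,
                           batch_first=True, bidirectional=True),
})

# Step 1 (backendGRU): freeze the pre-trained streams, train the top layers.
for stream in ('audio_stream', 'video_stream'):
    for p in model[stream].parameters():
        p.requires_grad = False
top_params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(top_params, lr=3e-4)  # step-1 rate is a placeholder

# ... train to convergence ...

# Step 2 (finetuneGRU): unfreeze everything and fine-tune at 1e-4.
for p in model.parameters():
    p.requires_grad = True
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
```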

jiarouk commented 3 years ago

Hello, same question here: the concat pre-trained model is actually 3 files, named _a.pt, _b.pt, and _v.pt. Should all three be merged and used as an input, along with the audio and video models, to start the training with the temporal convolutional backend? Or which one should I consider first? Also, should we train the network with the temporal conv at all? In the paper you mention that the first step for audiovisual is that "another 2-layer BGRU is added on top of all streams in order to fuse the single stream outputs." So should the audiovisual training start from step 2?
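
For clarity, this is what I understand that fusion step to mean; all sizes below are placeholders, and the 500 output classes just match the [B, 500] shapes mentioned above:

```python
import torch
import torch.nn as nn

B, T, D = 8, 29, 512                 # placeholder batch, time steps, per-stream feature size
audio_feats = torch.randn(B, T, D)   # per-timestep audio stream output
video_feats = torch.randn(B, T, D)   # per-timestep video stream output

# "Another 2-layer BGRU ... on top of all streams to fuse the single
# stream outputs":
fusion_bgru = nn.GRU(input_size=2 * D, hidden_size=1024, num_layers=2,
                     batch_first=True, bidirectional=True)
classifier = nn.Linear(2 * 1024, 500)

fused, _ = fusion_bgru(torch.cat((audio_feats, video_feats), dim=2))  # [B, T, 2048]
logits = classifier(fused.mean(dim=1))  # temporal average just for illustration -> [B, 500]
```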