rizkiarm / LipNet

Keras implementation of 'LipNet: End-to-End Sentence-level Lipreading'
MIT License
643 stars · 229 forks

Videos seen by model each epoch #14

Open · michiyosony opened this issue 7 years ago

michiyosony commented 7 years ago

My (very possibly incorrect) understanding of an "epoch" is one pass over the training data in which the model is exposed to each item in the training set exactly once.

In trying to understand the system better, I created a very small training set composed of

s1/
    s1lbax4n
    s1swwp2s
    s1pwij3p
    s1bbaf2n
s2/
    s2lbax4n
    s2swwp2s
    s2pwij3p
    s2bbaf2n

and the corresponding .align files.

I modified unseen_speakers/train.py to train using the line

train(run_name, 0, 1, 3, 100, 50, 75, 32, 2)

so training would run for 1 epoch on a batch size of 2.
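For readers decoding the positional arguments: assuming the signature in training/unseen_speakers/train.py is `train(run_name, start_epoch, stop_epoch, img_c, img_w, img_h, frames_n, absolute_max_string_len, minibatch_size)` (worth verifying against your checkout), the call maps as follows:

```python
# Hypothetical mapping of the positional arguments in the train() call above.
# Names are taken from the assumed signature in train.py; verify locally.
args = dict(
    run_name="run_name",
    start_epoch=0,                # resume from epoch 0
    stop_epoch=1,                 # train for a single epoch
    img_c=3,                      # image channels (RGB)
    img_w=100,                    # frame width
    img_h=50,                     # frame height
    frames_n=75,                  # frames per video
    absolute_max_string_len=32,   # max label length for CTC
    minibatch_size=2,             # batch size of 2
)
```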

My output looks like this:

epoch is: 0
Epoch 0: Curriculum(train: True, sentence_length: -1, flip_probability: 0.5, jitter_probability: 0.05)
Train [0,1] 0:2
Epoch 1/1
epoch is: 0
Epoch 0: Curriculum(train: True, sentence_length: -1, flip_probability: 0.5, jitter_probability: 0.05)
Train [0,1] 2:4
In Curriculum.apply: NOT flipping video s2/s2swwp2s
In Curriculum.apply: NOT flipping video s1/s1lbax4n
In Curriculum.apply: NOT flipping video s1/s1swwp2s
In Curriculum.apply: flipping video s1/s1bbaf2n
Train [0,0] 4:6
Train [0,0] 6:8
In Curriculum.apply: flipping video s1/s1pwij3p
In Curriculum.apply: NOT flipping video s2/s2bbaf2n
In Curriculum.apply: NOT flipping video s2/s2lbax4n
In Curriculum.apply: NOT flipping video s2/s2pwij3p
Train [0,0] 0:2
Train [0,0] 2:4
In Curriculum.apply: flipping video s1/s1lbax4n
In Curriculum.apply: NOT flipping video s2/s2swwp2s
In Curriculum.apply: NOT flipping video s1/s1swwp2s
In Curriculum.apply: NOT flipping video s1/s1bbaf2n
Train [0,0] 4:6
Train [0,0] 6:8
In Curriculum.apply: NOT flipping video s1/s1pwij3p
In Curriculum.apply: flipping video s2/s2bbaf2n
In Curriculum.apply: flipping video s2/s2lbax4n
In Curriculum.apply: flipping video s2/s2pwij3p
1/4 [======>.......................] - ETA: 255s - loss: 191.3861Train [0,0] 0:2
In Curriculum.apply: flipping video s2/s2swwp2s
In Curriculum.apply: NOT flipping video s1/s1swwp2s

2/4 [==============>...............] - ETA: 168s - loss: 183.9747Train [0,0] 2:4
In Curriculum.apply: flipping video s1/s1lbax4n
In Curriculum.apply: flipping video s1/s1bbaf2n

3/4 [=====================>........] - ETA: 83s - loss: 180.0006 Train [0,0] 4:6
In Curriculum.apply: flipping video s1/s1pwij3p
In Curriculum.apply: NOT flipping video s2/s2lbax4n
epoch is: 0
Epoch 0: Curriculum(train: False, sentence_length: -1, flip_probability: 0.5, jitter_probability: 0.05)
epoch is: 0
Epoch 0: Curriculum(train: False, sentence_length: -1, flip_probability: 0.5, jitter_probability: 0.05)
epoch is: 0
Epoch 0: Curriculum(train: False, sentence_length: -1, flip_probability: 0.5, jitter_probability: 0.05)

[Epoch 0] Out of 256 samples: [CER: 30.250 - 1.440] [WER: 6.000 - 1.000] [BLEU: 0.325 - 0.325]

/Users/michiyosony/tensorflow/lib/python2.7/site-packages/nltk/translate/bleu_score.py:472: UserWarning: 
Corpus/Sentence contains 0 counts of 2-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().
  warnings.warn(_msg)

4/4 [==============================] - 1326s - loss: 173.2259 - val_loss: 145.0103

Process finished with exit code 0

Why does it appear that the model is exposed to 22 videos during the first epoch? From the paper, I would have expected 16 (the 8 training videos + 8 horizontally flipped training videos).

The 16 original videos loaded can be seen (organized) here (asterisks added):

In Curriculum.apply: flipping video s1/s1bbaf2n
In Curriculum.apply: flipping video s1/s1pwij3p
In Curriculum.apply: flipping video s1/s1lbax4n
**In Curriculum.apply: NOT flipping video s1/s1swwp2s**
**In Curriculum.apply: NOT flipping video s1/s1swwp2s**
In Curriculum.apply: NOT flipping video s1/s1bbaf2n
In Curriculum.apply: NOT flipping video s1/s1pwij3p
In Curriculum.apply: NOT flipping video s1/s1lbax4n

In Curriculum.apply: flipping video s2/s2bbaf2n
In Curriculum.apply: flipping video s2/s2lbax4n
In Curriculum.apply: flipping video s2/s2pwij3p
**In Curriculum.apply: NOT flipping video s2/s2swwp2s**
**In Curriculum.apply: NOT flipping video s2/s2swwp2s**
In Curriculum.apply: NOT flipping video s2/s2bbaf2n
In Curriculum.apply: NOT flipping video s2/s2lbax4n
In Curriculum.apply: NOT flipping video s2/s2pwij3p

In Curriculum.py I can see that each video has a 50% chance of being flipped horizontally. This looks like a slightly different implementation of "...we train on both the regular and the horizontally mirrored image sequence." (LipNet). Is there a motivation for leaving it to chance whether both a video and its mirror will be included (as opposed to the same video twice, as seen in the asterisked examples above)?
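For reference, the 50% flip amounts to something like the following sketch (not the repo's actual code; frames are assumed to be a NumPy array of shape `(T, H, W, C)`):

```python
import numpy as np

def maybe_flip(frames, flip_probability=0.5, rng=np.random):
    """Horizontally mirror an entire frame sequence with the given probability."""
    if rng.random_sample() < flip_probability:
        return frames[:, :, ::-1, :]  # reverse the width axis of every frame
    return frames
```

With `flip_probability=0.5`, each time a video is drawn it is independently either kept or mirrored, so a single epoch can contain the same orientation twice, as in the asterisked log lines.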

rizkiarm commented 7 years ago

The program uses multiprocessing to load the data, which is first aggregated in a queue before actually being consumed by the model. Therefore, counting the amount of data the model consumes in each epoch by counting hits on the generator's methods is not reliable. The program delegates model feeding to Keras, so there should be no problem. That said, I might be misreading the output you've shown, so please correct me if I'm wrong.
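As a toy illustration of the prefetching effect (not the actual Keras internals; the queue size and counts here are made up), the loader can hit the generator more times than the model consumes batches:

```python
import itertools
from queue import Queue

def batch_generator():
    for i in itertools.count():
        yield i  # each yield corresponds to one "Train ..." line in the log

hits = 0
q = Queue()
gen = batch_generator()

# The loader eagerly prefetches batches into the queue ahead of training...
for _ in range(11):  # e.g. 11 generator hits, as in the log above
    q.put(next(gen))
    hits += 1

# ...but the model only trains on steps_per_epoch batches from that queue.
steps_per_epoch = 4
consumed = [q.get() for _ in range(steps_per_epoch)]

print(hits, len(consumed))  # generator hits (11) exceed batches trained on (4)
```

This matches the log: 11 `Train` lines at batch size 2 account for the 22 videos that appear to be loaded, even though the model trains on only 4 steps.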

Yeah, it is a deviation from the paper. My reason for leaving it to chance is two-fold. The first is to ensure non-determinism in per-epoch training. The second is to avoid overfitting on the mirrored duplicates, which look almost identical because the GRID corpus contains only frontal views of the speakers. I might be wrong about this, so further investigation is needed.
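The two strategies can be contrasted in a small sketch (hypothetical helper names; frames as NumPy arrays of shape `(T, H, W, C)`):

```python
import numpy as np

def mirror(frames):
    return frames[:, :, ::-1, :]  # flip the width axis

# Paper's strategy: every video contributes itself AND its mirror each epoch,
# doubling the epoch deterministically.
def paper_epoch(videos):
    return [v for frames in videos for v in (frames, mirror(frames))]

# Repo's strategy: each time a video is loaded, flip it with probability 0.5.
# Over many epochs the model sees both versions, but any single epoch may
# contain the same orientation twice.
def curriculum_epoch(videos, rng, flip_probability=0.5):
    return [mirror(f) if rng.random_sample() < flip_probability else f
            for f in videos]
```

The paper's variant guarantees balanced coverage per epoch; the repo's variant keeps the epoch size fixed and injects randomness instead.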