rizkiarm / LipNet

Keras implementation of 'LipNet: End-to-End Sentence-level Lipreading'
MIT License

Preprocessing videos: paper vs this implementation #15

Open · michiyosony opened this issue 7 years ago

michiyosony commented 7 years ago

I'm reading through the LipNet paper and trying to determine whether we're doing the preprocessing described there.

The things that stood out to me were:

  1. “we train on both the regular and the horizontally mirrored image sequence” The implementation seems a bit different; see the last question here. (A rough sketch of the mirroring idea is included after this list.)

  2. “We augment the sentence-level training data with video clips of individual words as additional training instances. These instances have a decay rate of 0.925” This doesn't appear to be in place for non-curriculum training, since 'sentence_length' is -1 in unseen_speakers/train.py and overlapped_speakers/train.py, but modifying that value gives us the ability to train on different sentence lengths (though currently each epoch would contain sentences of a single length?)

  3. “To encourage resilience to varying motion speeds by deletion and duplication of frames, this is performed with a per-frame probability of 0.05” This looks done, with the deletion/duplication code in videos.temporal_jitter()! (A rough sketch of the idea is included after this list.)

  4. “We standardize the RGB channels over the whole training set to have zero mean and unit variance” I found the line X_data = np.array(X_data).astype(np.float32) / 255 # Normalize image data to [0,1], TODO: mean normalization over training data in generators.py, so it sounds like this still needs to be done :) (A sketch of what the standardization could look like is also included after this list.)
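
For (1), as I understand it, the mirroring augmentation just means flipping every frame along its width axis and training on both sequences. A minimal sketch of the idea, assuming frames come as a (T, H, W, C) NumPy array (augment_with_mirror is an illustrative name, not something in this repo):

```python
import numpy as np

def augment_with_mirror(frames):
    """Return the original and the horizontally mirrored frame sequence.

    Assumes `frames` has shape (T, H, W, C); the shapes used in this repo
    may differ, so treat this as an illustration only.
    """
    mirrored = frames[:, :, ::-1, :]  # flip each frame along the width axis
    return [frames, mirrored]
```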
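
For (3), this is roughly how I picture the per-frame deletion/duplication; videos.temporal_jitter() is the authoritative version, so this is only a sketch:

```python
import numpy as np

def temporal_jitter(frames, p=0.05):
    """Delete or duplicate each frame independently with probability p."""
    out = []
    for frame in frames:
        r = np.random.random()
        if r < p:           # delete this frame
            continue
        out.append(frame)
        if r > 1.0 - p:     # duplicate this frame
            out.append(frame)
    return np.array(out)
```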
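
For (4), the change would be replacing the fixed /255 scaling with per-channel statistics computed once over the training set. A rough sketch with hypothetical names (only the /255 line comes from generators.py):

```python
import numpy as np

def channel_statistics(train_videos):
    """Per-channel mean and std over the whole training set.

    `train_videos` is assumed to be an iterable of (T, H, W, 3) arrays;
    in practice a running sum would avoid holding all pixels in memory.
    """
    pixels = np.concatenate([v.reshape(-1, 3).astype(np.float64) for v in train_videos])
    return pixels.mean(axis=0), pixels.std(axis=0)

def standardize(frames, mean, std):
    """Zero-mean, unit-variance standardization of the RGB channels."""
    return (frames.astype(np.float32) - mean) / std

# Current behaviour in generators.py is simply:
# X_data = np.array(X_data).astype(np.float32) / 255
```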

Is this an accurate description of the state of the project?

rizkiarm commented 7 years ago

Yeah, all of that is correct.

(2) I have asked the paper's author to clarify the meaning of the "decay rate", but haven't gotten an answer yet. The current implementation allows us to tweak the sentence length arbitrarily, so as to allow for more general variations in training. You can use curriculum learning to define how that should behave.
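
Just to illustrate one possible reading of that "decay rate": the probability of training on a word-level clip instead of the full sentence could decay by a factor of 0.925 per epoch. This is only a guess at the paper's meaning, and none of the names below exist in this repo:

```python
import numpy as np

def word_clip_probability(epoch, decay_rate=0.925):
    """One guess: the chance of adding a word-level instance decays per epoch."""
    return decay_rate ** epoch

def sample_sentence_length(epoch):
    """Pick a length for this sample: 1 for a single-word clip, -1 for the
    whole sentence (mirroring how sentence_length=-1 is used in train.py)."""
    if np.random.random() < word_clip_probability(epoch):
        return 1
    return -1
```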

(4) This one should be very simple to implement. I left the implementation that way to minimize the amount of computation that needs to be performed before each experiment. I also personally don't think it is required, since the data is effectively normalized per batch by Batch Normalization anyway.
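
That said, if someone wanted the paper's standardization without recomputing the statistics before every experiment, the per-channel mean and std could be computed once and cached; a sketch with made-up names and file path:

```python
import os
import numpy as np

STATS_FILE = 'train_channel_stats.npz'  # hypothetical cache location

def load_or_compute_stats(train_videos):
    """Compute per-channel mean/std once and reuse them across experiments."""
    if os.path.exists(STATS_FILE):
        stats = np.load(STATS_FILE)
        return stats['mean'], stats['std']
    pixels = np.concatenate([v.reshape(-1, 3).astype(np.float64) for v in train_videos])
    mean, std = pixels.mean(axis=0), pixels.std(axis=0)
    np.savez(STATS_FILE, mean=mean, std=std)
    return mean, std
```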