Preprocessing videos: paper vs this implementation

I'm reading through the LipNet paper and trying to determine whether we're doing the preprocessing described there.

The things that stood out to me were:

“we train on both the regular and the horizontally mirrored image sequence” The implementation seems a bit different--see the last question here.
“We augment the sentence-level training data with video clips of individual words as additional training instances. These instances have a decay rate of 0.925” This looks like it isn't currently in place in the non-curriculum training, since 'sentence_length' is -1 in unseen_speakers/train.py and overlapped_speakers/train.py, but by modifying this value we have the capability to train on different sentence lengths (though currently each epoch would have sentences of all the same length?)
“To encourage resilience to varying motion speeds by deletion and duplication of frames, this is performed with a per-frame probability of 0.05” This looks done, with the deletion/duplication code in in videos.temporal_jitter()!
“We standardize the RGB channels over the whole training set to have zero mean and unit variance” I found the line X_data = np.array(X_data).astype(np.float32) / 255 # Normalize image data to [0,1], TODO: mean normalization over training data in generators.py, sounds like it needs to be done :)

Is this an accurate description of the state of the project?

rizkiarm / LipNet