According to the paper, the cropped faces from each subject's video are taken as input to the video network. My question concerns the preprocessing pipeline: which method or tool do you use to crop the faces? How are the faces aligned? And how do you handle frames where facial landmark detection fails?
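To make the last question concrete, here is a sketch of the kind of fallback strategy I am asking about. It is not taken from the paper or the repo; `detect` is a hypothetical stand-in for a real detector (e.g. dlib or MTCNN), and frames are stood in by `(height, width)` tuples:

```python
# Hypothetical sketch of per-frame face cropping with a fallback when
# detection fails; NOT the paper's actual pipeline.

def center_box(shape, frac=0.5):
    """Fallback crop: a centered box covering `frac` of each dimension."""
    h, w = shape
    bw, bh = int(w * frac), int(h * frac)
    return ((w - bw) // 2, (h - bh) // 2, bw, bh)

def face_boxes(frames, detect):
    """Return one (x, y, w, h) box per frame.

    `detect` is a hypothetical detector returning a box or None on
    failure. On failure we reuse the last successful box; if detection
    never succeeded yet, we fall back to a center crop.
    """
    boxes, last = [], None
    for shape in frames:
        box = detect(shape)
        if box is None:
            box = last if last is not None else center_box(shape)
        else:
            last = box
        boxes.append(box)
    return boxes
```

Is the fallback in your pipeline something like this (reusing the last detected box), or do you drop the failed frames entirely?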
In data_provider.py, a sample is composed of the [frame, audio_sample, label, subject_id] tensors, but in data_generator.py there is no code that computes the frame tensor.