Currently we select the first 16 frames of the video and run face detection against those.
We do this because there's no reason to believe that any portion of the video (beginning, middle, end, etc.) is more or less likely than another to contain strong evidence of a deep fake. Given that, we use the beginning of the video, because those frames are much quicker to decode than frames further into the video.
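As a rough illustration of why the start of the video is cheap, here is a minimal sketch that assumes frames arrive from a lazy, sequential decoder. The names `first_frames` and `fake_decoder` are hypothetical and not part of our pipeline; `fake_decoder` only stands in for a real video reader so the example can run on its own.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

Frame = TypeVar("Frame")

NUM_FRAMES = 16  # number of leading frames we currently sample


def first_frames(frames: Iterable[Frame], count: int = NUM_FRAMES) -> List[Frame]:
    """Return up to `count` frames from the start of a lazy frame stream.

    islice stops pulling from the stream once `count` frames have been
    produced, so nothing past that point is decoded.
    """
    return list(islice(frames, count))


def fake_decoder(total: int, decoded_log: List[int]) -> Iterator[int]:
    """Hypothetical stand-in for a video decoder.

    Yields frame indices instead of images and records each index it
    "decodes" so we can see how much work was done.
    """
    for index in range(total):
        decoded_log.append(index)
        yield index


decoded: List[int] = []
sample = first_frames(fake_decoder(300, decoded))
print(len(sample), len(decoded))
```

In this sketch only the sampled frames are ever decoded, regardless of how long the video is. Sampling from the middle or end of a sequential stream would mean decoding (or at least seeking past) everything before it.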
One alternative strategy would be to pay the cost of decoding the entire video and run face detection on every frame. We could then select the portion of the video in which we "best" detected faces. The portion where it is easiest for us to find faces is likely also the portion where it was easiest for the deep fake algorithm to find them. That portion may therefore give us the strongest evidence of whether or not the video contains a deep fake.
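One way this alternative could look is sketched below. It assumes we already have one face-detection confidence score per frame (0.0 when no face is found) and it interprets "best" as the contiguous run of frames with the highest total confidence. Both the scoring and the name `best_window` are assumptions made for illustration; other definitions of "best" are equally plausible.

```python
from typing import Sequence, Tuple


def best_window(scores: Sequence[float], window: int = 16) -> Tuple[int, int]:
    """Return (start, end) of the contiguous window with the highest total score.

    `scores` holds one hypothetical face-detection confidence per frame.
    `end` is exclusive. Ties go to the earliest window.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    if len(scores) <= window:
        # The video is shorter than one window, so use all of it.
        return 0, len(scores)

    # Sliding-window sum: add the frame entering the window and
    # subtract the frame leaving it, instead of re-summing each time.
    current = sum(scores[:window])
    best_total, best_start = current, 0
    for start in range(1, len(scores) - window + 1):
        current += scores[start + window - 1] - scores[start - 1]
        if current > best_total:
            best_total, best_start = current, start
    return best_start, best_start + window


# Made-up scores: faces are detected most confidently in frames 10..13.
example_scores = [0.1] * 10 + [0.9] * 4 + [0.2] * 6
print(best_window(example_scores, window=4))
```

The window search itself is linear in the number of frames, so the dominant cost of this strategy remains decoding and running detection on the whole video.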