polimi-ispl / icpr2020dfdc

Video Face Manipulation Detection Through Ensemble of CNNs
GNU General Public License v3.0

Source code of Ensemble models #65

Closed thaondc closed 3 years ago

thaondc commented 3 years ago

Hi @CrohnEngineer, I followed your tutorial and I'm training the ensemble model, but I cannot find the code that combines different networks in this repository (e.g., combining EfficientNet B4 and EfficientNet AutoAtt B4). Can you help me find it?

CrohnEngineer commented 3 years ago

Hey @thaondc ,

If you need the code for computing the results we obtained in the paper by combining different models, take a look at the Analyze result net fusion notebook.
Instead, if you want a reference on how to have an ensemble model working at test time, I think our Kaggle notebook might be more helpful (the explanation of what we are doing is here).
In the end our approach is not particularly fancy: we just take all the models in the ensemble, load them, process the input sequentially with each model, and then average the scores obtained by each network. Hope this helps! Bests,

Edoardo
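
A minimal sketch of this score-averaging idea, assuming the models are already loaded and the faces already preprocessed (the function and argument names are illustrative, not the repository's code):

```python
import torch

@torch.no_grad()
def ensemble_score(models, faces):
    """Average the raw scores of several binary classifiers over a batch of faces.

    models: list of torch.nn.Module, already loaded and set to eval()
    faces:  tensor of shape (N, 3, H, W), already preprocessed
    """
    scores = []
    for model in models:
        # each model is assumed to output one logit per face, shape (N, 1)
        scores.append(model(faces).squeeze(1))
    # average the logits across the ensemble, then squash with a sigmoid
    mean_logits = torch.stack(scores, dim=0).mean(dim=0)
    return torch.sigmoid(mean_logits)
```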

thaondc commented 3 years ago

Hi @CrohnEngineer, I'm so sorry. I was infected with COVID two weeks ago, so I could not check my GitHub notifications. Thank you for your reply. I will study it based on your recommendation.

CrohnEngineer commented 3 years ago

Hey @thaondc ,

Don't worry, I hope you and your loved ones are fine and healthy! I've closed the issue, but feel free to ask other questions here or in another thread (also, past issues cover some problems we have faced over time). Bests,

Edoardo

thaondc commented 3 years ago

Hi @CrohnEngineer,

"Instead, if you want a reference on how to have an ensemble model working at test time, I think our Kaggle notebook might be more helpful (the explanation of what we are doing is here)."

I cannot access the URL in [here]. I think that discussion was closed. Can you help me check it?

CrohnEngineer commented 3 years ago

Hey @thaondc ,

I can access it without problems, even without signing in to Kaggle (by the way, the discussion is not closed, so you should be able to see the page just fine). In any case, the link redirected to a post by Nicolò explaining how our approach worked. The part you're interested in is how we do model ensembling in the Kaggle notebook; I'll report it down here:

Inference

At inference time, we considered 72 frames per video and looked at all the faces found by Blazeface, keeping only those with a score above a certain threshold. In case a frame had more than one face above the threshold, but with discordant scores, we took the maximum score among them. The rationale is that if we have multiple faces and just one of them is fake, we want to classify the frame as fake. We then averaged the scores of all the networks and of all the frames of the video, and computed the sigmoid.
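
A sketch of that aggregation logic, with face detection and per-network scoring abstracted away; the helper below is illustrative and only mirrors the max-over-faces, mean-over-networks-and-frames, then sigmoid procedure described above:

```python
import numpy as np

def video_score(per_frame_face_scores):
    """Aggregate per-face logits into a single fake probability for a video.

    per_frame_face_scores: one entry per sampled frame, each an array of shape
    (num_models, num_faces_kept) with the raw logit of every network for each
    face kept above the detector threshold.
    """
    frame_logits = []
    for face_scores in per_frame_face_scores:
        if face_scores.size == 0:
            continue  # no confident face detected in this frame
        # if a frame has several faces, keep the most "fake" one for each model
        per_model = face_scores.max(axis=1)
        # average across the networks of the ensemble
        frame_logits.append(per_model.mean())
    # average across the frames of the video, then apply the sigmoid
    mean_logit = float(np.mean(frame_logits))
    return 1.0 / (1.0 + np.exp(-mean_logit))
```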

thaondc commented 3 years ago

Thank you so much, @CrohnEngineer. I will refer to this inference procedure in my work.

thaondc commented 2 years ago

Hi @CrohnEngineer,

I have gone through your team's paper, code, and Kaggle discussion, and I have some questions. May I ask them?

1/ In the paper, Section I, it says: "Specifically, we consider the strong hardware and time constraints imposed by the DFDC [18]. This means that the proposed solution must be able to analyze 4000 videos in less than 9 hours using at most a single NVIDIA P100 GPU. Moreover, the trained models must occupy less than 1GB of disk space". I think the "4000 videos" mentioned here are the public test dataset on Kaggle, but the public test dataset on Kaggle has only 400 videos. May I ask which dataset the "4000 videos" refers to?

2/ In the paper, Section IV.C, it says: "for the end-to-end training, we either train for a maximum of 20k iterations, indicating as iteration the processing of a batch of 32 faces (16 real, 16 fake) taken randomly and evenly across all the videos of the train split, or until reaching a plateau on the validation loss. Validation of the model in this context is performed every 500 training iterations, on 6000 samples taken again evenly and randomly across all videos of the validation set. The initial learning rate is reduced of a 0.1 factor if the validation loss does not decrease after 10 validation routines (5000 training iterations), and the training is stopped when we reach a minimum learning rate of 1 × 10^(-10);" Does this mean you trained for a number of validation routines rather than for entire epochs? Did you take 6000 samples for each validation routine? May I ask how many routines and iterations you trained for?

3/ In your team's discussion on Kaggle, you said: "At inference time, we considered 72 frames per video and looked at all the faces found by Blazeface, keeping only those with a score above a certain threshold.". But in the paper, you chose 32 frames. May I ask how many frames you took per video?

Thank you.

nicobonne commented 2 years ago

Hi @thaondc,

1) The 4000-video dataset was the private Kaggle test set, not the public one. We didn't have access to this dataset during the competition; I think those videos are now part of the full DFDC.
2) As stated in the paper, an entire epoch was too long to finish in a reasonable time. We instead trained looking for a validation plateau on a random 6k-sample validation set, or stopped after 20k training batches.
3) We trained on 32 frames, but at inference time the number of frames was chosen to match the time/memory constraints of the competition's platform. We saw that with 72 frames per video we did not saturate the NVIDIA P100 memory and could process all 4k videos in less than 9 hours, so we went for it.
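
A sketch of this iteration-based schedule as described in the paper excerpt above, using PyTorch's ReduceLROnPlateau; model, train_loader, criterion, validate and the initial learning rate are placeholders, not the repository's exact code:

```python
import torch
from torch.optim.lr_scheduler import ReduceLROnPlateau

def train_binclass_sketch(model, train_loader, criterion, validate,
                          initial_lr=1e-4, max_iterations=20_000,
                          validate_every=500, min_lr=1e-10):
    """Validate every 500 batches, cut the LR by 0.1 after 10 validation
    routines without improvement, stop at 20k iterations or at the minimum LR.

    `validate` is a placeholder callable returning the mean loss over the
    ~6000 validation samples; `initial_lr` is illustrative only.
    """
    optimizer = torch.optim.Adam(model.parameters(), lr=initial_lr)
    scheduler = ReduceLROnPlateau(optimizer, mode='min', factor=0.1,
                                  patience=10, min_lr=min_lr)

    for iteration, (faces, labels) in enumerate(train_loader, start=1):
        model.train()
        optimizer.zero_grad()
        loss = criterion(model(faces).squeeze(1), labels.float())
        loss.backward()
        optimizer.step()

        if iteration % validate_every == 0:
            val_loss = validate(model)
            scheduler.step(val_loss)  # handles the patience-of-10 LR reduction
            # small tolerance to absorb floating-point error in the LR value
            if optimizer.param_groups[0]['lr'] <= min_lr * 1.001:
                break
        if iteration >= max_iterations:
            break
```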

thaondc commented 2 years ago

Hi @nicobonne,

Thank you for your answer, but I still have some questions. Can you help me with them?

  1. "We instead trained looking for a validation plateau on a random 6k validation samples, or stopping after 20k training batches." Does it mean you used 6k random validation samples for a validation routine? Besides, I trained many "routines" but the model is not converging, which means the learning rate is not less than 10^(-10). Can I ask you how many average routines did you train?
CrohnEngineer commented 2 years ago

Hey @thaondc ,

Does this mean you used 6k random validation samples for each validation routine?

Yes, that's true, with the samples taken from the videos of folders 35 to 40 of the train split.
The actual number of validation batches (and therefore iterations) is of course lower, depending on the batch size.

Besides, I have trained for many "routines" but the model is not converging, i.e. the learning rate has not dropped below 10^(-10). Can I ask how many routines you trained for, on average?

Usually the models overfitted long before the 20k batch iterations; they converged around 10k or so.
A small tip: by "convergence" we mean that the network's validation loss has not changed significantly over the last iterations. If I were you, I would look at the validation loss manually: there might be small fluctuations that change the value of the minimum loss reached, but that are not really significant in terms of order of magnitude. Those fluctuations might prevent the training script from stopping or from reducing the learning rate, even though the network has already reached good validation loss values. Hope this helps! Bests,

Edoardo
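
A small helper in the spirit of that tip: treat a change in the minimum validation loss as significant only beyond a relative tolerance (the 1% value is an arbitrary example, not a number from the paper):

```python
def has_plateaued(val_losses, window=10, rel_tol=0.01):
    """Return True if the best loss of the last `window` validation routines
    did not improve on the previous best by more than `rel_tol` (relative).

    Fluctuations smaller than the tolerance are ignored, so the stopping
    criterion is not fooled by insignificant changes of the minimum loss.
    """
    if len(val_losses) <= window:
        return False
    previous_best = min(val_losses[:-window])
    recent_best = min(val_losses[-window:])
    return recent_best >= previous_best * (1.0 - rel_tol)
```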

thaondc commented 2 years ago

Thank you so much for your help, @nicobonne @CrohnEngineer. I will try to do it again.

thaondc commented 2 years ago

Hi @CrohnEngineer,

I read your source code and am trying to create my own version of it. But I have some problems: I don't understand the code in train_binclass.py from line 336 to 347. What does it do? Also, what does the function tb_attention(), called at line 371 of train_binclass.py, do? Can you help me?

Thanks.

CrohnEngineer commented 2 years ago

Hey @thaondc ,

The function tb_attention() logs the attention masks computed by the EfficientNetAutoAtt models to TensorBoard for visualization purposes.
Lines 336 to 347 simply compute the attention masks for a couple of frames (one REAL and one FAKE), which are then logged using tb_attention(). This way you can check how the attention masks vary during the training of the EfficientNetAutoAtt models.
Nothing more :)
Glad you're digging the code and making your own version! Good luck with your experiments! Bests,

Edoardo
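
A minimal sketch of that kind of TensorBoard logging; get_attention() stands in for however the attention model exposes its mask and is an assumption here, not necessarily the repository's exact API:

```python
import torch
from torch.utils.tensorboard import SummaryWriter

def log_attention(writer: SummaryWriter, model, face, tag: str, step: int):
    """Log the attention mask computed for a single face to TensorBoard.

    `face` is a (3, H, W) tensor; `model.get_attention` is assumed to return
    a (1, 1, h, w) attention map for a batch containing that single face.
    """
    model.eval()
    with torch.no_grad():
        att = model.get_attention(face.unsqueeze(0))[0]  # -> (1, h, w)
    # normalize to [0, 1] so TensorBoard renders it as a grayscale image
    att = (att - att.min()) / (att.max() - att.min() + 1e-8)
    writer.add_image(tag, att, global_step=step)

# usage sketch:
# writer = SummaryWriter(log_dir='runs/autoatt')
# log_attention(writer, net, real_face, 'attention/real', iteration)
```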

thaondc commented 2 years ago

Thank you @CrohnEngineer. After receiving your response, I started training my own model. But I have a question: did you use pre-trained weights? For example, did you use EfficientNet weights trained on ImageNet? Do you think I should use weights pre-trained on ImageNet?

CrohnEngineer commented 2 years ago

Yes we did use pre-trained weights on ImageNet!
We also tried training the models from scratch, but the results were not as good as those obtained with models pre-trained on ImageNet.
This is an interesting topic of discussion as all the top-place solutions relied on pre-trained models, but it is still not clear how pre-training on ImageNet can provide an advantage from a forensic perspective.
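
For reference, starting from ImageNet weights is a one-liner with the efficientnet_pytorch package (a generic example, not tied to this repository's model classes):

```python
import torch.nn as nn
from efficientnet_pytorch import EfficientNet

# load ImageNet-pretrained weights, then replace the classifier head with a
# single real/fake logit so the backbone can be fine-tuned for detection
model = EfficientNet.from_pretrained('efficientnet-b4')
model._fc = nn.Linear(model._fc.in_features, 1)
```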

thaondc commented 2 years ago

Thank you.

thaondc commented 2 years ago

Hi @CrohnEngineer, I am trying to train a model on the FaceForensics++ dataset. Can I ask you some questions?

1/ In this dataset, you selected 720 videos for training, 140 for validation, and 140 for testing from the pool of original sequences. Did you take the videos in the order they appear in the DataFrame, or did you pick them randomly?

[screenshot of the FF++ original sequences DataFrame]

2/ For each video, did you take the real version from the YouTube folder? And for the fake versions, how many did you take from FaceSwap, Face2Face, DeepFakes, and NeuralTextures?

CrohnEngineer commented 2 years ago

Hey @thaondc ,

  1. We first randomly shuffled the rows of the DataFrame containing the info about the FF++ videos. We then took only the originals and put 720 in the training set, 140 in the validation set, and 140 in the test set. You can see this in the splits.py file, from line 54 to line 61;
  2. For each original video in the various splits, we took the real version from YouTube and then all the fake versions. In splits.py these are lines 62 to 69.

Hope this helps! Bests,
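
A sketch of that splitting logic in pandas; the `original` column, the video-name index, and the seed are illustrative assumptions, the authoritative version being the one in splits.py:

```python
import pandas as pd

def split_ffpp(df: pd.DataFrame, seed: int = 42):
    """Split the FF++ DataFrame into train/val/test as described above.

    Assumes (illustratively) that the DataFrame is indexed by video name and
    has an `original` column that is NaN for pristine videos and otherwise
    holds the name of the source video each fake was derived from.
    """
    # 1) shuffle the rows, then keep only the original (pristine) videos
    originals = df[df['original'].isna()].sample(frac=1, random_state=seed)
    train_orig = originals.index[:720]
    val_orig = originals.index[720:860]
    test_orig = originals.index[860:1000]

    # 2) each split gets its originals plus every fake derived from them
    def with_fakes(orig_index):
        fakes = df[df['original'].isin(orig_index)]
        return pd.concat([df.loc[orig_index], fakes])

    return with_fakes(train_orig), with_fakes(val_orig), with_fakes(test_orig)
```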