piotrkawa / deepfake-whisper-features

Implementation of the paper "Improved DeepFake Detection Using Whisper Features"
MIT License

Replicating paper results #2

Open iankur opened 9 months ago

iankur commented 9 months ago

Hi, I am trying to replicate the best results mentioned in the paper, i.e. unfrozen Whisper features combined with MFCC and MesoNet. The EER I get for the model from stage 1 (encoder frozen) is close to the frozen case reported in the paper. But when I use this checkpoint for further fine-tuning, i.e. unfreeze the encoder, I get an EER of 0.35, which is much worse than the reported result.

Any pointers on what could be going wrong? I am following the instructions provided in the repo for running these experiments, without any changes.

piotrkawa commented 9 months ago

Hi, could you please provide the exact configs you used? I will rerun the trainings and come back soon with the results.

Did you modify the codebase in any way, e.g. anything related to the datasets? What results do you achieve on the InTheWild dataset using the pretrained model we provide? Please give us as much information as possible to reproduce your results (including how you prepared the train and eval datasets).

iankur commented 9 months ago

@piotrkawa I did not modify the codebase, I just followed the commands provided in the README. I use the config from here for stage 1 training and then modify it for stage 2 fine-tuning by changing the lr, freeze_encoder and checkpoint path parameters. The exact command I used is the one provided here in the README (I only changed the number of epochs and the config path for stage 1 and stage 2).
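
Concretely, relative to the stage-1 config, my stage-2 config only differs in the fields below (the checkpoint path and the learning-rate value shown here are placeholders rather than my exact values):

# Stage-2 (fine-tuning) overrides; everything else stays as in the stage-1 config.
checkpoint:
  path: "trained_models/<stage1_run_dir>/ckpt.pth"   # placeholder: stage-1 checkpoint

model:
  parameters:
    freeze_encoder: False        # unfreeze the Whisper encoder for fine-tuning
  optimizer:
    lr: 0.000001                 # placeholder: lowered fine-tuning learning rate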

I am able to reproduce the numbers reported in the paper for the InTheWild dataset with the provided checkpoint. Let me know if you need anything else.

piotrkawa commented 9 months ago

"i changed epochs and config path for stage 1 and stage 2)" - what do you exactly mean by that? 1st training should be performed using 10 epochs, fine-tuning using 5 epochs.

Could you please retry training, but this time using valid_amount = 25,000 instead of test_amount? That is a mistake in the README - test_amount is in fact not used, as we evaluate on the full ITW dataset, whereas we should validate on 25k ASV21DF samples.
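
In other words, something along these lines (assuming the flag is exposed as --valid_amount, mirroring the other amount flags - please double-check against train_models.py):

python train_models.py \
--asv_path <path_to_ASVspoof2021_DF> \
--config <training_config>.yaml \
--batch_size 8 \
--epochs 10 \
--train_amount 100000 \
--valid_amount 25000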

Moreover - what ASV21DF labels do you use? Please refer to https://github.com/piotrkawa/deepfake-whisper-features/issues/3#issuecomment-1731267029.

iankur commented 9 months ago

I used 10 epochs for the 1st training and 5 epochs for fine-tuning, as mentioned in the paper. I am not using test_amount - as I mentioned earlier, I only adjusted the parameters according to the experimental procedure described in the paper, so I am using 25,000 for valid_amount. I am also using the label key referred to in the DeepFakeASVSpoofDataset class, which I think is the same one you linked above.

I can retrain the model again, but it seems there is nothing to change in my existing configuration. Let me know if you think otherwise or if there are other changes I should try.

chandlerbing65nm commented 9 months ago

Hi @piotrkawa,

I've encountered the same issue where I'm unable to replicate the results from the paper. Below are the details of my setup and the commands I've used:

Training Command

python train_models.py \
--asv_path /home/man-group/chandler/Datasets/ASVspoof2021/DF \
--config configs/training/whisper_mesonet.yaml \
--batch_size 8 \
--epochs 10 \
--train_amount 100000 \
--test_amount 25000

Training Configuration (whisper_mesonet.yaml)

data:
  seed: 42

checkpoint:
  path: ""

model:
  name: "whisper_mesonet"
  parameters:
    freeze_encoder: True
    input_channels: 1
    fc1_dim: 1024
    frontend_algorithm: []
  optimizer:
    lr: 0.0001
    weight_decay: 0.0001

Evaluation Command

python evaluate_models.py \
--in_the_wild_path /home/man-group/chandler/Datasets/release_in_the_wild \
--config ./configs/model__whisper_mesonet__1695441741.5227604.yaml \
--amount 25000

Evaluation Configuration (model__whisper_mesonet__1695441741.5227604.yaml)

checkpoint:
  path: /home/man-group/chandler/Experiments/deepfake-whisper-features/trained_models/model__whisper_mesonet__1695441741.5227604/ckpt.pth
data:
  seed: 42
model:
  name: whisper_mesonet
  optimizer:
    lr: 0.0001
    weight_decay: 0.0001
  parameters:
    fc1_dim: 1024
    freeze_encoder: true
    frontend_algorithm: []
    input_channels: 1

Results

From your paper, the EER(frozen) should be 0.3856.


However, I get this output below using the Evaluation Command:

chandlerbing65nm commented 9 months ago

Could you kindly assist me with the issue I mentioned above, @piotrkawa? I find the method in your paper to be potentially state-of-the-art and am planning to include it in our benchmark. However, I'm encountering difficulties in reproducing your results. Your guidance would be greatly appreciated.

piotrkawa commented 9 months ago

Thank you for the detailed description of the steps you followed to run the code. Let us take a look at the problem.

In the meantime, here are the checkpoints with which we achieved the results described in the paper: the best (MFCC+Whisper) MesoNet and (Whisper) MesoNet models.

We can provide more checkpoints if needed.

What is the environment you are using?

Additionally, we would appreciate your results for the models that do not use Whisper - e.g. (LFCC) SpecRNet.

chandlerbing65nm commented 9 months ago

@piotrkawa This is the machine environment I have:

(audiofake) man-group@mangroup-1:~/chandler$ uname -a
Linux mangroup-1 6.2.0-33-generic #33~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Sep  7 10:33:52 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
(audiofake) man-group@mangroup-1:~/chandler$ python --version
Python 3.10.13
(audiofake) man-group@mangroup-1:~/chandler$ nvidia-smi
Thu Sep 28 21:42:23 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3090        Off | 00000000:01:00.0 Off |                  N/A |
|  0%   46C    P8              22W / 420W |    140MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      1888      G   /usr/lib/xorg/Xorg                           71MiB |
|    0   N/A  N/A      2022      G   /usr/bin/gnome-shell                         58MiB |
+---------------------------------------------------------------------------------------+
(audiofake) man-group@mangroup-1:~/chandler/Experiments/deepfake-whisper-features$ pip freeze
asteroid-filterbanks==0.4.0
audioread==3.0.0
beautifulsoup4==4.12.2
bleach==6.0.0
brotlipy==0.7.0
cachetools==5.3.1
certifi @ file:///croot/certifi_1690232220950/work/certifi
cffi @ file:///croot/cffi_1670423208954/work
charset-normalizer @ file:///tmp/build/80754af9/charset-normalizer_1630003229654/work
cryptography @ file:///croot/cryptography_1694444244250/work
decorator==5.1.1
ffmpeg-python==0.2.0
filelock==3.12.4
fsspec==2023.9.1
future==0.18.3
gdown==4.7.1
google-api-core==2.12.0
google-api-python-client==2.101.0
google-auth==2.23.1
google-auth-httplib2==0.1.1
googleapis-common-protos==1.60.0
httplib2==0.22.0
huggingface-hub==0.17.2
idna @ file:///croot/idna_1666125576474/work
joblib==1.3.2
kaggle==1.5.16
librosa==0.9.2
llvmlite==0.40.1
mkl-fft @ file:///croot/mkl_fft_1695058164594/work
mkl-random @ file:///croot/mkl_random_1695059800811/work
mkl-service==2.4.0
more-itertools==10.1.0
numba==0.57.1
numpy==1.24.4
openai-whisper @ git+https://github.com/openai/whisper.git@7858aa9c08d98f75575035ecd6481f462d66ca27
packaging==23.1
pandas==2.0.2
Pillow @ file:///croot/pillow_1695134008276/work
platformdirs==3.10.0
pooch==1.7.0
protobuf==4.24.3
pyasn1==0.5.0
pyasn1-modules==0.3.0
pycparser @ file:///tmp/build/80754af9/pycparser_1636541352034/work
pyOpenSSL @ file:///croot/pyopenssl_1690223430423/work
pyparsing==3.1.1
PySocks @ file:///home/builder/ci_310/pysocks_1640793678128/work
python-dateutil==2.8.2
python-slugify==8.0.1
pytz==2023.3.post1
PyYAML==6.0.1
regex==2023.8.8
requests @ file:///croot/requests_1690400202158/work
resampy==0.4.2
rsa==4.9
safetensors==0.3.3
scikit-learn==1.3.1
scipy==1.11.2
six==1.16.0
soundfile==0.12.1
soupsieve==2.5
text-unidecode==1.3
threadpoolctl==3.2.0
tokenizers==0.13.3
torch==1.11.0
torchaudio==0.11.0
torchvision==0.12.0
tqdm==4.66.1
transformers==4.33.2
typing_extensions @ file:///croot/typing_extensions_1690297465030/work
tzdata==2023.3
uritemplate==4.1.1
urllib3==2.0.5
webencodings==0.5.1

Replication [w/o Whisper]

Training Command

python train_models.py \
--asv_path /home/man-group/chandler/Datasets/ASVspoof2021/DF \
--config configs/training/specrnet.yaml \
--batch_size 8 \
--epochs 10 \
--train_amount 100000 \
--test_amount 25000

Training Configuration (specrnet.yaml)

data:
  seed: 42

checkpoint:
  path: ""

model:
  name: "specrnet"
  parameters:
    input_channels: 1
    frontend_algorithm: ["lfcc"]
  optimizer:
    lr: 0.0001
    weight_decay: 0.0001

Evaluation Command

python evaluate_models.py \
--in_the_wild_path /home/man-group/chandler/Datasets/release_in_the_wild \
--config ./configs/model__specrnet__1695874202.629074.yaml \
--amount 25000

Evaluation Configuration (model__specrnet__1695874202.629074.yaml)

checkpoint:
  path: /home/man-group/chandler/Experiments/deepfake-whisper-features/trained_models/model__specrnet__1695874202.629074/ckpt.pth
data:
  seed: 42
model:
  name: specrnet
  optimizer:
    lr: 0.0001
    weight_decay: 0.0001
  parameters:
    frontend_algorithm:
    - lfcc
    input_channels: 1

Results

From your paper, the (LFCC) SpecRNet EER should be 0.5184.


However, I get this output below using the Evaluation Command:

- eval/eer: 0.6368
- eval/accuracy: 34.0634
- eval/precision: 0.3027
- eval/recall: 0.0381
- eval/f1_score: 0.0676
- eval/auc: 0.3068
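
(For reference, eval/eer above is the equal error rate - the operating point at which the false-acceptance and false-rejection rates are equal. A minimal sketch of the usual way to compute it from per-sample labels and scores is shown below; evaluate_models.py may differ in details.)

import numpy as np
from sklearn.metrics import roc_curve

def compute_eer(labels, scores):
    # labels: 1 for the positive class, 0 otherwise; scores: higher means "more positive"
    fpr, tpr, _ = roc_curve(labels, scores, pos_label=1)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))  # threshold where FPR and FNR cross
    return float((fpr[idx] + fnr[idx]) / 2.0)
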
piotrkawa commented 9 months ago

Thank you for your patience. We have indeed noticed reproducibility issues on other machines.

We tried downloading the datasets once again, running the code in a fresh conda env, and running it outside of the Docker env we originally worked in - in all of these cases, on our machine, we obtained the same results.

So while the results are reproducible on the machine we used to prepare this work, they are not on other machines (https://discuss.pytorch.org/t/different-result-on-different-gpu/102502). This is also consistent with the reports above: @iankur achieved similar results (at least for the 1st stage), whereas @chandlerbing65nm achieved significantly different ones.

That is why we conclude that the problem more likely lies in the hardware than in a dataset mismatch or a bug in the codebase.

The specs of the machine used to prepare the paper are as follows:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA TITAN RTX               Off | 00000000:1A:00.0 Off |                  N/A |
| 41%   33C    P8               5W / 280W |    769MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA TITAN RTX               Off | 00000000:1B:00.0 Off |                  N/A |
| 40%   43C    P8              15W / 280W |      3MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA TITAN RTX               Off | 00000000:1E:00.0 Off |                  N/A |
|123%   77C    P2             259W / 280W |   7507MiB / 24576MiB |     98%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA TITAN RTX               Off | 00000000:3F:00.0 Off |                  N/A |
| 41%   30C    P8              13W / 280W |      3MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   4  NVIDIA TITAN RTX               Off | 00000000:40:00.0 Off |                  N/A |
| 40%   32C    P8              11W / 280W |      3MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

requirements:

(mldev) $ pip freeze
asteroid-filterbanks==0.4.0
audioread==3.0.1
brotlipy==0.7.0
certifi @ file:///croot/certifi_1690232220950/work/certifi
cffi @ file:///croot/cffi_1670423208954/work
charset-normalizer @ file:///tmp/build/80754af9/charset-normalizer_1630003229654/work
cryptography @ file:///croot/cryptography_1694444244250/work
decorator==5.1.1
ffmpeg-python==0.2.0
filelock==3.12.4
fsspec==2023.9.2
future==0.18.3
huggingface-hub==0.17.3
idna @ file:///croot/idna_1666125576474/work
joblib==1.3.2
librosa==0.9.2
llvmlite==0.41.0
mkl-fft @ file:///croot/mkl_fft_1695058164594/work
mkl-random @ file:///croot/mkl_random_1695059800811/work
mkl-service==2.4.0
more-itertools==10.1.0
numba==0.58.0
numpy==1.25.2
openai-whisper @ git+https://github.com/openai/whisper.git@7858aa9c08d98f75575035ecd6481f462d66ca27
packaging==23.1
pandas==2.0.2
Pillow @ file:///croot/pillow_1695134008276/work
platformdirs==3.10.0
pooch==1.7.0
pycparser @ file:///tmp/build/80754af9/pycparser_1636541352034/work
pyOpenSSL @ file:///croot/pyopenssl_1690223430423/work
PySocks @ file:///home/builder/ci_310/pysocks_1640793678128/work
python-dateutil==2.8.2
pytz==2023.3.post1
PyYAML==6.0.1
regex==2023.8.8
requests @ file:///croot/requests_1690400202158/work
resampy==0.4.2
safetensors==0.3.3
scikit-learn==1.3.1
scipy==1.11.3
six==1.16.0
soundfile==0.12.1
threadpoolctl==3.2.0
tokenizers==0.13.3
torch==1.11.0
torchaudio==0.11.0
torchvision==0.12.0
tqdm==4.66.1
transformers==4.33.3
typing_extensions @ file:///croot/typing_extensions_1690297465030/work
tzdata==2023.3
urllib3 @ file:///croot/urllib3_1686163155763/work

As mentioned, we retried the (LFCC) SpecRNet experiment on another machine - an RTX 3090 (Driver Version: 470.141.03, CUDA Version: 11.4) - and got the same results as you reported (eer=0.6368 etc.). Moreover, during this investigation we noticed that other seeds may yield better results: e.g. on the RTX 3090, training (LFCC) SpecRNet with seed=1234 resulted in an EER of 0.5855 (versus 0.6368 for seed=42, which we typically use in our research). The discrepancy is not limited to the Whisper-based architectures we propose in this work - it appears, for instance, in LCNN models as well: on the RTX 3090 we got EER=0.7051 for the (LFCC) LCNN architecture instead of the 0.77 on the TITAN RTX reported in the paper.

We provide all of the models reported in our paper for clarity and as a confirmation of our results.

We will keep investigating this issue; however, at this moment we can suggest the following ways to reproduce our results: 1) use the models we provide, 2) train using different seeds (you can expect both better and worse results relative to the reported ones), 3) run training using the exact environment and machine we did, 4) try a lower learning rate.

Moreover, please note that our research was based on 125k training samples with no augmentation whatsoever - to further improve these results one can use the full datasets and apply data augmentation techniques.
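
For completeness, the seeding/determinism setup we refer to looks roughly as follows in PyTorch (a generic sketch - train_models.py may already do something equivalent). Note that it only reduces run-to-run variance on a single machine; it does not make different GPU models agree:

import random

import numpy as np
import torch

def set_reproducible(seed: int = 42) -> None:
    # Pin every RNG touched during training.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Prefer deterministic cuDNN kernels (slower, but removes per-run variance
    # on the same GPU; cross-GPU differences remain).
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False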

chandlerbing65nm commented 9 months ago

Hi @piotrkawa. Firstly, I want to express my gratitude for your detailed explanation concerning the reproducibility issues I encountered with your paper. Your comprehensive insights and the troubleshooting steps you've provided are invaluable.

I understand the challenges associated with ensuring the reproducibility of results across different hardware configurations. To mitigate this and to allow for a more universal benchmark, would it be possible to rerun the experiments in Table 3, specifically those involving frozen and fine-tuned Whisper features, using multiple different seeds? Calculating the average performance metrics along with their standard deviation would provide a more reliable measure of the model's capabilities.

If this is an additional task that you're unable to undertake at the moment, I would be more than willing to run these experiments on our end and provide you with the updated results. This collaborative effort would contribute to the robustness and credibility of the published work.

Importantly, this would enable us to cite your work in our upcoming benchmarks. We would point to the updated results on your GitHub page as the source, rather than the original paper, given the updated nature of the findings.

We look forward to seeing these updated results on GitHub, which would serve as an essential resource for all researchers in this field.

Thank you once again for your time and significant contributions to this field.

iankur commented 9 months ago

@piotrkawa, not related to reproducibility, but is it intended that the same batch norm layer is applied to two different inputs here and here? I just found out that this issue has already been raised against the PyTorch implementation that this repo reuses. Also, correct me if I'm wrong, but there seems to be no activation function in the inception layer, whereas the original Keras implementation of MesoInception does have a ReLU nonlinearity.
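
To illustrate what I mean, here is a simplified, hypothetical sketch (not the repo's actual code) of the pattern I am asking about, next to a per-branch variant with a ReLU that is closer to what I would expect:

import torch
import torch.nn as nn

class SharedBNNoAct(nn.Module):
    # Pattern in question: one BatchNorm2d reused on two different conv outputs, no activation.
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        self.conv2 = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm2d(out_ch)  # shared affine parameters and running stats

    def forward(self, x):
        a = self.bn(self.conv1(x))  # same BN applied to ...
        b = self.bn(self.conv2(x))  # ... a second, differently distributed input
        return torch.cat([a, b], dim=1)

class PerBranchBNReLU(nn.Module):
    # Variant with a separate BatchNorm2d per branch plus a ReLU nonlinearity.
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU()

    def forward(self, x):
        a = self.act(self.bn1(self.conv1(x)))
        b = self.act(self.bn2(self.conv2(x)))
        return torch.cat([a, b], dim=1)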

Could you also please share why this work does not compare against wav2vec- and graph-attention-based systems?

piotrkawa commented 9 months ago

Hi @chandlerbing65nm, unfortunately I currently have to focus on my PhD dissertation and will not be able to rerun these experiments in the short term, but I encourage you to do so and to run the trainings on different machines and seeds.

However, please bear in mind that, as we state in our paper, we only used a subset of the ASVspoof 2021 DF dataset and no augmentation techniques - the main focus was on the benefits of using the Whisper model and on comparisons with other front-ends. Exposing models to multiple attacks (i.e. training on multiple datasets such as the ASVspoof sets, WaveFake, FakeAVCeleb, ADD, etc.) and enriching the representation with augmentation techniques (e.g. audiomentations, RawBoost, etc.) is common practice in the field, as it significantly improves the models. In my opinion, the benchmark would benefit from a unified training procedure - i.e. similar training datasets and training techniques - as these factors are highly influential.
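
For anyone who does extend the training setup, a typical waveform-level augmentation pipeline with audiomentations looks roughly like this (illustrative only - it is not part of this repo's training code, and parameter names may differ slightly across audiomentations versions):

import numpy as np
from audiomentations import Compose, AddGaussianNoise, TimeStretch, PitchShift

augment = Compose([
    AddGaussianNoise(min_amplitude=0.001, max_amplitude=0.015, p=0.5),
    TimeStretch(min_rate=0.9, max_rate=1.1, p=0.5),
    PitchShift(min_semitones=-2, max_semitones=2, p=0.5),
])

# audio: mono float32 waveform in [-1, 1], e.g. loaded with librosa/soundfile at 16 kHz
audio = np.random.uniform(-0.5, 0.5, 16000).astype(np.float32)
augmented = augment(samples=audio, sample_rate=16000)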

piotrkawa commented 9 months ago

@iankur we used the implementation of MesoNet provided with the FakeAVCeleb baseline code.

We cited the methods you mentioned; however, we did not include them in the benchmark due to the manuscript space limit. Moreover, a reliable comparison would require training these models in the same setting (training set, number of epochs, etc.), and we did not want to dilute the paper by adding a few more models.