scu-zjz / IMDLBenCo

[NeurIPS'24 Spotlight] A comprehensive benchmark & codebase for Image manipulation detection/localization.
https://scu-zjz.github.io/IMDLBenCo-doc
Creative Commons Attribution 4.0 International

The inconsistent inference results #34

Open kasteric opened 2 months ago

kasteric commented 2 months ago

Hi, I found that for the same IML-ViT checkpoint, the inference results on CASIA v1 obtained through this framework (IMDLBenCo) are much lower (~12%) than those computed with the original IML-ViT codebase (~70%).

SunnyHaze commented 2 months ago

Thanks for your attention to our project! And sorry for the delay.

Sorry for the misleading results. Could you please attach the corresponding log for each experiment so that we can analyze and locate the issue?

kasteric commented 2 months ago

I located the issue: the data augmentations are inconsistent. For the trained checkpoints you used resize with padding, but the evaluation code in IMDLBenCo uses resize without padding, so the training and test data distributions are not aligned. On my custom dataset, I found that resize without padding works better than resize with padding. Did you observe similar results?

SunnyHaze commented 2 months ago

Hi. If you use the demo_test_iml_vit.sh generated by the command benco init model_zoo, I believe the images will be resized with padding, through the if_padding parameter.

https://github.com/scu-zjz/IMDLBenCo/blob/e4f59d0d01bc85326eceefa0061788d187870670/IMDLBenCo/statics/model_zoo/runs/demo_test_iml_vit.sh#L1-L20

I am not sure how you trained IML-ViT with resize without padding, since the original code design only supports 1024x1024 input. In most cases we do not apply a traditional resize; we keep the raw resolution and pad the image to 1024x1024. Could you please describe your implementation in detail so we can discuss it? Thank you very much.
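To illustrate what "keep the raw resolution and pad to 1024x1024" means, here is a minimal NumPy sketch. It is not the IMDLBenCo or IML-ViT implementation: the helper name pad_to_square, the zero fill value, and the top-left placement are all assumptions made for illustration.

```python
import numpy as np


def pad_to_square(image: np.ndarray, size: int = 1024) -> np.ndarray:
    """Zero-pad an H x W x C image to size x size without resizing it.

    Assumption: the raw pixels are placed in the top-left corner and the
    rest is filled with zeros; the real transform may differ.
    """
    h, w = image.shape[:2]
    if h > size or w > size:
        raise ValueError(f"image {h}x{w} does not fit into {size}x{size}")
    padded = np.zeros((size, size) + image.shape[2:], dtype=image.dtype)
    padded[:h, :w] = image  # raw pixels are copied unchanged
    return padded


# Hypothetical 256 x 384 RGB image, used only as an example input.
img = np.full((256, 384, 3), 127, dtype=np.uint8)
out = pad_to_square(img)
print(out.shape)
```

With padding, every original pixel keeps its value and scale; only the canvas grows.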

kasteric commented 2 months ago

Oh, I was not using demo_test_iml_vit.sh for testing; I used demo_train_iml_vit.sh for evaluation, with the evaluation code placed before the training code in train.py. In the generated demo_train_iml_vit.sh, I believe the configs look like this:

base_dir="./output_dir_imlvit_orig"
mkdir -p ${base_dir}

CUDA_VISIBLE_DEVICES=1 \
torchrun  \
    --standalone    \
    --nnodes=1     \
    --nproc_per_node=1 \
../train.py \
    --model IML_ViT \
    --edge_lambda 20 \
    --vit_pretrain_path ../mae_pretrain_vit_base.pth \
    --world_size 1 \
    --batch_size 3 \
    --data_path  /<casia_v2> \
    --epochs 200 \
    --lr 1e-4 \
    --image_size 1024 \
    --if_resizing \
    --min_lr 5e-7 \
    --weight_decay 0.05 \
    --edge_mask_width 7 \
    --test_data_path /<casia_v1> \
    --warmup_epochs 2 \
    --output_dir ${base_dir}/ \
    --log_dir ${base_dir}/ \
    --accum_iter 8 \
    --seed 42 \
    --test_period 4 \
    --resume /<resumed.pth>

Here if_resizing is set to True, so the data transform is a resize without padding, as in the code below:

self.post_transform = None
if is_padding == True:
    self.post_transform = get_albu_transforms(type_ = "pad", output_size = output_size)
if is_resizing == True:
    self.post_transform = get_albu_transforms(type_ = "resize", output_size = output_size)
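Note that in the snippet as quoted, the two if statements are independent, so if both flags were True the "resize" transform would overwrite the "pad" one. A possible stricter selection is sketched below; choose_post_transform is a hypothetical helper, not part of IMDLBenCo, and it returns only the transform type string rather than calling get_albu_transforms.

```python
from typing import Optional


def choose_post_transform(is_padding: bool, is_resizing: bool) -> Optional[str]:
    """Return "pad", "resize", or None, rejecting ambiguous flag combinations.

    Hypothetical helper for illustration; it mirrors the quoted snippet
    except that setting both flags raises instead of silently resizing.
    """
    if is_padding and is_resizing:
        raise ValueError("set only one of is_padding / is_resizing")
    if is_padding:
        return "pad"
    if is_resizing:
        return "resize"
    return None  # same as the quoted code: no post transform


print(choose_post_transform(is_padding=True, is_resizing=False))
print(choose_post_transform(is_padding=False, is_resizing=True))
```

Failing loudly on conflicting flags makes a train/test preprocessing mismatch easier to notice.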

After I manually set the albu transform type to "pad", the results are consistent. My conclusion is that the gap arose not because "pad" is better than "resize", but because your checkpoints were trained in "pad" mode.

On my custom dataset, however, I found that "resize" mode yields better results by 1 or 2 percent.

SunnyHaze commented 2 months ago

Thank you for your feedback.

I see your point. Generally, a deep neural network fits a function to a data distribution, so keeping the training distribution similar to the testing distribution is essential. This matches your conclusion that the gap arose not because "pad" is better than "resize", but because the checkpoints were trained in "pad" mode.

Further, there are many possible explanations for the performance on your custom dataset, such as:

  1. The aspect ratio of your images is close to 1:1, i.e. the resizing operation won't distort the image much.
  2. The resolution of your images is relatively high compared with CASIAv2.
  3. Other factors may need to be discussed in a case study on those datasets.
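The first two points can be made concrete with a small calculation. The sketch below shows how much a plain resize to 1024x1024 stretches an image along each axis. The function name and the example sizes are hypothetical, chosen only for illustration and not taken from any dataset.

```python
def resize_scale_factors(height: int, width: int, target: int = 1024):
    """Return (scale_y, scale_x, distortion) for a plain resize to target x target.

    distortion is max(scale) / min(scale); 1.0 means the aspect ratio
    is preserved, larger values mean a stronger stretch along one axis.
    """
    scale_y = target / height
    scale_x = target / width
    distortion = max(scale_y, scale_x) / min(scale_y, scale_x)
    return scale_y, scale_x, distortion


# Hypothetical image sizes: one near-square and large, one small and wide.
for h, w in [(1000, 1000), (256, 384)]:
    sy, sx, d = resize_scale_factors(h, w)
    print(f"{h}x{w}: scale_y={sy:.2f}, scale_x={sx:.2f}, distortion={d:.2f}")
```

A near-square, high-resolution image is barely changed by resizing, while a small, non-square one is both upscaled strongly and stretched. Padding, by contrast, leaves both scale factors at 1.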

Thanks again for your attention to our project. If you consider the issue solved, please close it. You are also welcome to raise any further concerns or problems you encounter.