huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Accuracy regression of ViT #28264

Closed blzheng closed 7 months ago

blzheng commented 8 months ago

System Info

Who can help?

@amyeroberts

Information

Tasks

Reproduction

Accuracy regression caused by https://github.com/huggingface/transformers/pull/19796

Reproduce command: python transformers/examples/pytorch/image-classification/run_image_classification.py --model_name_or_path google/vit-base-patch16-224 --do_eval --dataset_name imagenet-1k --per_device_eval_batch_size 1 --remove_unused_columns False --output_dir ./

Expected behavior

Expected results:
eval_accuracy = 0.8131
eval_loss = 0.7107
eval_runtime = 0:43:40.30
eval_samples_per_second = 19.082
eval_steps_per_second = 19.082

Current results:
eval_accuracy = 0.8033
eval_loss = 0.755
eval_runtime = 0:34:05.81
eval_samples_per_second = 24.44
eval_steps_per_second = 0.436

amyeroberts commented 8 months ago

Hi @blzheng, thanks for raising this issue!

#19796 was merged over a year ago, and there have been a few subsequent updates to the image processing logic since then. Could you confirm how you narrowed it down to this commit?

What performance do you get, running on main with different seeds?

blzheng commented 8 months ago

Hi @amyeroberts, we observed an accuracy drop from 0.8131 (transformers==4.18.0) to 0.8033 (transformers==4.28.1), and I narrowed it down to this commit with git bisect. The issue reproduces stably, even on the latest codebase, by running the following command:

python transformers/examples/pytorch/image-classification/run_image_classification.py --model_name_or_path google/vit-base-patch16-224 --do_eval --dataset_name imagenet-1k --per_device_eval_batch_size 1 --remove_unused_columns False --output_dir ./
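For reference, the git bisect workflow used here can be demonstrated end to end in a throwaway repository. The commits, the accuracy file, and the numbers below are stand-ins for the real transformers history and eval runs; in practice each bisect step would reinstall the checkout and re-run the eval command above.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email bisect@example.com
git config user.name bisect

# Four stand-in commits; the third introduces the "regression".
echo 0.8131 > accuracy; git add accuracy; git commit -qm "good release"
git commit -q --allow-empty -m "unrelated change"
echo 0.8033 > accuracy; git commit -qam "processing change"
git commit -q --allow-empty -m "later work"

# Mark the tip as bad and the known-good ancestor as good.
git bisect start HEAD HEAD~3
# `git bisect run` marks each checkout good/bad from the command's exit code;
# here the "test" just checks the recorded accuracy against the good value.
git bisect run sh -c 'grep -q 0.8131 accuracy' > /dev/null
bad_msg=$(git log -1 --format=%s refs/bisect/bad)
git bisect reset > /dev/null 2>&1
echo "first bad commit: $bad_msg"
```

With a real eval that takes ~30 minutes per run, `git bisect run` keeps the number of runs logarithmic in the number of candidate commits.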

amyeroberts commented 8 months ago

@blzheng Thanks for confirming.

The reason for the change is that the processing logic in the image classification script was updated to match that of the model's image processor.

Previously, size could be an int and was passed directly to torchvision.transforms.Resize. If size is an int (which it is for many models, e.g. here for a ViT checkpoint), then the shortest edge of the image is resized to size and the other edge is scaled to preserve the image's aspect ratio.

However, in the now-deprecated feature extractors (superseded in #19796), the default behaviour when size was an int was to resize the image to (size, size). This was the case for ViT.

The script now reflects the behaviour of the image processor, even when using torchvision transforms.
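The two behaviours can be sketched in plain Python (the helper names and the 640x480 input are illustrative; the shortest-edge arithmetic mirrors torchvision's documented rule for an int size):

```python
def shortest_edge_resize_dims(width, height, size):
    # torchvision.transforms.Resize(size) with an int `size`: the shorter
    # edge becomes `size`; the longer edge is scaled to keep the aspect ratio.
    short, long = (width, height) if width <= height else (height, width)
    new_short, new_long = size, int(size * long / short)
    return (new_short, new_long) if width <= height else (new_long, new_short)

def square_resize_dims(width, height, size):
    # The deprecated feature extractor's default for an int `size`: resize
    # to (size, size), discarding the aspect ratio entirely.
    return (size, size)

# A 640x480 image with ViT's size=224:
print(shortest_edge_resize_dims(640, 480, 224))  # (298, 224)
print(square_resize_dims(640, 480, 224))         # (224, 224)
```

So under the old feature-extractor behaviour the model always saw square inputs, while Resize with an int fed it aspect-preserving inputs that a later center crop would square off.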

blzheng commented 8 months ago

@amyeroberts Thanks for the information. Given that the changes to the image processing logic are reasonable, does that mean the accuracy drop is expected?

amyeroberts commented 8 months ago

@blzheng It depends on what you mean by "expected". The change in logic means the aspect ratio of the input images is different, so one would expect a performance difference. Even though it's not in line with the processing of the model's image processor, the previous processing might give better performance because it preserves the true aspect ratio of the images, and hence the shape/dimensions of the subjects in them (this is speculation).
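The distortion being described can be made concrete by comparing per-axis scale factors under the two schemes (the 640x480 input and the helper below are illustrative, not from the thread; equal factors mean the aspect ratio is preserved):

```python
def axis_scale_factors(in_w, in_h, out_w, out_h):
    # How much each axis is stretched; unequal factors distort subjects.
    return out_w / in_w, out_h / in_h

# Shortest-edge resize of a 640x480 image at size=224 yields roughly 298x224:
sx, sy = axis_scale_factors(640, 480, 298, 224)  # ~0.466, ~0.467 (uniform)

# Square resize to 224x224 squeezes the x axis harder than the y axis:
tx, ty = axis_scale_factors(640, 480, 224, 224)  # 0.350, ~0.467 (distorted)

print(round(sx, 3), round(sy, 3), round(tx, 3), round(ty, 3))
```

A circular object in the 640x480 input stays circular under the first scheme but becomes an ellipse under the second, which is one plausible mechanism for the accuracy gap.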

github-actions[bot] commented 7 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.