Optimize Transformers' image_processors to decrease image processing time and reduce inference latency for vision models and VLMs.
Motivation
The Transformers library relies on PIL (Pillow) for image preprocessing, which can become a major bottleneck during inference, especially with compiled models where the preprocessing time can dominate the overall inference time.
In the examples above, the RT-DETR preprocessing requires only resizing the image, while the DETR one involves resize + normalize.
In eager mode, image preprocessing accounts for a large share of the total inference time for RT-DETR, but is not the main bottleneck. With a compiled RT-DETR, however, image preprocessing takes up the majority of the inference time, underlining the need to optimize it. This is even clearer for DETR, where image preprocessing is already the main bottleneck in eager mode.
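To see where the time goes in your own setup, a minimal timing sketch is below. The stand-in workloads and the commented-out calls are illustrative placeholders, not measurements from the benchmarks above; substitute your actual image processor and model calls.

```python
# Illustrative sketch: measure preprocessing vs. forward-pass time separately.
# The lambdas below are stand-ins; with a real pipeline you would pass e.g.
#   timed(lambda: image_processor(image, return_tensors="pt"))
#   timed(lambda: model(**inputs))
# and compare the two averages to see which one dominates.
import time

def timed(fn, n=50):
    """Average wall-clock seconds per call over n repetitions."""
    start = time.perf_counter()
    for _ in range(n):
        fn()
    return (time.perf_counter() - start) / n

pre = timed(lambda: sum(i * i for i in range(10_000)))  # stand-in "preprocessing"
fwd = timed(lambda: sum(i * i for i in range(1_000)))   # stand-in "forward pass"
print(f"preprocess share of total: {pre / (pre + fwd):.0%}")
```

With a compiled model the forward-pass term shrinks, which is exactly why the preprocessing share grows so visibly.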
However, alternative libraries exist that leverage available hardware more efficiently for faster image preprocessing.
OptimVision uses such libraries to get much better results compared to Transformers.
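As a rough illustration of the idea (not OptimVision's actual implementation), preprocessing can stay in tensor land and skip PIL entirely. The 640×640 target size and the ImageNet mean/std below are illustrative defaults, and `fast_preprocess` is a hypothetical helper, not a Transformers API:

```python
# Sketch: tensor-based resize + normalize that avoids PIL. Runs batched,
# and on GPU if the input tensor is moved there first.
import torch
import torch.nn.functional as F

def fast_preprocess(image: torch.Tensor, size=(640, 640)) -> torch.Tensor:
    """image: uint8 tensor of shape (H, W, C). Returns float32 (1, C, H', W')."""
    mean = torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1)
    std = torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1)
    x = image.permute(2, 0, 1).unsqueeze(0).float() / 255.0  # NCHW in [0, 1]
    x = F.interpolate(x, size=size, mode="bilinear", antialias=True)
    return (x - mean) / std

img = torch.randint(0, 256, (480, 640, 3), dtype=torch.uint8)  # fake photo
out = fast_preprocess(img)
print(out.shape)  # torch.Size([1, 3, 640, 640])
```

Because every step is a torch op, the whole pipeline can be vectorized over a batch or offloaded to the same device as the model, which is where most of the speedup over per-image PIL calls comes from.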
More details on OptimVision and a comparison of image processing methods are available on this Notion page.
Your contribution
OptimVision is an experiment playground to optimize the different steps involved in inferring/training with vision models.
The current fast image preprocessing in OptimVision is a proof of concept and is not yet ready to be merged into Transformers, but that is the ultimate goal :).