Open gyf304 opened 1 month ago
Thank you for the feature request @gyf304. I think this is something we'll eventually want to enable.
The main challenge here isn't to implement the feature, it's to expose it in a way that isn't going to provide users with a massive footgun.
It is very important for the resizing algorithm (bilinear vs. bicubic vs. nearest neighbor, with or without antialiasing) to be consistent between training and inference time. When it's not, model accuracy regresses in ways that are very difficult to debug. This has caused a lot of confusion for users over time (e.g. back when the default of the antialias parameter of torchvision's Resize wasn't consistent between PIL and Tensors).
So, if we're going to expose a resizing mechanism outside of torchvision's Resize(), e.g. in decode_image(), we'll have to ensure that the new resizing implementation is consistent with what Resize() exposes, and we should make it hard for users to end up with inconsistent resizing parameters.
@NicolasHug I accidentally fat-fingered and clicked "Comment and Close Issue" - GitHub unfortunately does not allow me to reopen this issue.
I think this concern can be mitigated by:
JPEG resize during decode is performed at the IDCT level, meaning it operates in the frequency domain. The process is somewhat comparable to applying a sinc filter*.
* This isn’t entirely accurate, as JPEG processes 8x8 blocks, whereas a true sinc filter is unbounded.
* A Lanczos filter, previously referred to as Antialias filter in Pillow, can be seen as a truncated approximation of a sinc filter.
Since JPEG resize during decode is limited to predefined scaling factors, the final output size may not precisely match the requested size_hint.
For example, calling decode_image("image.jpg", size_hint=(224, 224)) on a JPEG image guarantees a decoded image that is at least (224, 224), if possible. If an exact size is required, users should follow up with Resize((224, 224)).
It's not feasible to expect that:
resize(decode_image("image.jpg", size_hint=(224, 224)), (224, 224))
will always yield the same result as:
resize(decode_image("image.jpg"), (224, 224))
However, the difference should be minimal.
This feature, as proposed, is opt-in and does not modify how Resize() functions. Additionally, its docstring can include a clear warning about its implications to help users make informed decisions.
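The size of that difference can be checked today with Pillow, whose Image.draft configures the same kind of decoder-level JPEG downscaling. The following is only a rough sketch: it uses a synthetic gradient in place of a real image.jpg, picks bicubic resampling arbitrarily, and the helper names (make_test_jpeg, full_decode_then_resize, draft_decode_then_resize) are made up for illustration. It is not the proposed torchvision API.

```python
import io

from PIL import Image, ImageChops


def make_test_jpeg(width=1920, height=1080):
    """Encode a synthetic gradient as an in-memory JPEG (stand-in for image.jpg)."""
    img = Image.linear_gradient("L").resize((width, height)).convert("RGB")
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=90)
    return buf.getvalue()


def full_decode_then_resize(data, size):
    """Decode at full resolution, then resize to the exact target size."""
    im = Image.open(io.BytesIO(data)).convert("RGB")
    return im.resize(size, Image.BICUBIC)


def draft_decode_then_resize(data, size):
    """Ask the JPEG decoder to downscale first, then resize to the exact size."""
    im = Image.open(io.BytesIO(data))
    im.draft("RGB", size)   # must be called before the pixel data is loaded
    draft_size = im.size    # size the decoder will actually produce
    return im.convert("RGB").resize(size, Image.BICUBIC), draft_size


data = make_test_jpeg()
target = (224, 224)

reference = full_decode_then_resize(data, target)
drafted, draft_size = draft_decode_then_resize(data, target)

# Largest absolute per-channel difference between the two pipelines.
diff = ImageChops.difference(reference, drafted)
max_abs_diff = max(high for _, high in diff.getextrema())

print("decoder output size:", draft_size)
print("max per-channel difference:", max_abs_diff)
```

Running this on real training images, with the same interpolation mode and antialias setting used in the actual pipeline, would give a more meaningful picture than a smooth gradient does.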
It may also be worth allowing users to pass these predefined scale factors (1/2, 1/4, 1/8) directly instead of size_hint.
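To make the relationship between the two options concrete, here is a minimal sketch of how a size hint could be mapped onto one of the predefined scale factors. It assumes the "at least as large as the hint, if possible" semantics described above and assumes scaled dimensions are rounded up, which is my understanding of how libjpeg computes output dimensions. The function names are hypothetical and not part of torchvision.

```python
# Scale denominators supported during decode, most aggressive first.
SCALE_DENOMINATORS = (8, 4, 2, 1)


def scaled_size(width, height, denom):
    """Output size for a 1/denom decode, rounding each dimension up (assumption)."""
    return (width + denom - 1) // denom, (height + denom - 1) // denom


def pick_scale_denominator(width, height, hint_w, hint_h):
    """Pick the strongest downscale whose output still covers the size hint.

    Falls back to 1 (full resolution) when the image is already smaller
    than the hint, matching the "if possible" caveat.
    """
    for denom in SCALE_DENOMINATORS:
        out_w, out_h = scaled_size(width, height, denom)
        if out_w >= hint_w and out_h >= hint_h:
            return denom
    return 1


if __name__ == "__main__":
    for hint in [(960, 540), (224, 224), (4000, 4000)]:
        denom = pick_scale_denominator(1920, 1080, *hint)
        print(hint, "-> 1/%d" % denom, scaled_size(1920, 1080, denom))
```

Exposing the scale factor directly would skip the selection step entirely, at the cost of making users compute the resulting size themselves.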
🚀 The feature
Torchvision's read_image currently decodes JPEG images at full resolution. However, both libjpeg and libjpeg-turbo support decoding at lower resolutions (1/2, 1/4, 1/8 of the original size).
Introducing a size_hint parameter would allow users to specify an approximate target size, with torchvision selecting the closest larger available scale factor and downscaling the JPEG image during decoding.
Example Usage:
Motivation, pitch
Image.draft, allowing for approximate size-based decoding.
Alternatives
Additional context
Benchmark
We implemented a proof-of-concept and ran performance tests on decoding a 1920x1080 image into 960x540. We compared the following:
decode_jpeg and resize after.
decode_jpeg to allow libjpeg/libjpeg-turbo downscaling via the size_hint parameter.
Benchmark results (1000 iters):
~2.5X speed up.
I'm happy to contribute a patch if people consider this useful.