pytorch / vision

Datasets, Transforms and Models specific to Computer Vision

https://pytorch.org/vision

BSD 3-Clause "New" or "Revised" License

15.71k stars 6.87k forks source link

[BC-breaking] Aligning `fill = None` with `fill = 0` on Transforms #6623

Open datumbox opened 1 year ago

datumbox commented 1 year ago

🚀 The feature

As discussed at https://github.com/pytorch/vision/issues/6517, passing None on fill for some transforms isn't the same as passing 0. The difference applies on some cases but not others, it was potentially done to avoid JIT issues (or is it a bug?) and has unforeseen effects on the behaviour of the Transforms.

Currently on the prototype area:

All functionals have fill=None as default value.
All Transforms have fill=0 as default.
All AutoAugment Transforms have fill=None.

The above inconsistent behaviour maintains BC with stable but leads to a weird API.

Motivation, pitch

Break BC on the functional level and make fill=None behave similar to fill=0.

This unfortunately will have accuracy effects on models trained with the Tensor API. Worth noting that such a change would align closer with PIL which already sets fill=None to 0: https://github.com/pytorch/vision/blob/56e707bfccb62ada836d21e431d6db0d10dd73a1/torchvision/transforms/functional_pil.py#L264-L265

This proposal could be adopted if we decide to break BC and align the Tensor and PIL backends on other aspects (such as antialias etc).

Alternatives

There are 2 other potential alternatives:

Keep the Transforms and functionals as-is and make no breaking changes.
Make a breaking change on the Transforms level (AutoAugment) but keep the functionals as-is. This will minimize the effect of the change on model inference for pre-trained models but maintain the discrepancy between the 2 backends.

Additional context

No response

cc @vfdev-5

NicolasHug commented 1 year ago

Trying to understand this more, is the problem that:

fill has different defaults across APIs?
fill=0 is different from fill=None?
both?

If we were to have fill with the same default across APIs, would we still have a problem?

vfdev-5 commented 1 year ago

Trying to understand this more, is the problem that:

-> answer is both

3rd alternative is: to try to put fill=0 in functional ops and make fill=0 behave as fill=None.

Checking the codebase history, I have an impression is that default fill 0 or None has its root in PIL API:

Fill argument exists for pad, affine, rotate and others affine-like functional ops.

1) For PIL input pad invokes ImageOps.expand with fill which can't have None value -> thus fill is 0 by default in torchvision for pad

2) affine invokes PIL Image.transform with fill which can take None or 0 and produces same result with 0 pixels. PR with affine in torchvision had fill=None by default

3) rotate invokes PIL Image.rotate with fill which can take None or 0 and produces same result with 0 pixels. rotate was with fill=0 for some time and then was fixed to None (https://github.com/pytorch/vision/pull/1760)

So, making output result for all the ops the same if fill=None and fill=0, would be a move forward. Then we can pick None or 0 as default value and make API uniform on that...

datumbox commented 1 year ago

@vfdev-5 Thanks for the comprehensive reply. Do you think I should add the alternative as a 3rd option? Or perhaps I could just edit the "proposal" to suggest that we should be using 0 as defaults and just handle None similarly? My understanding is that the 2 proposals are similar and it's a matter of which default value we choose to expose (I agree that 0 is probably better).

NicolasHug commented 1 year ago

Thanks for the details @vfdev-5 .

passing None on fill for some transforms isn't the same as passing 0

I'm looking at https://github.com/pytorch/vision/issues/6517 but I can't pin-point exactly what the difference is. Would you have more details? Is it that the "filling" simply doesn't happen when fill=None, or is it something more subtle than that?

Also instead of a BC breaking change, is deprecation an option here?

vfdev-5 commented 1 year ago

I'm looking at https://github.com/pytorch/vision/issues/6517 but I can't pin-point exactly what the difference is. Would you have more details? Is it that the "filling" simply doesn't happen when fill=None, or is it something more subtle than that?

Yeah, this issue description may be unclear. The problem arises when we call _apply_grid_transform with mode="bilinear" + 1) fill=None and 2) fill=0. Two outputs wont match. I assume that outputs should match because for me fill=None and fill=0 is the same configuration. Also, by the way outputs are matching if mode="nearest" + similar configs for fill.

Why results are not matching for bilinear ?

Because when fill is None we do only grid sampling (which does padding_mode="zeros"). If fill=0 we create additional mask channel, concat it to the input image, then grid sample, extract transformed mask and apply it with fill values (which are 0 again) to the output image. The way how we blend mask with fill value and the image is probably incorrect for bilinear mode (see images in the issue, there is a black shadow around rotated image for current torchvision. This shadow is absent for PIL). So, the problem comes from blending zero fill values to already transformed image with zero fill.

Before fixing blending, we may ask why we bother doing all this mask fancy stuff for fill=0 if grid sample already does that for us. So, I would expect that fill=0 should be a no-op like fill=None in terms of computing mask and manually applying fill value.

Thus, point 1 in the issue is about making fill=0 option as like fill=None without computing any fill mask. Point 2 is about doing something with fill value mask blending for mode=bilinear.

NicolasHug commented 1 year ago

Just had a chat with @vfdev-5 and @pmeier , I'll try to briefly summarize the situation on V1 transforms for my own understanding (possibly re-hashing what has already been said above). These are general statements that do not necessarily apply to all transforms in all cases (sometimes, only a subset of them):

fill has inconsistent default values across transforms (None for AA, 0 for the rest). Functionals mostly have None as default except for Pad (0).
fill may have a different default value between the transforms and their corresponding functionals. E.g. Perspective (0 in transforms, None in functional).
passing the same fill value to a transform and to a functional may lead to different results (because some transforms convert None to 0, which is an undocumented behaviour)
on the functionals, passing fill=0 or fill=any_color may result in different results between PIL and Tensors (passing fill=None should result in consistent results). Whether those differences are within the range of acceptable differences or not is unclear (to me).

All of these are issues we'll need to fix eventually. We don't know how yet.

As discussed in https://github.com/pytorch/vision/pull/7258, the V2 transform introduce a minor BC-breaking change by not doing this anymore:

some transforms convert None to 0, which is an undocumented behaviour

I.e. if users pass None to a transform, the functional will now get None as well in V2. This allows users who explicitly pass None to get consistent results between PIL and Tensor backends. It's slightly BC because this may lead to different results (compared to V1) for users who a) were passing None explicitly which was never documented and b) were using the Tensor backend. The PIL backend is unchanged because the pil functionals always do the None->0 conversion anyway.

pmeier commented 1 year ago

on the functionals, passing fill=0 or fill=any_color may result in different results between PIL and Tensors (passing fill=None should result in consistent results). Whether those differences are within the range of acceptable differences or not is unclear (to me).

To narrow it down a little, this only happens for InterpolationMode.BILINEAR. In general it applies to all interpolation modes that can create intermediate values, e.g. bicubic, but the affine functions are limited to nearest and bilinear interpolation:

https://github.com/pytorch/vision/blob/a192c95e77a4a4de3a8aeee45130ddc4d2773a83/torchvision/transforms/v2/functional/_geometry.py#L567

6517 goes into details, but here is the TL;DR: PyTorch's `grid_sample` does not support filling by any color and is equivalent to `fill=0`. If the user requests any fill color, we apply the same affine transformation to a boolean mask and use that to apply the filling later. However for bilinear interpolation this creates values besides `0` and `1` and so we have to decide how we want to deal with this. Current implementation accepts these extra values and uses them to blend between the fill color and the original image. However this creates shadow artifacts.