pytorch / vision

Datasets, Transforms and Models specific to Computer Vision
https://pytorch.org/vision
BSD 3-Clause "New" or "Revised" License

[RFC] How do we want to deal with images that include alpha channels? #5510

Closed: pmeier closed this issue 2 years ago

pmeier commented 2 years ago

This discussion started in https://github.com/pytorch/vision/pull/5500#discussion_r816503203 and @vfdev-5 and I continued offline.

PIL as well as our image reading functions support RGBA images

https://github.com/pytorch/vision/blob/95d418970e6dbf2e4d928a204c4e620da7bccdc0/torchvision/io/image.py#L16-L31

but our color transformations currently only support RGB images, ignoring any extra alpha channel. This leads to wrong results. One thing that we agreed upon is that these transforms should fail if anything other than 3 channels is detected.

Still, some datasets include non-RGB images, so we need to deal with this for a smooth UX. Previously, we implicitly converted every image to RGB before returning it from a dataset:

https://github.com/pytorch/vision/blob/f9fbc104c02f277f9485d9f8727f3d99a1cf5f0b/torchvision/datasets/folder.py#L245-L249

Since we no longer decode images in the datasets, we need to provide a solution for the users here. I currently see two possible options:

  1. We could deal with this on a per-image basis within the dataset. For example, the train split of ImageNet contains a single RGBA image. We could simply perform an appropriate conversion for irregular image modes in the dataset so this issue is abstracted away from the user. tensorflow-datasets uses this approach: https://github.com/tensorflow/datasets/blob/a1caff379ed3164849fdefd147473f72a22d3fa7/tensorflow_datasets/image_classification/imagenet.py#L105-L131
  2. The most common non-RGB images in datasets are grayscale images. For example, the train split of ImageNet contains 19970 grayscale images. Thus, users will need a transforms.ConvertImageColorSpace("rgb") in most cases anyway. If that also supported RGBA to RGB conversions, the problem would be solved as well. The conversion happens with this formula:

    pixel_new = (1 - alpha) * background + alpha * pixel_old

    where pixel_{old|new} is a single value from a color channel. Since we don't know background, we need to either make assumptions or require the user to provide a value for it. I'd wager that in 99% of cases the background is white, i.e. background == 1, but we can't be sure about that.

    Another issue with this is that the user has no option to set the background on a per-image basis in the transforms pipeline if that is needed.

    In the special case where alpha == 1 everywhere, the equation above simplifies to

    pixel_new = pixel_old

    which is equivalent to stripping the alpha channel. We could check for that and only perform the RGBA to RGB transform if the condition holds or the user supplies a background color (see the sketch after this list).
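
For illustration, here is a minimal sketch of the conversion from option 2, assuming a float tensor of shape (4, H, W) with values in [0, 1] and a scalar background (the function name is hypothetical, not an existing torchvision API):

import torch

def rgba_to_rgb(image: torch.Tensor, background: float = 1.0) -> torch.Tensor:
    # Split the image into color and alpha channels.
    rgb, alpha = image[:3], image[3:]
    # pixel_new = (1 - alpha) * background + alpha * pixel_old
    return (1.0 - alpha) * background + alpha * rgb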

cc @pmeier @vfdev-5 @datumbox @bjuncek

pmeier commented 2 years ago

I personally would go with a combined approach:

  1. If we are aware of any images in a dataset outside of what we can convert automatically without assumptions, the dataset should handle these images.
  2. Add a background parameter to convert_image_color_space. If it is not passed and the alpha channel is not 1 everywhere, fail (see the sketch below).
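
A minimal sketch of what point 2 could look like, with convert_image_color_space simplified to the RGBA-to-RGB case and a float input in [0, 1] assumed:

from typing import Optional

import torch

def convert_rgba_to_rgb(image: torch.Tensor, background: Optional[float] = None) -> torch.Tensor:
    rgb, alpha = image[:3], image[3:]
    if background is None:
        # Without a background value, stripping the alpha channel is only
        # safe if it carries no information, i.e. it is 1 everywhere.
        if not torch.all(alpha == 1.0):
            raise ValueError("alpha channel carries information; pass `background` explicitly")
        return rgb
    return (1.0 - alpha) * background + alpha * rgb
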
vfdev-5 commented 2 years ago

I agree that for torchvision-supported datasets, if the alpha channel is not used later by a CNN, we can strip it from the beginning. For example, since ImageNet can contain RGBA images, it would be better to provide only RGB images, as is done right now in the stable API. One way I could think of doing that is to have preset datapipe transforms that could be deactivated:

dp = datasets.load("imagenet", decode_presets=datasets.imagenet.decode_preset)

When you say that "image reading functions support RGBA images": is RGBA conventional, as in PIL's "RGBA" mode, or premultiplied, as in PIL's "RGBa" mode? (For the difference, see https://shawnhargreaves.com/blog/premultiplied-alpha.html.)

Another question: can features.Image have ColorSpace set to RGBA?

Some thoughts on ConvertImageColorSpace("rgb"): this means that we should be aware of the current mode and possibly be able to transform to HSV, HSL, and other modes (e.g. like OpenCV).

By the way, PIL handles the alpha channel correctly in its transformations, so the impacted color transforms are only those related to features.Image.

As for choosing the background as 1, 255, or white: a problem can be seen with geometric transforms, where the default fill color is 0, i.e. black. Either the user has to set a white fill color everywhere, or they will have two types of no-data pixels: black and white...

datumbox commented 2 years ago

possibly be able to transform to HSV, HSL, and other modes

I don't think that's the focus right now. Maybe in the long, long term, but definitely not something we plan to do anytime soon.

As for choosing the background as 1, 255, or white: .....

+1 on that. Given that our existing functional transforms have fill=0, does it make sense to go with black?

pmeier commented 2 years ago

@vfdev-5

I agree that for torchvision-supported datasets, if the alpha channel is not used later by a CNN, we can strip it from the beginning.

I might be pedantic here, but I want to be sure we are all on the same page: we cannot simply strip the alpha channel unless the alpha channel is all 1's or 255's, depending on the input type. If that is not the case, stripping the alpha channel has an effect on the actual colors of the image, as can be seen in the equation above. For example, a pure red pixel (1, 0, 0) with alpha == 0.5 on a white background blends to pink (1, 0.5, 0.5), whereas stripping the alpha channel leaves it pure red.

For example, since ImageNet can contain RGBA images, it would be better to provide only RGB images, as is done right now in the stable API. One way I could think of doing that is to have preset datapipe transforms that could be deactivated:

How would this approach know which image needs the special conversion? If we have a general approach like this, we also need to solve the conversion in a general sense.

When you say that "image reading functions support RGBA images": is RGBA conventional, as in PIL's "RGBA" mode, or premultiplied, as in PIL's "RGBa" mode?

I don't know. @datumbox has introduced the enum. Do you have any insights here?

Another question: can features.Image have ColorSpace set to RGBA?

Yes. For now, we only support

https://github.com/pytorch/vision/blob/71d2bb0bc67044f55d38bfddf04e05be0343deab/torchvision/prototype/features/_image.py#L16-L19

But if we decide how we want to handle RGBA, we can of course add it.

Some thoughts on ConvertImageColorSpace("rgb"): this means that we should be aware of the current mode and

That is already implemented. features.Image and PIL.Image.Image store this information. Only for vanilla tensors does the user need to provide it manually:

https://github.com/pytorch/vision/blob/71d2bb0bc67044f55d38bfddf04e05be0343deab/torchvision/prototype/transforms/_meta.py#L55-L78

possibly be able to transform to HSV, HSL, and other modes (e.g. like OpenCV).

I agree with @datumbox here. While the API allows all these conversions, we shouldn't add anything unless there is a specific user request. If we ever have other color spaces that are equivalent to RGB, such as HSV or HSL, we will definitely need a color_space parameter on all color transforms and will have to convert to RGB before applying the transform.

As for choosing the background as 1, 255, or white: a problem can be seen with geometric transforms, where the default fill color is 0, i.e. black. Either the user has to set a white fill color everywhere, or they will have two types of no-data pixels: black and white...

The alpha channel is different from a fill color. Choosing an alpha of all white means no transparency, and thus the background is irrelevant. For all of our current datasets, RGBA images are outliers that were most likely not intended by the authors. They happen when images are scraped from the internet and the person who uploaded them mistakenly included an alpha channel although there was no need for it. For example, this is the RGBA image from ImageNet (in thumbnail size to limit copyright implications) including the alpha channel:

[image: imagenet_rgba]

vfdev-5 commented 2 years ago

How would this approach know which image needs the special conversion? If we have a general approach like this, we also need to solve the conversion in a general sense.

Maybe I'm missing something important, but for me everything could just follow the same logic the current API uses with Pillow images: when an image is read and decoded, you have a color space. The prior information is that for ImageNet we provide RGB-only images. So, all images with a non-RGB color space should be converted to RGB.

pmeier commented 2 years ago

So, all images with a non-RGB color space should be converted to RGB.

That is the crucial point I was touching on in this issue: how are we going to implement the conversion from RGBA to RGB? Not all image files carry information about the background color in the first place; the image above is one of them. In such a case, PIL falls back to Floyd-Steinberg dithering.

Not only would we need to implement custom kernels for that, but dithering might also significantly reduce the image quality. Compare this test image

.convert("RGB")'ed by PIL with Floyd-Steinberg

[image: dithered]

or converted by the formula above assuming a white background

[image: converted]

datumbox commented 2 years ago

So the two key approaches discussed here are:

  1. Handle the RGBA to RGB conversion exactly like PIL, using Floyd-Steinberg dithering
  2. Use the formula provided by Philip and assume a white background

What is the approach used by CV2?

pmeier commented 2 years ago

I'll dig into what PIL is exactly doing, because what I failed to realize is that they also seem to assume a white background for the dithering.

pmeier commented 2 years ago

PIL assumes a white background.

datumbox commented 2 years ago

@pmeier Can you also check CV2? If they do the same, I think it would make sense to make the same assumption in TorchVision.

@vfdev-5 Thoughts?

pmeier commented 2 years ago

CV2

I'm not 100% sure, but my best guess is that white is also assumed. I've tracked it down to this OpenCL call. There are multiple issues on the OpenCV repo, such as opencv/opencv#13135, that report that the conversion does not take alpha into account. I think the consensus is that one should not use cv2.cvtColor for RGBA -> RGB conversions, but rather blend manually if the alpha channel actually contains information. Ignoring the alpha channel is effectively the same as setting it to white.

skimage

skimage has a background parameter on its rgba2rgb() conversion function. It defaults to white.

Thus, I think we are fairly safe if we also assume white and provide an option to set it manually if required.
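
For reference, a quick way to see the difference between the two libraries (a sketch assuming opencv-python and scikit-image are installed):

import cv2
import numpy as np
from skimage.color import rgba2rgb

# A 2x2 fully transparent black image.
rgba = np.zeros((2, 2, 4), dtype=np.uint8)

# OpenCV simply drops the alpha channel, yielding black pixels ...
rgb_cv2 = cv2.cvtColor(rgba, cv2.COLOR_RGBA2RGB)

# ... while skimage blends against its background parameter (white by
# default), yielding white pixels as floats in [0, 1].
rgb_skimage = rgba2rgb(rgba)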

pmeier commented 2 years ago

After some offline discussion, we decided to align with PIL for now. The only difference is that we will fail the transformation if the alpha channel is not the maximum value everywhere (see the sketch below). This way we can later implement the correct conversion, as detailed in my top comment, without worrying about BC.
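
A minimal sketch of that behavior, assuming a channels-first RGBA tensor (the function name is hypothetical):

import torch

def strip_alpha(image: torch.Tensor) -> torch.Tensor:
    # image: tensor of shape (4, H, W), either floating point or integer.
    max_value = 1.0 if image.dtype.is_floating_point else torch.iinfo(image.dtype).max
    rgb, alpha = image[:3], image[3:]
    # Align with PIL by stripping the alpha channel, but fail loudly if it
    # actually carries information.
    if not torch.all(alpha == max_value):
        raise RuntimeError("RGBA image has a non-trivial alpha channel")
    return rgb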

vfdev-5 commented 2 years ago

PIL assumes a white background.

@pmeier You mean the rgba2rgb method, https://github.com/python-pillow/Pillow/blob/c58d2817bc891c26e6b8098b8909c0eb2e7ce61b/src/libImaging/Convert.c#L443-L453, right?

How do you see that it assumes a white background? From the implementation we can say that the pointers in and out have 4 values per pixel. With the last in++ we ignore the fourth value, and with *out++ = 255; we set the output's fourth value to 255.

The pixel size is 4 even for RGB: https://github.com/python-pillow/Pillow/blob/95cff6e959bb3c37848158ed2145d49d49806a31/src/libImaging/Storage.c#L125-L129

In such a case PIL falls back to Floyd-Steinberg dithering.

The PIL docs say to use dithering for conversions like RGB to P or 1 (https://github.com/python-pillow/Pillow/blob/92c26a77ca53a2bfbd8804f009c6c8755d0e5a43/src/PIL/Image.py#L921-L922), and the same holds in the code:

https://github.com/python-pillow/Pillow/blob/c58d2817bc891c26e6b8098b8909c0eb2e7ce61b/src/libImaging/Convert.c#L1572-L1574

https://github.com/python-pillow/Pillow/blob/c58d2817bc891c26e6b8098b8909c0eb2e7ce61b/src/libImaging/Convert.c#L1608-L1610

pmeier commented 2 years ago

You are right. PIL also seems to only "strip" the alpha channel by setting it to 255 everywhere:

import PIL.Image
import torch
from torchvision.transforms.functional import pil_to_tensor

image = PIL.Image.open("text2.png")
# Drop the alpha channel by slicing ...
a = pil_to_tensor(image)[:3, ...]
# ... and compare against PIL's own RGBA -> RGB conversion.
b = pil_to_tensor(image.convert("RGB"))

# Passes: both approaches yield the same RGB values.
torch.testing.assert_close(a, b)

Thus, the behavior is aligned with OpenCV. This doesn't change the plan from https://github.com/pytorch/vision/issues/5510#issuecomment-1061529561.

bsun0802 commented 1 year ago

You are right. PIL also seems to only "strip" the alpha channel by setting it to 255 everywhere: [...]

Thus, the behavior is aligned with OpenCV. This doesn't change the plan from #5510 (comment).

Isn't "strip" the alpha channel means assuming a black background? If I take a RGBA image with transparent background and call Image.convert(), it will become RGB with black background. Which is different from white background acchieved from alpha_composite().

[image]

pmeier commented 1 year ago

@bsun0802 Could you share the image so I can take a look?

bsun0802 commented 1 year ago

@pmeier [image: GDsO1wOe_31QL2AEAAAAAABOhk1Rbs0HAAAD]

pmeier commented 1 year ago

Sorry, I was a little slow this morning. Note that what you call the "background color", i.e. the black region around the people, is not the same as the "background" we are talking about here; it is actually foreground.

In case there is transparency in the image, the background color is determined by the canvas you are drawing on. In your case that is the white of the notebook. If you put the same image on a green canvas, the background will be green. In other words, you are using the alpha channel here to mask out the black background the image originally has.

This issue is about what we want to do in case we are forced to convert from RGBA to RGB but don't have a canvas to draw on. PIL just strips the alpha channel, thus revealing all colors in the image that were previously masked, i.e. the black regions in your case (note that there is also some white in the top left corner that you cannot see if you display the image on a white canvas).
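
To make the difference concrete, here is a small sketch of both behaviors with PIL (the file name is a placeholder):

import PIL.Image

image = PIL.Image.open("rgba_image.png").convert("RGBA")

# Stripping the alpha channel reveals the originally masked (here: black) pixels.
stripped = image.convert("RGB")

# Compositing onto an explicit canvas blends through the alpha channel
# instead; here the masked regions become white.
canvas = PIL.Image.new("RGBA", image.size, (255, 255, 255, 255))
composited = PIL.Image.alpha_composite(canvas, image).convert("RGB")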