I personally would go with a combined approach: a `background` parameter to `convert_image_color_space`. If it is not passed and the alpha channel is not 1 everywhere, fail. I agree that for torchvision-supported datasets, if the alpha channel is not used later by a CNN, we can strip it from the beginning. For example, as you say, ImageNet can have RGBA images, so it would be better to provide only RGB images, as is done right now in the stable API. One way I could think of to do that is to have preset datapipe transforms that could be deactivated:
```python
dp = datasets.load("imagenet", decode_presets=datasets.imagenet.decode_preset)
```
When you say that "image reading functions support RGBA images": is RGBA conventional (straight) alpha, as in PIL's "RGBA" mode, or premultiplied, as in PIL's "RGBa" mode? (For the difference, see https://shawnhargreaves.com/blog/premultiplied-alpha.html.)
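For what it's worth, a minimal sketch of the difference between the two conventions for a single pixel with values in [0, 1] (the names and values are just for illustration):

```python
import torch

# one pixel with straight (non-premultiplied) alpha, as in PIL's "RGBA" mode
rgb = torch.tensor([0.8, 0.2, 0.2])
alpha = 0.5

# PIL's "RGBa" mode stores the color already multiplied by alpha
rgb_premultiplied = rgb * alpha

# compositing over a background uses a different formula per convention,
# but yields the same result
background = torch.tensor([1.0, 1.0, 1.0])  # white
from_straight = alpha * rgb + (1 - alpha) * background
from_premultiplied = rgb_premultiplied + (1 - alpha) * background
assert torch.allclose(from_straight, from_premultiplied)
```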
Another question: can `features.Image` have `ColorSpace` set to RGBA?
Some thoughts on `ConvertImageColorSpace("rgb")`: this means that we should be aware of the current mode and could possibly transform to HSV, HSL, and other modes (e.g. like OpenCV does). By the way, PIL handles the alpha channel correctly in its transformations, so the affected color transforms are only the ones operating on `features.Image`.
As for choosing `background` as 1, 255, or white: a problem can be seen with geometric transforms, where the default fill color is 0, i.e. black. Either the user has to set the fill color to white everywhere, or they end up with two kinds of no-data pixels: black and white...
> ... could possibly transform to HSV, HSL, and other modes ...
I don't think that's the focus right now. Maybe in the long, long term, but definitely not something we plan to do anytime soon.
> As for choosing `background` as 1, 255, or white ...
+1 on that. Given that our existing functional transforms have fill=0, does it make sense to go with black?
@vfdev-5
> I agree that for torchvision-supported datasets, if the alpha channel is not used later by a CNN, we can strip it from the beginning.
I might be pedantic here, but I want to be sure we are all on the same page: we cannot simply strip the alpha channel unless it is all ones or all 255s, depending on the input dtype. If that is not the case, stripping the alpha channel changes the actual colors of the image, since the conversion is `pixel_new = alpha * pixel_old + (1 - alpha) * background`.
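A quick numeric illustration of that point (a single channel value, white background, everything in [0, 1]):

```python
pixel_old, alpha, background = 0.2, 0.5, 1.0

converted = alpha * pixel_old + (1 - alpha) * background  # 0.6
stripped = pixel_old                                      # 0.2, i.e. visibly darker
```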
> For example, as you say, ImageNet can have RGBA images, so it would be better to provide only RGB images, as is done right now in the stable API. One way I could think of to do that is to have preset datapipe transforms that could be deactivated: ...
How would this approach know which image needs the special conversion? If we have a general approach like this, we also need to solve the conversion in a general sense.
> When you say that "image reading functions support RGBA images": is RGBA conventional (straight) alpha, as in PIL's "RGBA" mode, or premultiplied, as in PIL's "RGBa" mode?
I don't know. @datumbox has introduced the enum. Do you have any insights here?
> Another question: can `features.Image` have `ColorSpace` set to RGBA?
Yes. For now, the `ColorSpace` enum only covers a limited set of color spaces that does not include RGBA. But if we decide how we want to handle RGBA, we can of course add it.
> Some thoughts on `ConvertImageColorSpace("rgb")`: this means that we should be aware of the current mode and ...
That is already implemented: both `features.Image`'s and `PIL.Image.Image`'s store this information. Only for vanilla tensors does the user need to provide it manually:
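(A self-contained sketch of what I mean, not the actual torchvision signature; `old_color_space` is just an illustrative parameter name.)

```python
import PIL.Image
import torch

def convert_to_rgb(image, old_color_space=None):
    # sketch only: a PIL image carries its mode ("RGB", "RGBA", "L", ...) with it ...
    if isinstance(image, PIL.Image.Image):
        old_color_space = image.mode
    # ... while a vanilla tensor has no such metadata, so the caller must pass it
    elif isinstance(image, torch.Tensor) and old_color_space is None:
        raise ValueError("old_color_space is required for plain tensors")
    # actual conversion omitted
    return image
```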
> ... could possibly transform to HSV, HSL, and other modes (e.g. like OpenCV does).
I agree with @datumbox here. While the API allows all these conversions, we shouldn't add anything unless there is a specific user request. If we had other color spaces that are equivalent to RGB, such as HSV or HSL, we would definitely need a `color_space` parameter on all color transforms and would have to convert to RGB before applying the transform.
> As for choosing `background` as 1, 255, or white: a problem can be seen with geometric transforms, where the default fill color is 0, i.e. black. Either the user has to set the fill color to white everywhere, or they end up with two kinds of no-data pixels: black and white...
The alpha channel is different from a fill color. An alpha channel that is at its maximum value everywhere means no transparency, and thus the background is irrelevant. For all of our current datasets, RGBA images are outliers that were most likely not intended by the authors. They happen when images are scraped from the internet and the person who uploaded them mistakenly included an alpha channel although there was no need for it. For example, this is the RGBA image from ImageNet (in thumbnail size to limit copyright implications) including the alpha channel:
> How would this approach know which image needs the special conversion? If we have a general approach like this, we also need to solve the conversion in a general sense.
Maybe I'm missing something important, but to me everything could just follow the same logic the current API uses with Pillow images: when an image is read and decoded, you have a color space. The prior information is that for ImageNet we provide RGB-only images. So, all images with a non-RGB color space should be converted to RGB.
> So, all images with a non-RGB color space should be converted to RGB.
That is the crucial point I was touching on in this issue. How are we going to implement the conversion from RGBA to RGB? Not all image files carry information about the background color in the first place. The image above is one of them. In such a case PIL falls back to Floyd-Steinberg dithering.
Not only would we need to implement custom kernels for that, but dithering might also significantly reduce the image quality. Compare this test image `.convert("RGB")`'ed by PIL with Floyd-Steinberg dithering against the same image converted by the formula above assuming a white background.
So the two key approaches discussed here are: PIL's `.convert("RGB")` with Floyd-Steinberg dithering, and the compositing formula with an assumed white background.
What is the approach used by CV2?
I'll dig into what exactly PIL is doing, because what I failed to realize is that they also seem to assume a white background for the dithering.
@pmeier Can you also check CV2? If they do the same, I think it would make sense to make the same assumption in TorchVision.
@vfdev-5 Thoughts?
I'm not 100% sure, but my best guess is that white is also assumed. I've tracked it to this OpenCL call. There are multiple issues on the OpenCV repo, such as opencv/opencv#13135, that report that the conversion does not take alpha into account. I think the consensus is that one should not use `cv2.cvtColor` for RGBA -> RGB conversions, but rather blend manually if the alpha channel actually contains information. Ignoring the alpha channel is effectively the same as setting it to the maximum value everywhere.
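To make "ignoring the alpha channel" concrete, a small sketch of the two behaviors (assuming cv2 and numpy are available):

```python
import cv2
import numpy as np

# a fully transparent gray image
rgba = np.zeros((2, 2, 4), dtype=np.uint8)
rgba[..., :3] = 200
rgba[..., 3] = 0

# cvtColor drops the alpha channel, i.e. treats the image as fully opaque
rgb_cvt = cv2.cvtColor(rgba, cv2.COLOR_RGBA2RGB)

# blending against a white background respects the transparency
alpha = rgba[..., 3:4].astype(np.float64) / 255
rgb_blend = (alpha * rgba[..., :3] + (1 - alpha) * 255).astype(np.uint8)

print(rgb_cvt[0, 0], rgb_blend[0, 0])  # [200 200 200] vs. [255 255 255]
```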
skimage has a `background` parameter on their `rgba2rgb()` conversion function. It defaults to white.
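For reference, the call looks like this (the background is given as RGB floats in [0, 1]):

```python
import numpy as np
from skimage.color import rgba2rgb

rgba = np.random.rand(8, 8, 4)                   # H x W x 4 float image in [0, 1]
on_white = rgba2rgb(rgba)                        # background defaults to (1, 1, 1)
on_black = rgba2rgb(rgba, background=(0, 0, 0))
```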
Thus, I think we are fairly safe if we also assume white and provide an option to set it manually if required.
After some offline discussion, we decided to align with PIL for now. The only difference should be that we should fail the transformation if the alpha channel is not the max value everywhere. This way we can implement the correct conversion as detailed in my top comment later without worrying about BC.
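In code, the decided behavior amounts to roughly this sketch (not the final implementation):

```python
import torch

def rgba_to_rgb(image: torch.Tensor) -> torch.Tensor:
    # align with PIL by simply dropping the alpha channel, but refuse to do so
    # silently if the alpha channel actually carries information
    rgb, alpha = image[:3], image[3]
    max_value = 255 if image.dtype == torch.uint8 else 1.0
    if not (alpha == max_value).all():
        raise RuntimeError("RGBA image has a non-trivial alpha channel")
    return rgb
```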
@pmeier You mean the rgba2rgb method, https://github.com/python-pillow/Pillow/blob/c58d2817bc891c26e6b8098b8909c0eb2e7ce61b/src/libImaging/Convert.c#L443-L453, right?
How do you see that it assumes a white background?
From the implementation we can say that the pointers `in` and `out` both address 4 values per pixel. With the last `in++` we skip the fourth input value (alpha), and with `*out++ = 255;` we set the fourth output value to 255.
Pixel size is 4 for RGB: https://github.com/python-pillow/Pillow/blob/95cff6e959bb3c37848158ed2145d49d49806a31/src/libImaging/Storage.c#L125-L129
> In such a case PIL falls back to Floyd-Steinberg dithering.
In the PIL docs they say dithering is used for conversions like RGB to P or 1 (https://github.com/python-pillow/Pillow/blob/92c26a77ca53a2bfbd8804f009c6c8755d0e5a43/src/PIL/Image.py#L921-L922), and the code does the same.
You are right. PIL also seems to only "strip" the alpha channel by setting it to 255 everywhere:
```python
import PIL.Image
import torch
from torchvision.transforms.functional import pil_to_tensor

image = PIL.Image.open("text2.png")

a = pil_to_tensor(image)[:3, ...]
b = pil_to_tensor(image.convert("RGB"))

torch.testing.assert_close(a, b)
```
Thus, the behavior is aligned with OpenCV. This doesn't change the plan from https://github.com/pytorch/vision/issues/5510#issuecomment-1061529561.
> You are right. PIL also seems to only "strip" the alpha channel by setting it to 255 everywhere [...]
Isn't "strip" the alpha channel means assuming a black background? If I take a RGBA image with transparent background and call Image.convert(), it will become RGB with black background. Which is different from white background acchieved from alpha_composite().
@bsun0802 Could you share the image so I can take a look?
@pmeier
Sorry, I was a little slow this morning. Note that what you call the "background color", i.e. the black region around the people, is not the same as the "background" we are talking about here; it is rather foreground.
In case there is transparency in the image, the background color is determined by the canvas you are drawing on. In your case that is the white of the notebook. If you put the same image on a green canvas, the background will be green. In other words, you are using the alpha channel here to mask out the black background the image originally has.
This issue is about what we want to do when we are forced to convert from RGBA to RGB but have no canvas to draw on. PIL just strips the alpha channel and thus reveals all colors in the image that were previously masked, i.e. the black regions in your case (note that there is also some white in the top left corner that you cannot see if you display the image on a white canvas).
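In PIL terms, "drawing on a canvas" looks like this (the file name is a placeholder):

```python
import PIL.Image

image = PIL.Image.open("your_image.png").convert("RGBA")  # placeholder path

# the "background" only becomes a concrete color once the image is composited onto a canvas
white = PIL.Image.new("RGBA", image.size, (255, 255, 255, 255))
green = PIL.Image.new("RGBA", image.size, (0, 255, 0, 255))
on_white = PIL.Image.alpha_composite(white, image).convert("RGB")
on_green = PIL.Image.alpha_composite(green, image).convert("RGB")

# .convert("RGB") without a canvas just drops alpha and reveals the previously masked colors
stripped = image.convert("RGB")
```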
This discussion started in https://github.com/pytorch/vision/pull/5500#discussion_r816503203 and @vfdev-5 and I continued offline.
PIL as well as our image reading functions support RGBA images
https://github.com/pytorch/vision/blob/95d418970e6dbf2e4d928a204c4e620da7bccdc0/torchvision/io/image.py#L16-L31
but our color transformations currently only support RGB images and ignore an extra alpha channel if present. This leads to wrong results. One thing we agreed upon is that these transforms should fail if anything other than 3 channels is detected.
Still, some datasets include non-RGB images so we need to deal with this for a smooth UX. Previously we implicitly converted every image to RGB before returning it from a dataset
https://github.com/pytorch/vision/blob/f9fbc104c02f277f9485d9f8727f3d99a1cf5f0b/torchvision/datasets/folder.py#L245-L249
Since we no longer decode images in the datasets, we need to provide a solution for the users here. I currently see two possible options:
1. Handle the conversion on the dataset side, i.e. convert every image to RGB while decoding. tensorflow-datasets uses this approach: https://github.com/tensorflow/datasets/blob/a1caff379ed3164849fdefd147473f72a22d3fa7/tensorflow_datasets/image_classification/imagenet.py#L105-L131
2. The most common non-RGB images in datasets are grayscale images. For example, the train split of ImageNet contains 19970 grayscale images. Thus, the users will need a `transforms.ConvertImageColorSpace("rgb")` in most cases anyway. If that supported RGBA to RGB conversions, the problem would also be solved. The conversion happens with this formula:

   `pixel_new = alpha * pixel_old + (1 - alpha) * background`

   where `pixel_{old|new}` is a single value from a color channel. Since we don't know `background`, we need to either make assumptions or require the user to provide a value for it. I'd wager a guess that in 99% of the cases the background is white, i.e. `background == 1`, but we can't be sure about that. Another issue with this is that the user has no option to set the background on a per-image basis in the transforms pipeline if that is needed.

In the special case of `alpha == 1` everywhere, the equation above simplifies to `pixel_new = pixel_old`, which is equivalent to stripping the alpha channel. We could check for that and only perform the RGBA to RGB transform if the condition holds or the user supplies a background color.
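A minimal sketch of what such a conversion could look like for a float tensor in [0, 1]; this is illustrative only, not a proposal for the final kernel or its signature:

```python
from typing import Optional

import torch

def convert_rgba_to_rgb(image: torch.Tensor, background: Optional[float] = None) -> torch.Tensor:
    # image: a (4, H, W) float tensor with values in [0, 1]
    rgb, alpha = image[:3], image[3:]
    if background is None:
        if not (alpha == 1).all():
            raise ValueError("alpha channel carries information, please pass `background`")
        # alpha == 1 everywhere: the formula degenerates to stripping the channel
        return rgb
    return alpha * rgb + (1 - alpha) * background
```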
cc @pmeier @vfdev-5 @datumbox @bjuncek