pytorch / vision

Datasets, Transforms and Models specific to Computer Vision
https://pytorch.org/vision
BSD 3-Clause "New" or "Revised" License

Write text on images as an augmentation #5791

Open · dmenig opened this issue 2 years ago

dmenig commented 2 years ago

🚀 Torchvision GPU compatible text writing on images

Hi.

Right now, I believe that if you want to write text on a GPU tensor, you have to do it in CPU memory.

This is unfortunate, since writing text is a very good augmentation in some cases, for example when the input data carries burned-in timestamps. Also, the most efficient loading libraries (for example NVIDIA DALI or Decord) use the GPU for decoding, so converting a tensor back to a numpy array sacrifices the advantage they offer for "large" training (when the whole dataset doesn't fit in RAM).

I think it'd be great if writing random text on an image was a torchvision feature :D

Alternatives

Convert the tensor back to a numpy array and use OpenCV.
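
For reference, that round trip looks roughly like this (a sketch, assuming a uint8 CHW RGB tensor; the helper name, origin, and font choices are just illustrative):

```python
import cv2
import torch

def put_text_cpu(img: torch.Tensor, text: str) -> torch.Tensor:
    # CHW uint8 tensor on GPU -> contiguous, writable HWC numpy array on CPU
    arr = img.permute(1, 2, 0).cpu().numpy().copy()
    # cv2.putText draws in place on the host array
    cv2.putText(arr, text, (10, 30), cv2.FONT_HERSHEY_SIMPLEX,
                fontScale=1.0, color=(255, 255, 255), thickness=2)
    # back to CHW on the original device -- the two transfers are the cost
    return torch.from_numpy(arr).permute(2, 0, 1).to(img.device)
```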

cc @vfdev-5 @datumbox

datumbox commented 2 years ago

@hyperfraise Thanks for the suggestion.

You are right to say that writing text on images is not currently supported. You make a good point about this augmentation being useful for videos with timestamp data, but we don't currently support a task that could benefit significantly from this specific augmentation. One key concern for me is that such an augmentation is very problem specific, and thus it would be hard to write a good generic implementation that supports placing the text in multiple places, handling font configs, etc. I could be wrong though.

@pmeier @vfdev-5 do you have any thoughts about this?

vfdev-5 commented 2 years ago

I think this request can be split into 2:

  1. draw random text on image as augmentation
  2. support this augmentation on GPU

For 1), one can use Pillow (https://pillow.readthedocs.io/en/stable/reference/ImageDraw.html#example-draw-multiline-text), and we have to figure out an appropriate API for the augmentation.
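
For illustration, a minimal version of 1) with Pillow might look like this (the helper name and fixed position are placeholders, not a proposed API; Pillow's default bitmap font is used, so no font file is needed):

```python
import torch
from PIL import ImageDraw
from torchvision.transforms.functional import pil_to_tensor, to_pil_image

def draw_text_pil(img: torch.Tensor, text: str, xy=(10, 10)) -> torch.Tensor:
    # render on CPU via PIL, then convert back to a CHW uint8 tensor
    pil_img = to_pil_image(img.cpu())
    ImageDraw.Draw(pil_img).text(xy, text, fill=(255, 255, 255))
    return pil_to_tensor(pil_img).to(img.device)
```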

For 2), we can think of how to efficiently blend image and text:

```
in_image[gpu] --------------------------------------------------------------->(blend on gpu) --> aug image[gpu]
                                                                                      ^
text --> (generate image) --> text_image[cpu] --> (move on gpu) --> text_image[gpu] --|
```
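
A rough sketch of that pipeline, assuming the text image is rendered with Pillow as RGBA on the CPU and its alpha channel is used as the blend weight (all function names here are illustrative):

```python
import numpy as np
import torch
from PIL import Image, ImageDraw

def render_text_rgba(text: str, size, xy=(10, 10)) -> torch.Tensor:
    # text --> (generate image) --> text_image[cpu]; size is (W, H)
    overlay = Image.new("RGBA", size, (0, 0, 0, 0))
    ImageDraw.Draw(overlay).text(xy, text, fill=(255, 255, 255, 255))
    return torch.from_numpy(np.array(overlay)).permute(2, 0, 1)  # 4xHxW uint8

def blend_on_gpu(in_image: torch.Tensor, text_rgba: torch.Tensor) -> torch.Tensor:
    # text_image[cpu] --> (move on gpu), then (blend on gpu)
    text_rgba = text_rgba.to(in_image.device).float() / 255.0
    rgb, alpha = text_rgba[:3], text_rgba[3:4]
    out = alpha * rgb + (1.0 - alpha) * (in_image.float() / 255.0)
    return (out * 255.0).round().to(torch.uint8)
```

Since Pillow antialiases the glyph edges, the fractional alpha values there give a smooth blend at the borders, while fully opaque glyph pixels overwrite the image.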
pmeier commented 2 years ago

For 2), we can think of how to efficiently blend image and text:

I don't think we can blend in the traditional sense here. The text has to be in the foreground while everything else should be ignored.

vfdev-5 commented 2 years ago

@pmeier think of the image with text as a mask
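
i.e., something like this sketch, where a boolean glyph mask (rendered on CPU alongside the text colors, then moved to the GPU) decides which pixels come from the text:

```python
import torch

def composite(img: torch.Tensor, text_rgb: torch.Tensor,
              mask: torch.Tensor) -> torch.Tensor:
    # mask: 1xHxW bool, True where a glyph pixel was drawn.
    # The mask broadcasts over the channel dim, so text pixels replace the
    # image and everything else is passed through untouched (foreground only).
    return torch.where(mask, text_rgb, img)
```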

datumbox commented 2 years ago

@pmeier @vfdev-5 I think the transform is doable at the low-level kernel level; we will effectively need an API similar to CV2/PIL. Where it gets messy, I think, is on the Transform class side: how do we parameterize such a class, and how do we provide information about the text content, its placement, fonts, and so on?

It's going to be very hard to create a super generic class that supports all the necessary options. Perhaps what we could do is aim for a low-level kernel and let people write their own classes. But I think I would leave this work for after the new Transform API is completed.
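
To make that split concrete, a user-written class on top of a low-level kernel could look roughly like this (a sketch only; `draw_text` is a hypothetical kernel, not an existing torchvision op):

```python
import random
import torch

class RandomTimestamp(torch.nn.Module):
    """Illustrative user-side transform; draw_text is a hypothetical
    low-level kernel with signature draw_text(img, text, xy)."""

    def __init__(self, p: float = 0.5):
        super().__init__()
        self.p = p

    def forward(self, img: torch.Tensor) -> torch.Tensor:
        if random.random() >= self.p:
            return img
        # random HH:MM:SS string at a random position in the top-left quadrant
        text = f"{random.randrange(24):02d}:{random.randrange(60):02d}:{random.randrange(60):02d}"
        xy = (random.randrange(max(1, img.shape[-1] // 2)),
              random.randrange(max(1, img.shape[-2] // 2)))
        return draw_text(img, text, xy)  # hypothetical kernel
```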

iranroman commented 3 months ago

Has there been any progress on this issue? I think being able to write text on torchvision images (i.e. torch tensors) would be extremely useful.