pytorch / vision

Datasets, Transforms and Models specific to Computer Vision
https://pytorch.org/vision
BSD 3-Clause "New" or "Revised" License

[FEEDBACK] Transforms V2 API #6753

Closed datumbox closed 5 months ago

datumbox commented 1 year ago

🚀 The feature

This issue is dedicated to collecting community feedback on the Transforms V2 API. Please review the dedicated blogpost where we describe the API in detail and provide an overview of its features.

We would love to get your thoughts, comments and input in order to improve the API and graduate it from prototype in the near future.

Please also check out https://github.com/pytorch/vision/issues/7319 where we collect feedback on some specific design decisions and also document which APIs may change in the future!


Code example using this image:

import PIL
from torchvision import io, utils
from torchvision.prototype import features, transforms as T
from torchvision.prototype.transforms import functional as F

# Defining and wrapping input to appropriate Tensor Subclasses
path = "COCO_val2014_000000418825.jpg"
img = features.Image(io.read_image(path), color_space=features.ColorSpace.RGB)
# img = PIL.Image.open(path)
bboxes = features.BoundingBox(
    [[2, 0, 206, 253], [396, 92, 479, 241], [328, 253, 417, 332],
     [148, 68, 256, 182], [93, 158, 170, 260], [432, 0, 438, 26],
     [422, 0, 480, 25], [419, 39, 424, 52], [448, 37, 456, 62],
     [435, 43, 437, 50], [461, 36, 469, 63], [461, 75, 469, 94],
     [469, 36, 480, 64], [440, 37, 446, 56], [398, 233, 480, 304],
     [452, 39, 463, 63], [424, 38, 429, 50]],
    format=features.BoundingBoxFormat.XYXY,
    spatial_size=F.get_spatial_size(img),
)
labels = features.Label([59, 58, 50, 64, 76, 74, 74, 74, 74, 74, 74, 74, 74, 74, 50, 74, 74])

# Defining and applying Transforms V2
trans = T.Compose(
    [
        T.ColorJitter(contrast=0.5),
        T.RandomRotation(30),
        T.CenterCrop(480),
    ]
)
img, bboxes, labels = trans(img, bboxes, labels)

# Visualizing results
viz = utils.draw_bounding_boxes(F.to_image_tensor(img), boxes=bboxes)
F.to_pil_image(viz).show()
pmeier commented 1 year ago

I think I found the lines that were raising the ValueError. As expected, they are checking for type equality instead of using isinstance. That is fairly uncommon and nothing that we can support from our side since the datapoints being subclasses of torch.Tensor is a core design decision of transforms v2.

You'll have to ask the torchmetrics team why the check is so restrictive and whether it is possible to relax it. As a workaround, you can always unwrap the datapoints into plain tensors with .as_subclass(torch.Tensor). Since you likely don't want to transform your sample further when you are computing the mAP, this should have no downside for you.
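For illustration, a minimal sketch of that workaround, assuming the torchvision 0.15 datapoints namespace (the prototype features namespace used in the example above behaves the same way):

import torch
from torchvision import datapoints

boxes = datapoints.BoundingBox(
    [[2, 0, 206, 253]],
    format=datapoints.BoundingBoxFormat.XYXY,
    spatial_size=(480, 640),
)

# Unwrap into a plain tensor before handing it to code that checks for exact
# type equality (type(x) is torch.Tensor) rather than using isinstance.
plain_boxes = boxes.as_subclass(torch.Tensor)
assert type(plain_boxes) is torch.Tensor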

akors commented 1 year ago

Sorry if this question is trivial or has already been answered, but I have read the following documentation but couldn't find an answer:

How does one write custom transforms using transforms.v2? I could not find any examples or documentation on the topic. I guess there's the Transforms class, but that has no public documentation that I could find and looks like a private interface.

Specifically, what is the canonical way to decide which transforms apply to which type of Datapoint? I hacked it together using isinstance checks for the Datapoint types, but this feels very brittle and weird and it does not handle Tensor (non-Datapoint) types at all.

import typing

import PIL.Image
import torch
import torchvision
from torchvision.transforms import v2 as T

class MyTransform:
    def __call__(self, sample):
        if isinstance(sample, typing.Sequence):
            # apply to all elements if the input is a sequence
            return tuple(self.do(i) for i in sample)
        else:
            return self.do(sample)

    def do(self, image):
        if isinstance(image, torch.Tensor):
            h_1, w_1 = image.shape[-2:]  # tensor images are (..., H, W)
        elif isinstance(image, PIL.Image.Image):
            w_1, h_1 = image.size  # PIL sizes are (W, H)

        # ...

        if isinstance(image, torchvision.datapoints.Mask):
            # disable interpolation for masks
            interpolation = T.InterpolationMode.NEAREST_EXACT
            antialias = False

        # ...

If it were possible to specify in the Compose constructor which of the transforms will apply, instead of having to code type checking logic into the transforms themselves, that would already be helpful.

pmeier commented 1 year ago

@akors

I could not find any examples or documentation on the topic. I guess there's the Transforms class, but that has no public documentation that I could find and looks like a private interface.

That is indeed the case right now and you haven't missed anything. We are not 100% sure yet which parts we want to expose, which is why the methods are private. If you don't mind making some slight changes later on, here is the rundown:

So in your case, the transformation could look like

class MyTransform(transforms.Transform):
    def _get_params(self, flat_inputs):
        # query_spatial_size is a private torchvision helper that extracts the
        # spatial size (H, W) from the flattened inputs
        spatial_size = query_spatial_size(flat_inputs)
        return dict(spatial_size=spatial_size)

    def _transform(self, inpt, params):
        h_1, w_1 = params["spatial_size"]

        if isinstance(inpt, torchvision.datapoints.Mask):
            # disable interpolation for masks
            interpolation = T.InterpolationMode.NEAREST_EXACT
            antialias = False

        # ...
nlgranger commented 1 year ago

Hi,

Is there a summary of all reasonably common use-cases / transformation pipelines somewhere? It would help to evaluate more systematically how well each proposal addresses them, in terms of performance, ease of use and legibility.

For instance, the datapoint class wrappers make it easier to write simple pipelines, but rather cumbersome for pairwise or batched transforms.

Also, the new API doesn't make it easier to fuse consecutive transforms. Implementing any kind of JIT support and then fusion will require a lot of low-level code and doesn't play nicely with anything other than Tensors.

In most projects with non-trivial transforms, I think it's a lot simpler and more efficient to define a custom MyTransform module which does the following in its forward:

  1. generate transformation descriptors -> available via torchvision get_params
  2. fuse (ex: merge affine-like transforms such as crop, rotate, flip) -> not available in torchvision
  3. apply transforms -> available via torchvision functional

It is very flexible, easy to profile and optimize, more explicit than datapoints (no function dispatch, no implicit conversions, no implicit batch behavior), and still relatively concise and fast to write.

The proposed API already allows that (one can mix Transform and the functional API), but I suggest the focus of the v2 updates be on making this leaner and easier.

For instance, by making get_params results more central in the API design and by adding glue functions, ex:

On a side note, here are a few missing functionalities that could be useful:

pmeier commented 1 year ago

Hey @nlgranger and sorry for the delay. I was OOO for two weeks.

For instance, the datapoint class wrappers make it easier to write simple pipelines, but rather cumbersome for pairwise or batched transforms.

Unless we are talking past each other "pairwise" and "batched" are two very different things.

Also the new api doesn't make it easier to fuse consecutive transforms. Implementing any kind of jit support and then fusion will require a lot of low-level code and doesn't play nice with anything other than Tensors.

That is true, but this was not a use case we were trying to improve. In general, the v2 transforms are not JIT scriptable with TorchScript, although we added support for it on all transforms that were already present in v1. We are still committed to keeping JIT scriptability for the functional API though. What v2 improves here is that we expose the kernels, i.e. the lowest-level implementations, for each "datapoint" explicitly. Meaning, if you have the F.resize dispatcher, you also have the F.resize_image_tensor, F.resize_bounding_box, ... kernels that only support plain tensors and can be used to fuse operations.

  1. generate transformation descriptors -> available via torchvision get_params

It was an intentional design decision to no longer need this in v2. The only reason some v1 transforms have a public, static get_params method is that v1 had no way to support joint transforms for more than one input. Thus, for non-classification tasks, one had to do something like

https://github.com/pytorch/vision/blob/8324c481dd4c3096697332d76fbdc9d912f7360b/references/segmentation/transforms.py#L61-L63
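(A rough, from-memory sketch of the kind of v1 boilerplate being referred to, not an exact copy of the linked reference code: the parameters are sampled once and then applied to every input through the functional API.)

import torch
import torchvision.transforms as T
import torchvision.transforms.functional as F

image = torch.rand(3, 256, 256)
target = torch.randint(0, 21, (1, 256, 256))

# Sample the crop parameters once, then apply the same crop to image and target.
crop_params = T.RandomCrop.get_params(image, output_size=(224, 224))
image = F.crop(image, *crop_params)
target = F.crop(target, *crop_params)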

A core design goal of v2 is to get rid of this limitation and support joint transformations natively. Thus, there is no general support for a public, static get_params anymore. For BC, it is still available on the transforms that had it in v1:

https://github.com/pytorch/vision/blob/8324c481dd4c3096697332d76fbdc9d912f7360b/torchvision/transforms/v2/_transform.py#L103-L107

For instance, by making get_params results more central in the API design and by adding glue functions, ex:

  • transform params to transform matrix

  • transform params fusion

  • transform matrix to affine_grid conversion

  • etc.

Correct me if I'm wrong here, but this seems like a recipe for affine transformations only. Since they are only a subset of the transformations we provide, we can't support that in general.

Carry over viewport information which tracks the position of the original image in the canvas after zoom out, crop, affine, etc.

Could you elaborate on that? What exactly do you want to track? It makes some sense to me for zoom-out and affine, but not so much for crop. You can create a bounding box that tracks that for you:

image = ...
height, width = image.shape[-2:]
bounding_box = datapoints.BoundingBox(
    [0, 0, width, height],
    format=datapoints.BoundingBoxFormat.XYWH,
    spatial_size=(height, width),
)
transform(image, bounding_box)

After the transformation, the new bounding box should contain the information you wanted.

Masks to long Tensor conversion

Could you elaborate here?

nlgranger commented 1 year ago

@pmeier Thank you very much for the detailed answer, it shines some light on the thought process behind the new API.

Unless we are talking past each other

We are ;-), let me rephrase:

My concern about the fusion of successive ops remains: it will be hard to implement a fuser for a user-defined transform. Admittedly that was already the case with v1.

Could you elaborate here?

Yes, I was thinking about semantic segmentation. Something like this, with a corresponding Transform class (not high priority, but nice to have):

def mask2indices(masks: Mask, labels: torch.Tensor, default=255):
    # masks: stacked binary masks of shape (*, N, H, W); labels: one class id per mask
    *s, n, h, w = masks.shape
    out = torch.full((*s, h, w), default, dtype=labels.dtype, device=labels.device)
    for i in range(n):
        out.masked_fill_(masks[..., i, :, :], labels[i])

    return out
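A hypothetical usage sketch, assuming masks is a stacked (N, H, W) boolean torchvision.datapoints.Mask and labels holds one class id per mask:

import torch
from torchvision.datapoints import Mask

# Three binary masks over a 4x4 image, with class ids 1, 2 and 7.
masks = Mask(torch.zeros(3, 4, 4, dtype=torch.bool))
masks[0, :2] = True  # first mask covers the top half
masks[1, 2:] = True  # second mask covers the bottom half
labels = torch.tensor([1, 2, 7])

index_mask = mask2indices(masks, labels)  # (4, 4) long tensor: top half -> 1, bottom half -> 2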
pmeier commented 1 year ago

The synchronized augmentation of multiple dependent samples (ex: contrastive methods, multi-crops, etc): it still seems complicated from what I gather

I'm sorry, but I don't really understand your point here. As stated above, batch transforms, i.e. transforms that need multiple samples, are not supported yet, but we are working on a design to enable them. We are aware that they are important, but we need a smooth UX that is an actual improvement over what we have in v1 for this.

If it goes beyond that, could you give me an example?

I wanted to also mention that some folks might want a form of broadcasting rules (same augmentation for a batch of frames in a video).

That is fully supported: if you put in a datapoints.Video of shape (*, T, C, H, W), where T indexes the individual frames, each frame will be transformed the same way. The * above denotes arbitrary leading batch dimensions, meaning you can put in whatever shape you want there and everything will be transformed the same way.
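A minimal sketch of that broadcasting behavior, assuming the torchvision 0.15 datapoints and v2 namespaces:

import torch
from torchvision import datapoints
from torchvision.transforms import v2

# A batch of 2 videos with 8 frames each: shape (B, T, C, H, W).
videos = datapoints.Video(torch.rand(2, 8, 3, 64, 64))

# One set of parameters is sampled per call, so every frame of every video
# in the batch receives exactly the same flip.
flipped = v2.RandomHorizontalFlip(p=1.0)(videos)
assert torch.equal(flipped, torch.flip(videos, dims=[-1]))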

vadimkantorov commented 1 year ago

There are also pure tensor functions in https://github.com/pytorch/vision/blob/main/torchvision/transforms/_functional_tensor.py, but the underscore suggests it's now private. Is it superseded by https://github.com/pytorch/vision/blob/main/torchvision/transforms/v2/functional/__init__.py ?

Is it so? As always, I'm a supporter of adding pure-tensor, no-dispatch functions, as they can be used even without buying into the v2 design (which is IMO overly complex, but I've already expressed that many times, so I won't go into details again).

Also agree with @nlgranger that for more custom pipelines, explicit get_params calls are more inspectable/explicit/clear and controllable (with respect to per-sample params, per-batch params or broadcastable params). Along these lines, it's also important to support a generator= argument wherever random values are generated (and thus only torch random generators should be used, not numpy and not python ones).

Also, a question: what's the motivation for still supporting the PIL image class everywhere in the transforms pipeline? I think this was important earlier because many transforms were available only in PIL, but that's less the case now. Also, torchvision could reconstruct the PIL object on the fly without copies and then get back a regular tensor. Dropping PIL image class support completely would simplify the API for users, especially if there are no longer perf reasons for it (especially now that torchvision.io.read_image exists).

NicolasHug commented 1 year ago

There are also pure tensor functions in https://github.com/pytorch/vision/blob/main/torchvision/transforms/_functional_tensor.py, but the underscore suggests it's now private

Yes, we've deprecated those files, which should have been private from the beginning.

Is it superseded by https://github.com/pytorch/vision/blob/main/torchvision/transforms/v2/functional/__init__.py ?

Yes, all the "pure tensor functions" you need are the low-level kernels ending in _tensor, e.g. resize_image_tensor(), pad_image_tensor(), etc. (those that work on videos, masks and bbox don't have the _tensor suffix but they still operate on tensors.)
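For illustration, a hedged sketch of calling such a kernel directly on a plain tensor (using the kernel names mentioned above; they bypass the datapoint dispatch entirely):

import torch
from torchvision.transforms.v2 import functional as F

img = torch.randint(0, 256, (3, 256, 256), dtype=torch.uint8)

# Low-level kernels: plain tensor in, plain tensor out, no dispatch on input type.
resized = F.resize_image_tensor(img, size=[224, 224], antialias=True)
padded = F.pad_image_tensor(resized, padding=[4, 4])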

Also, a question: what's the motivation of still supporting PIL images class everywhere through the transforms pipeline?

We wanted the transforms to be 100% backwards compatible. Today, the tensor backend is in general faster than (non-SIMD) PIL, but that's only because of recent improvements that were made to Resize(). At the time these transforms were designed this wasn't the case, and perf was taken into consideration too.

And also torchvision could reconstruct the PIL object without copies on the fly and then get back a regular tensor

No, the layout of PIL images is different from that of a pytorch tensor - there's always an alpha channel in PIL, even if the image is pure RGB. Unfortunately, we can't have a no-copy conversion in most cases.

Dropping completely PIL image class support would simplify the API for the users

No, this wouldn't make the API simpler for users - just the code simpler for us maintainers.

vadimkantorov commented 1 year ago

No, this wouldn't make the API simpler for users - just the code simpler for us maintainers.

I haven't stumbled upon an explanation in the docs of the choice between pure tensors and PIL images (both serving as image representations) and when one is preferable over the other (earlier on there was no image reading API besides PIL, but now torchvision.io.read_image exists). I bet new users are asking the same. So if PIL images are important for functionality and perf, it's important to explain this somewhere. If PIL images are supported only for backcompat and not intended for modern codebases, then I would propose communicating that more clearly via the docs, tutorials, examples...

Currently the docs pages have only these mentions of PIL:

Parameters: backend (string) – Name of the image backend. one of {‘PIL’, ‘accimage’}. The accimage package uses the Intel IPP library. It is generally faster than PIL, but does not support as many operations.


- https://pytorch.org/vision/stable/transforms.html:

Most transformations accept both PIL images and tensor images, although some transformations are PIL-only and some are tensor-only. The Conversion transforms may be used to convert to and from PIL images, or for converting dtypes and ranges.



So maybe supporting RGBA images for regular tensors / functions would be a solution (enabling zero-copy and simplifying the API), plus perf tests of all transforms against pillow-simd? A number of channels divisible by 4 may also be nicer for SSE code.

I think that with the computer vision community growing and more new users coming in, the baggage of backward compat will cost too much (especially if the perf problems are now fixed). Actively maintained packages would migrate, and unmaintained packages can still use old versions of torchvision for reproducibility purposes...

Basically, the same transform class (even a deterministic one like GaussianBlur) currently produces slightly different results depending on whether a PIL.Image or a Tensor is piped in - which is also not very obvious.
nlgranger commented 1 year ago

the baggage of backward compat will cost too much

I'm not against breaking BC either if it helps, but I'd like to point out that there are many unmaintained repositories out there corresponding to past publications. It is convenient to have a straightforward and documented upgrade path for them.

Ignasijus commented 1 year ago

Hello, I see that torchvision.datapoints supports four types of datapoints:

I want to ask why there is no 'Keypoint' type. A keypoint is a point with x and y coordinates. A 'Keypoint' or 'Keypoints' (for many points) type would be very useful in this new V2 API.

NicolasHug commented 1 year ago

Yeah, technically there's nothing that prevents us from adding a Keypoint datapoint and adding support for it in the transforms. We have rather limited support for keypoint detection (compared e.g. to bbox detection), so we haven't prioritized Keypoint support for now, but this isn't set in stone. We'd love to know more about your use-cases @Ignasijus.

Ignasijus commented 1 year ago

Thank you for your answer @NicolasHug. Well, a keypoint is quite a common label in computer vision tasks, for example face keypoints like the nose, eyes, etc. And not only for detection, but also for extracting features from a feature map at particular coordinates. I don't think it's a good idea to only prioritize the models and datasets that are available in PyTorch, because many people use PyTorch for custom professional projects with custom datasets, and applying a geometric transformation to both an image and positional labels on that image at the same time is quite common.

eirikeve commented 11 months ago

Hi! The transforms v2 API looks very nice. I'm seeing some issues with datasets.wrap_dataset_for_transforms_v2 when using a DataLoader with > 0 workers.

Is this a known issue (or am I using it wrong)?

Here's a snippet to reproduce it. It assumes you've got COCO downloaded at the specified paths.

from torch.utils.data import DataLoader
from pathlib import Path
from torchvision import datasets

dataset = datasets.wrap_dataset_for_transforms_v2(
    datasets.CocoDetection(
        "/path/to/coco/images/train2017",
        "/path/to/coco/annotations/instances_train2017.json",
        transforms=None,
    )
)

dataloader = DataLoader(dataset, batch_size=1, num_workers=2)

batch = next(iter(dataloader))

Expected behaviour: The dataloader yields a batch of data

Observed behaviour: The DataLoader fails to set up the workers due to a pickling error.

Traceback (most recent call last):
  File "<string>", line 17, in <module>
  File "/Users/eivester/miniforge3/envs/ball-detection/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 441, in __iter__
    return self._get_iterator()
           ^^^^^^^^^^^^^^^^^^^^
  File "/Users/eivester/miniforge3/envs/ball-detection/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 388, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/eivester/miniforge3/envs/ball-detection/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 1042, in __init__
    w.start()
  File "/Users/eivester/miniforge3/envs/ball-detection/lib/python3.11/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
                  ^^^^^^^^^^^^^^^^^
  File "/Users/eivester/miniforge3/envs/ball-detection/lib/python3.11/multiprocessing/context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/eivester/miniforge3/envs/ball-detection/lib/python3.11/multiprocessing/context.py", line 288, in _Popen
    return Popen(process_obj)
           ^^^^^^^^^^^^^^^^^^
  File "/Users/eivester/miniforge3/envs/ball-detection/lib/python3.11/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/Users/eivester/miniforge3/envs/ball-detection/lib/python3.11/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/Users/eivester/miniforge3/envs/ball-detection/lib/python3.11/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/Users/eivester/miniforge3/envs/ball-detection/lib/python3.11/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'coco_dectection_wrapper_factory.<locals>.wrapper'

Edit: Here's the versions I've installed:

torch==2.0.1
torchvision==0.15.2

_Edit 2: For anyone having similar issues, here's a v2 wrapper around CocoDetection using the methods from torchvision/datapoints/_dataset_wrapper.py. Works as expected for DataLoaders with multiple workers_:

from pathlib import Path
from typing import Any, Callable, Tuple

import torch
from pycocotools import mask
from torch.utils.data import DataLoader
from torchvision import datapoints
from torchvision.datapoints._dataset_wrapper import (
    list_of_dicts_to_dict_of_lists,
)
from torchvision.datasets import CocoDetection
from torchvision.transforms.v2 import functional as F

class CocoDetectionV2(CocoDetection):
    def __init__(
        self, root: str, annFile: str, transforms: Callable[..., Any] | None = None
    ) -> None:
        super().__init__(root, annFile)
        self.v2_transforms = transforms

    def __getitem__(self, index: int) -> Tuple[Any, Any]:
        sample = super().__getitem__(index)
        sample = self.wrapper(index, sample)
        if self.v2_transforms is not None:
            sample = self.v2_transforms(*sample)
        return sample

    def segmentation_to_mask(self, segmentation, *, spatial_size):
        """Copied from `torchvision/datapoints/_dataset_wrapper.py`"""

        segmentation = (
            mask.frPyObjects(segmentation, *spatial_size)
            if isinstance(segmentation, dict)
            else mask.merge(mask.frPyObjects(segmentation, *spatial_size))
        )
        return torch.from_numpy(mask.decode(segmentation))

    def wrapper(self, idx, sample):
        """Copied from `torchvision/datapoints/_dataset_wrapper.py`"""
        image_id = self.ids[idx]

        image, target = sample

        if not target:
            return image, dict(image_id=image_id)

        batched_target = list_of_dicts_to_dict_of_lists(target)

        batched_target["image_id"] = image_id

        spatial_size = tuple(F.get_spatial_size(image))
        batched_target["boxes"] = F.convert_format_bounding_box(
            datapoints.BoundingBox(
                batched_target["bbox"],
                format=datapoints.BoundingBoxFormat.XYWH,
                spatial_size=spatial_size,
            ),
            new_format=datapoints.BoundingBoxFormat.XYXY,
        )
        batched_target["masks"] = datapoints.Mask(
            torch.stack(
                [
                    self.segmentation_to_mask(segmentation, spatial_size=spatial_size)
                    for segmentation in batched_target["segmentation"]
                ]
            ),
        )
        batched_target["labels"] = torch.tensor(batched_target["category_id"])  # type: ignore
        return image, batched_target

def collate(batch):
    return tuple(zip(*batch))

if __name__ == "__main__":
    data_dir = "/path/to/coco"

    dataset = CocoDetectionV2(
        str(Path(data_dir, "images/train2017")),
        str(Path(data_dir, "annotations/instances_train2017.json")),
        transforms=None,
    )

    loader = DataLoader(dataset, num_workers=2, collate_fn=collate)
    print(next(iter(loader)))
pmeier commented 11 months ago

@eirikeve

There is one detail missing from the report: you are on macOS, correct? We were unable to reproduce it at first, since there is a difference in the default behavior of the DataLoader between macOS / Windows and Linux. By default, Linux uses fork to create a new process, while macOS and Windows use spawn, which requires that the whole pipeline is pickleable. See https://pytorch.org/docs/stable/data.html#platform-specific-behaviors for details.

To reproduce on Linux, one needs:

import multiprocessing

from torch.utils.data import DataLoader

def no_collate(batch):
    return batch

# `dataset` is the wrapped CocoDetection dataset from the snippet above
dataloader = DataLoader(
    dataset,
    batch_size=1,
    num_workers=2,
    collate_fn=no_collate,
    multiprocessing_context=multiprocessing.get_context("spawn"),
)

Is this a known issue (or am I using it wrong)?

No and no. This is a missing feature on our side. Let me investigate how much work it will take to make the wrapped dataset pickleable and get back to you. Note that this is not just the dataset wrapper, but also all transforms.

eirikeve commented 11 months ago

@pmeier

Yes, I'm on MacOS. Thanks for the feedback!

csmotion commented 11 months ago

Howdy all,

I'm attempting to update the TorchVision Object Detection Finetuning Tutorial to use torchvision.transforms.v2 (instead of the earlier v0.8.2 torchvision/reference/detection/transforms.py) as specified here: https://stackoverflow.com/questions/73198329/no-module-named-engine-error-in-pytorch-tutorial.

Problem: the transforms applied when calling self.transforms (in the custom Dataset class) are not applied consistently to both the image and the target. From what I read here, the transforms should be applied consistently - is that correct, or does that not hold for random transforms? https://pytorch.org/blog/extending-torchvisions-transforms-to-object-detection-segmentation-and-video-tasks/

Versions: torch 2.0.1+cu117, torchvision 0.15.2+cu117, ipykernel 6.25.1, ipython 8.14.0, jupyter_client 8.3.0, jupyter_core 5.3.1

Relevant code snippets:

import torchvision.transforms.v2 as T

def get_transform(train:bool=False):
    transforms = []
    transforms.append(T.ToTensor())
    if train:
        transforms.append(T.RandomHorizontalFlip())
        transforms.append(T.RandomVerticalFlip())
        transforms.append(T.RandomResize(min_size=400, max_size=840))
        # transforms.append(T.RandomAdjustSharpness(sharpness_factor=0.75))
    return T.Compose(transforms)

In the custom Dataset's __getitem__:

if self.transforms is not None:
    # Results in different transforms applied to image and target
    image, target = self.transforms(image, target)

    # Try individual application to see if it's something it doesn't like about target dict (same result)
    image, masks, boxes, labels = self.transforms(image, masks, boxes, labels)

It makes sense that the v0.8.7 version of transforms works as expected (I see all the transforms take in image, target arg pairs and operate on both).

Any recommendations on how to proceed? I'm excited to use transforms.v2 though - augmenting data will be a breeze!

Thanks

pmeier commented 11 months ago

@csmotion The code you posted looks correct. I suspect what is missing is that you didn't convert the items inside target to datapoints. That is what is needed for the transforms to handle them correctly. Have a look at the datapoints FAQ and an end-to-end example using transforms v2.
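For future readers, a minimal sketch of the kind of conversion meant here, assuming the torchvision 0.15 datapoints API and a target dict with "boxes" and "masks" tensors (the names are just for illustration):

import torch
from torchvision import datapoints
from torchvision.transforms.v2 import functional as F

def wrap_target(image, target):
    # Wrap the raw tensors so the v2 transforms know to transform them together
    # with the image instead of passing them through untouched.
    target["boxes"] = datapoints.BoundingBox(
        target["boxes"],
        format=datapoints.BoundingBoxFormat.XYXY,
        spatial_size=tuple(F.get_spatial_size(image)),
    )
    target["masks"] = datapoints.Mask(target["masks"])
    return image, target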

csmotion commented 11 months ago

@pmeier You are 100% correct, I missed it in the blog post (RIP). Much appreciated!

pmeier commented 10 months ago

@eirikeve With the upcoming release, the v2 wrapper is now also pickleable and thus will work with a spawn multiprocessing context, as is the default on macOS. The patch should be available as a nightly release in a few hours.

fvgt commented 10 months ago

I am trying to use the CutMix augmentation following the guide on the web page: https://pytorch.org/vision/main/auto_examples/v2_transforms/plot_cutmix_mixup.html#sphx-glr-auto-examples-v2-transforms-plot-cutmix-mixup-py However, I get the error: 'module 'torchvision.transforms.v2' has no attribute 'CutMix''

vfdev-5 commented 10 months ago

@fvgt you may need to install torchvision from source: https://github.com/pytorch/vision/blob/main/CONTRIBUTING.md#development-installation

pmeier commented 10 months ago

@fvgt you are looking at the documentation for the main branch, but you are likely using a stable release. CutMix and MixUp will only become available in the next release. If you are not restricted by the version you use, you can install a nightly release that already has this implemented.

fvgt commented 10 months ago

That was my first intuition as well and I tried using the nightly version, using the following command:

pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu118

Unfortunately, I still got the same error.

Edit: I will follow the guide posted by @vfdev-5 and will check again. Thank you very much for the quick replies!

pmeier commented 10 months ago

@fvgt This is most likely an environment issue. Of course it should work as well, but there is no need to build from source. The nightly build is sufficient. Please open a separate issue and post the results of the environment collection script.

tianlianghai commented 10 months ago

CutMix only supports classification tasks for now - I hope v2.CutMix will support segmentation tasks too! Thanks.

orena1 commented 9 months ago

I have a batch of videos and I want to run v2.RandomPerspective on each video differently. Currently, if I use this format:

self.train_video_pipeline = torch.nn.Sequential(
    v2.RandomPerspective(0.5, 1),
    torchvision.transforms.Normalize(0.15228, 0.0794),
)

All videos will have the same transformation.

train_video_pipeline = torch.nn.Sequential(
    v2.RandomPerspective(0.2, 0.6),
    torchvision.transforms.Normalize(0.15228, 0.0794),
)

out = train_video_pipeline(1 + torch.zeros((10, 123, 1, 100, 100)))  # e.g. 10 videos, each with 123 time-steps, one channel

import matplotlib.pyplot as plt
axs = plt.subplots(1, 3, figsize=(13, 4))[1]

axs[0].imshow(out[0, 7, 0].detach().numpy())
axs[1].imshow(out[1, 7, 0].detach().numpy())
axs[2].imshow(out[2, 7, 0].detach().numpy())

Is there any way to have a different transformation for each video?


NicolasHug commented 9 months ago

Hi @orena1, the main way to do that is to unbatch, call the random transforms individually on all samples (or use get_params + the functional API), and then re-batch the samples.

This is something we'd like to support more transparently, perhaps at least by providing some kind of UnBatchThenCallThenRebatch transform helper (name TBD). But because of the way random parameters are sampled, and because each randomization leads to different parametrizations, there is often no way to process an entire batch efficiently.
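Applied to the example above, a hedged sketch of the unbatch / transform / re-batch approach (each video gets freshly sampled perspective parameters, while all frames within one video share them):

import torch
from torchvision.transforms import v2

per_video_pipeline = v2.Compose([
    v2.RandomPerspective(0.5, 1.0),
    v2.Normalize(mean=[0.15228], std=[0.0794]),
])

videos = 1 + torch.zeros((10, 123, 1, 100, 100))  # 10 videos, 123 frames, 1 channel

# Unbatch, transform each (T, C, H, W) video on its own, then re-batch.
out = torch.stack([per_video_pipeline(video) for video in videos])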

Axel-Jacobsen commented 8 months ago

Howdy!

The TVTensors + V2 transforms are a pretty cool addition. I'm finding them easy to integrate into one of my current projects, which is great.

I found and am using v2.ConvertBoundingBoxFormat, but haven't found anything that would e.g. normalize bounding box coordinates to the size of the image. E.g., if the image is (100px, 100px) and a box is centered at (50px, 50px) with (w, h) = (25px, 25px), the normalized coordinates would be (xc, yc, w, h) = (0.5, 0.5, 0.25, 0.25). Normalization of bbox coordinates is frequent in object detection, e.g. in the YOLO family of networks.

pmeier commented 8 months ago

@Axel-Jacobsen

Is this already implemented somewhere?

No.

Perhaps there is a good reason why it isn't in there

Yup. Right now we hard-assume that bounding boxes are in absolute coordinates. This makes it easier to implement the corresponding kernels:

  1. We don't need an extra flag on the kernel and subsequently on the bounding box instance that indicates whether the coordinates are absolute or relative.
  2. We don't need extra branching logic inside the kernel to account for both use cases.

From your comment I get that normalized bounding boxes are only required for the model. If that is true, I suggest you implement a custom NormalizeBoundingBoxes transform and just put it at the end of your pipeline. Something along the lines of

import torch
from torchvision import tv_tensors

def normalize_bounding_boxes(bounding_boxes: tv_tensors.BoundingBoxes, dtype=torch.float32) -> torch.Tensor:
    canvas_height, canvas_width = bounding_boxes.canvas_size
    # The .as_subclass(torch.Tensor) is not required, but only a performance improvement
    # See https://pytorch.org/vision/stable/auto_examples/transforms/plot_tv_tensors.html#why-is-this-happening
    return (
        bounding_boxes.as_subclass(torch.Tensor)
        .to(dtype)
        .div_(
            torch.tensor(
                [canvas_width, canvas_height, canvas_width, canvas_height],
                dtype=dtype,
                device=bounding_boxes.device,
            )
        )
    )

class NormalizeBoundingBoxes(torch.nn.Module):
    def forward(self, image, target):
        target["boxes"] = normalize_bounding_boxes(target["boxes"])
        return image, target

image = tv_tensors.Image(torch.rand(3, 100, 100))
bounding_boxes = tv_tensors.BoundingBoxes(
    [[50, 50, 25, 25]], 
    format=tv_tensors.BoundingBoxFormat.CXCYWH, 
    canvas_size=(100, 100),
)
target = {"boxes": bounding_boxes}

transform = NormalizeBoundingBoxes()
transformed_sample = transform(image, target)

torch.testing.assert_close(
    transformed_sample[1]["boxes"],
    torch.tensor([[0.5, 0.5, 0.25, 0.25]]),
)

This requires you to hardcode the schema of the samples that you want to pass. If you need a version of the transform that works for arbitrary sample schemas, as is the default for all builtin v2 transforms, you can do:

from torchvision.transforms import v2 as transforms

class NormalizeBoundingBoxes(transforms.Transform):
    _transformed_types = (tv_tensors.BoundingBoxes,)

    def _transform(self, input, params):
        return normalize_bounding_boxes(input)

But be aware that we are using private parts of the API here, and there are no BC guarantees for them.
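Building on the definitions above, a possible usage note for the Transform-based version: since it returns plain tensors, it is best kept as the last step of the pipeline, e.g.:

pipeline = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    NormalizeBoundingBoxes(),  # last step: the boxes become plain tensors after this
])
image, target = pipeline(image, target)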

Axel-Jacobsen commented 8 months ago

@pmeier sounds good! I appreciate the quick and thorough reply. I'll give this a go in my project.

EricThomson commented 6 months ago

Thanks for making this new API for transformations, it's great!

I was sent here from a link on the ToDtype page, as I'm trying to figure out the intent and consequences of the scale param.

My understanding was that (for instance for a float32 tv_tensor) it was supposed to scale values to [0, 1]. This is based partly on the page's description of the scale arg, where it links to the section in the docs on dtype and expected value range, which is [0, 1] for float32. But when I feed it a tensor with values outside of that range, it returns values in the same out-of-range interval.

I dug around in the implementation a bit, and while there is some checking to see if the data types support scaling, I'm not seeing any actual computational consequences of the scale param: https://github.com/pytorch/vision/blob/d23430765b5df76cd1267f438f129f51b7d6e3e1/torchvision/transforms/v2/_misc.py#L206

But there is a good chance I'm just missing something too 😄 . At any rate, I can implement it myself easily enough, but I was confused by scale and what it is doing.

vfdev-5 commented 6 months ago

@EricThomson ToDtype calls the functional implementation F.to_dtype here: https://github.com/pytorch/vision/blob/d23430765b5df76cd1267f438f129f51b7d6e3e1/torchvision/transforms/v2/_misc.py#L275C40-L275C40, where the scale arg is used. Finally, this code is run for images: https://github.com/pytorch/vision/blob/d23430765b5df76cd1267f438f129f51b7d6e3e1/torchvision/transforms/v2/functional/_misc.py#L210. You can see there is a quick return when scale is False: https://github.com/pytorch/vision/blob/d23430765b5df76cd1267f438f129f51b7d6e3e1/torchvision/transforms/v2/functional/_misc.py#L214-L215

NicolasHug commented 6 months ago

But when I feed it a tensor with values outside of that range, it returns values in the same deviant range

@EricThomson if you pass a torch.float32 tensor to ToDtype(torch.float32), then scaling won't happen regardless of whether you set scale=True. The conversion (and scaling) will only happen if the input tensor is not of float dtype.
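A small sketch of both behaviors, assuming the ToDtype signature discussed here:

import torch
from torchvision.transforms import v2

to_float = v2.ToDtype(torch.float32, scale=True)

img_u8 = torch.randint(0, 256, (3, 4, 4), dtype=torch.uint8)
out_u8 = to_float(img_u8)  # dtype converted, values scaled into [0, 1]

img_f32 = 10 * torch.rand(3, 4, 4)  # already float32, values outside [0, 1]
out_f32 = to_float(img_f32)  # dtype already matches: returned as-is, no scaling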

To convert a float tensor from an arbitrary scale to another, you could use Normalize instead.

EricThomson commented 6 months ago

Thanks @vfdev-5 for pointing out in more detail how kernel dispatching works (I'm embarrassed I didn't go deeply enough :flushed: ). The logic becomes clear in to_dtype_image()

@NicolasHug thanks for explaining in more detail and for the suggestion. I'm not sure Normalize is what I want, as that would push to a certain std/mean, while what I really want is scaling to [0, 1].

Clearly I was trying to get ToDtype to do something outside its current use: I can create a transform to scale my data for floats. That said, I'm not sure if folks would be against adding scaling for floats in to_dtype_image() when scale is set to True at some point in the future?

NicolasHug commented 6 months ago

what I really want is scaling to [0,1].

Normalize just returns (x - mean) / std so you can use it to linearly map any interval [a, b] into [c, d]. But I acknowledge that it can potentially be counter-intuitive to use.

adding scaling for floats in to_dtype_image() when scale is set to True at some point in the future?

To clarify the feature request: you mean converting from an arbitrary scale into [0, 1], where the arbitrary scale of the input x is determined by x.min() and x.max()?

EricThomson commented 6 months ago

@NicolasHug nice! I can piggyback on Normalize, with mean = min() and std = max() - min(). Thanks for the suggestion -- I'll just put a comment in my code about what I'm doing so people aren't confused. :smile:
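A hypothetical sketch of that trick, assuming a single-channel float image:

import torch
from torchvision.transforms import v2

x = torch.tensor([[[-2.0, 0.0, 6.0]]])  # shape (1, 1, 3): one channel, arbitrary value range

lo, hi = x.min().item(), x.max().item()
# (x - mean) / std with mean = min and std = max - min maps [min, max] onto [0, 1].
rescaled = v2.Normalize(mean=[lo], std=[hi - lo])(x)  # tensor([[[0.0, 0.25, 1.0]]])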

In terms of the feature request, yes that is what I was suggesting.

NicolasHug commented 5 months ago

Thank you so much everyone for your input and feedback. The V2 transforms are now stable and part of the latest torchvision release https://github.com/pytorch/vision/releases/tag/v0.17.0.

I'll close this issue as it's getting quite big and somewhat outdated now, but we'd still love to hear from you! Please feel free to open new issues with any feedback or feature requests you may have!