Closed datumbox closed 9 months ago
Could you post the link to the blog post in this thread when it becomes available? I am not sure where else to look to find it.
Can you provide a good starting point for getting an overview of the current state of Transforms V2?
Or is reading through https://github.com/pytorch/vision/tree/main/torchvision/prototype/transforms and https://github.com/pytorch/vision/projects/5 the best approach for now?
@rsokl @jangop We got a blogpost in the pipeline that provides an overview. We are waiting for marketing to publish. I'll post the link here once we do. Until then those 2 references are the best places to look. There is no documentation because the API was being modified but all transforms receive the same exact parameters as V1.
@rsokl @jangop we published the blogpost. Looking forward to your input.
Hi all,
I just read the blog post and have been exploring the prototype's internals for the last few of days: torchvision.transforms
v2 looks great! I'd like to share my thoughts with you all.
First, let me summarize some of the primary features that are offered by v2. This will, in part, help me make sure that I have a clear understanding of things. v2 includes:
_Feature
class.Transform
class. Here, an implementation draws random parameters within _get_params
, and an individual tensor/feature is transformed by _transform
.I had been doing a review of augmentation libraries, including kornia, albumentations, and augly, when I came across v2 for torchvision. To me, your new API is the simplest, most capable, and the easiest to extend. I particularly like how simple Transform
is and that you do not rely on sprawling class hierarchies to dispatch functionality.
So first of all, thank you for all of the hard work that you have been doing on this effort. The PyTorch community is fortunate to benefit from this excellent work β€οΈ.
There are two features that I would like to propose. I would be happy to assist with these if there is interest.
torch.Generator
to transforms.In recent years, NumPy has completely revised their PRNG API to avoid global random state (here is a great post on good practices with NumPy's generators). JAX avoids mutable RNG objects altogether. PyTorch provides torch.Generator
to users to to make randomness local and "non-spooky", but many libraries prevent users from utilizing this capability.
I am proposing that Transform
enable users to optionally pass in a Generator
to the forward pass so that torchvision transform pipelines can be made to be isolated from global entropy and thus support more reproducible workflows. This reproducibility is especially useful in the context of performing testing & evaluation β the specific sequence of data transformations performed should be able to be isolated from whether or not a model is using dropout in its forward pass.
The local generator (whose default value is PyTorch's global generator) could be passed to the forward
method and piped to the _get_params
method. This assumes that all random number generation occurs in self._get_params
. The following would be a compatibility-preserving modification to Transform
; by default the global seed would still control stochasticity in the transforms.
from torch import Generator, Tensor, default_generator
import torch.nn as nn
class Transform(nn.Module):
def _get_params(self, flat_inputs: List[Any], *, generator: Generator = default_generator) -> Dict[str, Any]:
...
def forward(self, *inputs: Any, generator: Generator = default_generator) -> Any:
# the only modification
params = self._get_params(flat_inputs, generator=generator)
and transforms that implement _get_params
would replace calls like
# e.g. replace calls like
angle = float(torch.empty(1).uniform_(0.0, 180.).item())
with
# specifying the device is, unfortunately, necessary: https://github.com/pytorch/pytorch/issues/79018
angle = float(torch.empty(1, device=generator.device).uniform_(0.0, 180., generator=generator).item())
A transform like Compose
would have to be modified as well. Currently, it supports a sequence of callables that are assumed to accept a single positional argument. It could be assumed that only instances of Transform
involve stochasticity and will be passed the random generator. In this case, Compose
would look like:
class Compose(Transform):
# __init__ is unchanged
def forward(self, *inputs: Any, generator: Generator = default_generator) -> Any:
sample = inputs if len(inputs) > 1 else inputs[0]
for transform in self.transforms:
sample = transform(sample) if not isinstance(transform, Transform) else transform(sample, generator=generator)
return sample
It would be straightforward to document this behavior to users β that only instances of Transform
are passed the generator β so that they know how to opt-in to having the generator be passed to their custom transforms. And, again, this would be compatible with the old nn.Module
transforms.
An example of this in practice would be:
from torch import Generator
rng = Generator.manual_seed(0)
trans = T.Compose(
[
T.ColorJitter(contrast=0.5),
T.RandomRotation(30),
T.CenterCrop(480),
]
)
img, bboxes, labels = trans(img, bboxes, labels, generator=rng)
Another nice thing about this is that specific fail cases that occur during training/testing can be reproduced in an isolated way; _get_params(dummy_img, generator=rng)
can be used to iterate the generator's state to "replay" a sequence of transformations without have to redo all of the compute. Whereas this would not work if the model and the transforms both affect and derive from global state.
Feature
to the public API and adding a protocol for dispatching transforms on 3rd party featuresThis request a bit more ambitious, but I think that it could have a big pay-off.
The crown jewel of torchvision.transforms
v2 is its added support for features like bounding boxes and segmentation masks. Presently, the development of new features and transforms is gated on the development efforts of the intrepid torchvision team. It would be great to enable users to implement their own features (e.g. Polygon
or ImageTrack
) and for torchvision's transforms (or 3rd party transforms) to be able to dispatch on those features. This would reduce the burden on the torchvision team in that it enables users to develop bespoke features that have "drop-in" compatibility with standard torchvision pipelines, without you needing to officially support the feature.
Thus I am proposing to a) make _Feature
part of the public API (take away that leading underscore π ) so that 3rd parties can create custom features, and b) provide a dispatch mechanism by which a 3rd party feature can coerce a transform (either 1st party or 3rd party transform) into dispatching to its own implementation of said transform. E.g. I could write a Polygon
feature that would be able to implement its own affine transformation that both torchvision.transforms.affine
and torchvision.transforms.RandomAffine
dispatch to when they operate on a Polygon
tensor.
There is already some degree of dispatching going on among some transforms. E.g., for affine
and RandomAffine
, any _Feature
subclass can simply implement a Feature.affine
method. The affine
function checks isinstance(inp, Feature)
and then calls inp.affine(...)
. And RandomAffine._transform
simply calls affine
under the hood, which performs the aforementioned dispatching.
That being said, users can't rely on all functions and transformations to be implemented this way. Indeed, Transform._transformed_types
will often preclude such synergistic dispatching (RandomAffine._transformed_types
works here because it only contains Tensor
). It seems to me that, to support this proposal, Transform._transformed_types
would need to be replaced by a Feature
-centric way of documenting compatibility with transforms.
Given that these torchvision's transforms are all unary functions, I believe that the dispatch mechanism could be kept relatively simple. That being said, I don't have deep insight into how JIT compatibility might impact the feasibility of some of the design ideas that I have been playing around with. I also only have a rudimentary understanding of __torch_function__
and __torch_dispatch__
, but I get the sense that torchvision's transforms would not be able to leverage these.
Ultimately, I do not have a concrete design for this dispatch protocol, but would be happy to help develop it with you.
Thank you very much for your consideration. And thank you again for your excellent work on v2. My colleagues and I are excited to start using it.
Thanks a ton for the kind words @rsokl! I think your summary is on point. Let me share my thoughts on your proposals.
torch.Generator
+1 from me on basically everything you proposed. I think the design is simple enough that this can be done quickly in case we decide to adopt it. pytorch/pytorch#79018 is unfortunate here, but if we as torchvision
need that, we might be able to speed things up on core.
The only thing that I would change about your proposal is the existence of a default value for generator
on _get_params
. During normal operation, _get_params
should not be called directly and thus we can be explicit about it. IIUC, this would also not interfere with your use case of fast-forwarding the RNG without doing the actual computation since there you also don't want to use the default generator.
Regarding this use case, please be aware that it will only work for most transforms. Some more elaborate transforms like the AutoAugment
family or SimpleCopyPaste
don't fit into the regular Transform
scheme that we devised. They overwrite forward
directly and perform the random parameter generation there.
features._Feature
publicLet's go through the proposal point by point
a) make
_Feature
part of the public API (take away that leading underscore :smile: ) so that 3rd parties can create custom features
Opinions about this changed over the course of the development. We started with it being public, but made it private at some point. Recently, we found a use case for it to be public again (https://github.com/pytorch/vision/pull/6663#discussion_r982567437). A cautious +1 from my side on this.
b) provide a dispatch mechanism by which a 3rd party feature can coerce a transform (either 1st party or 3rd party transform) into dispatching to its own implementation of said transform.
As you have explained above, this is already happening for some dispatchers like F.affine
. However, there are a few caveats that need more careful design:
Not all kernels have dispatchers. For example
because there would only be a single kernel. Even if we added such a dispatcher for consistency we would get into hot waters, since the dispatcher would violate the rule "for BC plain tensors need to be treated as image or video" since there is no image or video kernel to dispatch to. This rule is currently kept by all our other dispatchers which allows us to keep them BC even with regard to JIT. Of course there is no BC concern for new functions, but it is a lot easier to say "the functional API is JIT compatible" rather than "the functional API that was already there for v1 is JIT compatible, but for the remainder it depends".
Internally, a lot of transformations look up the spatial size from the inputs with
This is already pretty flexible and can pull the information from a lot of types, but there is no "official" protocol yet. For now we are using the spatial_size
attribute
but this only a convention.
Taking 3. one step further, some transformations also query bounding boxes like that:
Without a more elaborate protocol, any bounding-box-like feature like RotatedBoundingBox
proposed in #2761 cannot be a free feature, but has to subclass features.BoundingBox
. Otherwise it won't be picked up and in turn the transformation will fail.
These are just the limitations from the top of my head. There will likely be more if take a closer look. Thus, the design for this will be more complex.
That being said some of these points mentioned above are already in our backlog, but we focused on performance the past few weeks. I agree allowing users to extend our API like that would be very powerful, but we need to solve all rough edges before we can promote this as official part of the API. Let's hear what @datumbox and @vfdev-5 say about this before I cast my vote.
I had been doing a review of augmentation libraries, including kornia, albumentations, and augly, when I came across v2 for torchvision. To me, your new API is the simplest, most capable, and the easiest to extend. I particularly like how simple
Transform
is and that you do not rely on sprawling class hierarchies to dispatch functionality.
@rsokl Would you perhaps share a few words on why / how / in which regard Transforms V2 is βthe simplest, most capable, and the easiest to extendβ? Reading through the blog post, the one section I missed was related work for context. Personally, I only have experience with albumentations. Coming from that, I am inclined to agree with your assessment.
Is there an estimate in terms of either time, or version number, or both, for when the v2 transforms will be included in pytorch-stable? The blog says "planned for Q1" but I wonder if there's a better estimate than that somewhere.
I was about to implement, on a small scale, functions that basically do what v2 does, but if your prototype works well then I will use it for my project.
Should I expect any issues if I train a model on the nightly builds with the v2 transforms, but then run it for predictions on pytorch-stable 1.13? Other than the v2 transforms, I do not use any unusual features.
test case: image segmentation with SegFormer. My dataset has images and masks organized in tuples. I am following this example:
https://huggingface.co/blog/fine-tune-segformer
I need to make sure that geometric transforms (horizontal flip, rotation, affine, perspective) are applied randomly, but are applied the exact same way to each image and its corresponding mask. I am testing with horizontal flip and the image and the mask are flipped in an uncoordinated fashion.
Code example (admittedly inefficient):
from torchvision.prototype.transforms import (
Compose,
RandomApply,
ColorJitter,
RandomRotation,
RandomCrop,
RandomAffine,
RandomHorizontalFlip,
RandomPerspective,
)
from torchvision.prototype import features
from torchvision.prototype.features import ColorSpace
feature_extractor = SegformerFeatureExtractor()
augmentations = Compose(
[
# RandomApply([ColorJitter(brightness=0.5, contrast=0.5)], p=0.75),
RandomHorizontalFlip(p=0.5),
# RandomApply([RandomRotation(degrees=45)], p=0.75),
# RandomApply([RandomAffine(degrees=0.0, scale=(0.5, 2.0))], p=0.25),
# RandomPerspective(distortion_scale=0.5, p=0.25),
]
)
def train_transforms(example_batch):
# original
images = [augmentations(x) for x in example_batch["pixel_values"]]
labels = [x for x in example_batch["label"]]
inputs = feature_extractor(images, labels)
return inputs
def train_transforms2(example_batch):
# my version
batch_items = list(
zip(
[x for x in example_batch["pixel_values"]],
[x for x in example_batch["label"]],
)
)
batch_items_aug = [
augmentations(
features.Image(
np.swapaxes(np.array(x[0]), 0, -1), color_space=ColorSpace.RGB
),
features.Mask(np.swapaxes(np.array(x[1]), 0, -1)),
)
for x in batch_items
]
images, labels = map(list, zip(*batch_items_aug))
inputs = feature_extractor(images, labels)
return inputs
train_ds.set_transform(train_transforms2)
This is one entry in the train dataset that the transforms are applied to:
{'dataset': 0,
'pixel_values': <PIL.PngImagePlugin.PngImageFile image mode=RGB size=400x322>,
'tumor': 1,
'dataset_tumor': 1,
'label': <PIL.PngImagePlugin.PngImageFile image mode=L size=400x322>}
If I repeatedly visualize the image and the mask in train_ds[0]
, they are flipped randomly, but the flip is not applied the same to the image and its mask - they are each flipped as a completely separate random process.
What I expect:
Hey @FlorinAndrei
Is there an estimate in terms of either time, or version number, or both, for when the v2 transforms will be included in pytorch-stable? The blog says "planned for Q1" but I wonder if there's a better estimate than that somewhere.
We are aiming to publish it with the next release, i.e. torchvision==0.15.0
. The exact date is not yet available. We operate on a 3-4 months release cycle and the last release just happened on the 28th of October. So you can expect this to be released in February or March 2023.
Should I expect any issues if I train a model on the nightly builds with the v2 transforms, but then run it for predictions on pytorch-stable 1.13? Other than the v2 transforms, I do not use any unusual features.
No. Apart from JIT, the transforms are BC and thus there should be no complications. You can have a look at #6433 where we do just that.
I am testing with horizontal flip and the image and the mask are flipped in an uncoordinated fashion.
You are not flipping the masks at all:
images = [augmentations(x) for x in example_batch["pixel_values"]]
labels = [x for x in example_batch["label"]]
The horizontal flip is happening in augmentations
but this is never called on any example_batch["label"]
. Even if you were using it in the line above, the augmentation cannot be same for images and masks since you apply them separately.
Transforms v2 can handle arbitrary input structures and so you don't need to handle images and masks separately. You can just pass them into the transforms together like:
augmented_batch = [augmentations(sample) for sample in example_batch]
images, labels = zip(*augmented_batch)
inputs = feature_extractor(images, labels)
The only caveat of this is that example_batch["label"]
will be treated as image for BC. Meaning, the color transformations will also be applied to it. Thus, we need to communicate to the transformations that it is a mask.
In transforms v2 this is done by wrapping the data into custom tensor classes located under torchvision.prototype.features
:
from torchvision.prototype import features, transforms
from torchvision.prototype.transforms import functional as F
class WrapIntoFeatures(transforms.Transform):
def forward(self, sample):
image, label = sample
label = features.Mask(F.pil_to_tensor(mask))
return image, label
augmentations = transforms.Compose([WrapIntoFeatures(), ...])
For custom datasets this will have to happen manually. For our builtin ones, we currently explore the options. See #6662 for the current favorite.
I've put together a small notebook for you. I hope it helps. but let us know if you encounter anything.
I understand the need of wrapping the torch.Tensor class to mimic the behavior of "matching" transformations for a specific input type using _transformed_types
class variable. However, as far as I can understand the user directly handles / sets these types. This might lead to inconsistent behavior with functionalities commonly used and provided by inheritance of the th.Tensor class. I have attached a simple multiplication of two image-typed-like-feature classes below to highlight my concerns. I see two directions:
(a) It is fine for "transforms v2", that operation on two feature.Image types as input defaults back to a non features.Image tensor. I think this behavior might feel inconsistent for some users? (b) The operations (multiplication, addition, substraction, etc.) need to be implemented. However, this opens the hells gate to a lot of manual implementation overhead, e.g. what is the multiplication of a black-and-white image and a RGB image. Or what is the multiplication of an features.Image and a features.Video? --> Implementation Hell. Not to mention what would happen if a feature.Image would be passed to a neural network... feature.Image * th.Tensor..
It might be that my concerns are irrelevant: I have just read through the code briefly. However, I think that this behavior needs to be specified / discussed upfront.
import torch as th
from torchvision.prototype.features import Image
img1 = Image(th.rand(3, 256, 256))
img2 = Image(th.rand(3, 256, 256))
type(img1), type(img2)
>> torchvision.prototype.features._image.Image, torchvision.prototype.features._image.Image
img = img1 * img2
type(img)
>> torch.Tensor
@maxoppelt
mimic the behavior of "matching" transformations for a specific input type using
_transformed_types
class variable.
Not sure I understand. Transform._transformed_types
is a way to specify which types you want to transform or on the flip side for which you want no-op behavior. Most transformations use the default
meaning all tensors and PIL images will be transformed.
This might lead to inconsistent behavior with functionalities commonly used and provided by inheritance of the th.Tensor class.
Inside our transformations, we make sure that there is no inconsistent behavior. If you pass in a features.Image
you will also get one back. That is handled on the functional dispatcher level, i.e. F.resize
or the like will give you this behavior. There are a few exceptions like
I think that this behavior needs to be specified / discussed upfront.
The behavior is specified and encoded here:
The function has extensive comments what it is doing, but let me give you the TL;DR: except for very few operators, namely
there is no fast and safe way for us to determine if the result of the operation should retain the feature type or not. You already highlighted a few cases above. Thus, any operation except for the ones mentioned above will "unwrap" the result, i.e. give you a plain torch.Tensor
back.
Now, there are times where the result should retain the input type. This is happening all over our transforms. In that case you will need to wrap again, like
output = ...
return features.Image(output)
As a shorthand if you don't want to copy the metadata manually, you can use features.Image.wrap_like(input, output)
, e.g.
(Note that the .as_subclass(torch.Tensor)
call before we call the kernel is a micro-optimization and not required)
@pmeier: Yes I agree and I think it's great that you can define Transformations with op or no-op depending on the specified feature
. Thanks for laying out the implementation again, but my concern is more about the user interface and the way of specifying these types. Let me give you a small example, a user would assume (without reading our last two comments in this feedback ticket) would work:
As addition to your provided example, here are two masks for your image above.
from torchvision import io, utils
from torchvision.prototype import features, transforms as T
from torchvision.prototype.transforms import functional as F
# Loading data
img = features.Image(io.read_image('COCO_val2014_000000418825.jpg'), color_space=features.ColorSpace.RGB)
smask = features.Mask((io.read_image('strawberry.png') != 0).type(th.uint8)) # has a no-op for GaussianBlur
tmask = features.Mask((io.read_image('tomato.png') != 0).type(th.uint8)) # has a no-op for GaussianBlur
sotmask = smask | tmask # is now a simple tensor and _isinstance(..., _is_simple_tensor) --> has op for GuassianBlur
# Defining and applying Transforms V2
trans = T.Compose(
[
T.GaussianBlur(kernel_size=7),
]
)
np.all((trans(img) == img).numpy())
>> False. # Expected
np.all((trans(smask) == smask).numpy())
>> True # Expected
np.all((trans(sotmask) == sotmask).numpy())
>> False # This is what I am talking about...
Edit: Fixed, code based on @pmeier's comment below.
@maxoppelt thanks for your feedback and the discussion ! Concerning your point about
operation on two feature.Image types as input defaults back to a non features.Image tensor. I think this behavior might feel inconsistent for some users?
and your examples:
type(img1), type(img2)
>> torchvision.prototype.features._image.Image, torchvision.prototype.features._image.Image
img = img1 * img2
type(img)
>> torch.Tensor
and
np.all((trans(sotmask) == sotmask).numpy())
>> False # This is what I am talking about...
What do you think would be a consistent/expected behaviour ? Should they raise an error instead ?
@maxoppelt
but my concern is more about the user interface and the way of specifying these types. Let me give you a small example, a user would assume (without reading our last two comments in this feedback ticket) would work:
You are right that the communication of this needs to be clear. Right now documentation is scarce, but this will change before the release. Basically the rule is: "if you perform any operation on tensor subclasses outside of our builtin transforms, you need to wrap again". We will have detailed documentation about how to write a custom transform that will cover this.
We thought about this very issue a lot during the design phase. In the end we decided against an automatic detection whether or not the result can retain the type or not due to the issues listed above. And the behavior you mention is just the logical consequence of this.
Unless we missed something obvious, there is no better way to do this if we want to use the type of the input as dispatch mechanism. The solutions of comparable libraries also require you to be explicit either by calling the transform like transform(..., mask=sotmask)
or providing this information when the transform is created transform = MyTransform(..., order=["image", "mask"])
. Thus, we are not an outlier when it comes to being explicit with transform(..., features.Mask(sot_mask))
. Do you have a better idea?
P.S.: features.Mask
is the right type for masks. features.Label
is for numeric labels like
>>> label = features.Label([0, 1, 1], categories=["strawberry", "tomato"])
>>> label
Label([0, 1, 1])
>>> label.to_categories()
['strawberry', 'tomato', 'tomato']
@vfdev-5
What do you think would be a consistent/expected behavior ? Should they raise an error instead ?
Depends, I could image something like
type(img)
> torchvision.prototype.features._image.Image
type(mask)
> torchvision.prototype.features._mask.Mask
result = image * mask
type(result)
> torchvision.prototype.features._image.Image
would be self-explanatory and result in something like this:
Raising an error is probably too much, as it might be a valid operation... It is just not clear that any operation using a feature
class will result a non-feature
class, but in a th.Tensor
class.
However, I kind of agree with @pmeier, as these new feature
classes are mainly there to define op vs. no-op for inputs in torchvision transformations v2. My question is only pointing in the direction of unspecified usage: Which is, I might add, probably happening very often in the future. For example: The Dataset class provides e.g. two variables of type torchvision.prototype.features._image.Image and torchvision.prototype.features._label.Label, that should be transformed using the torchvision transformation pipeline v2. However in-between these two steps, it is very likely the user is accessing the data, e.g. for a special normalization or a reshape operation?
An alternative approach could look like this:
type(img1), type(img2), type(mask)
> th.Tensor, th.Tensor, th.Tensor
trans = T.Compose(
[
T.GaussianBlur(kernel_size=7),
T.ColorJitter(contrast=0.5),
T.RandomRotation(30),
T.CenterCrop(480)
],
input_types = [Image, Image, Mask]
)
trans(img1, img2, mask)
instead of this:
type(img1), type(img2), type(mask)
> torchvision.prototype.features._image.Image, torchvision.prototype.features._image.Image, torchvision.prototype.features._mask.Mask
trans = T.Compose(
[
T.GaussianBlur(kernel_size=7),
T.ColorJitter(contrast=0.5),
T.RandomRotation(30),
T.CenterCrop(480)
],
)
trans (img1, img2, mask)
However, this approach might have some disadvantages, too: Type and value are separated and why not use https://pytorch.org/docs/stable/notes/extending.html#subclassing-torch-tensor (like currently done...). At the end it is probably an issue that could be resolved/prevented in the future by writing a good documentation like @pmeier pointed out. Maybe one could use python warnings to inform the user...
Btw. why is the new torchvision type system called features? In most cases torchvision transforms will work on the input and not latent representation (usually called features).
@pmeier In the example above, I actually use train_transforms2()
which is applied to both images and masks. However, that is not why my code was failing to apply the transforms in a coordinated fashion. I've tried Albumentations instead, and I got the same bad result.
The reason is that I was invoking the dataset twice: once to get the image, and again to get the mask. Those were two separate applications of the augmentation function, and of course they were not coordinated in terms of random geometric transforms.
All that was fixed when I extracted that single item from the dataset once, and then extracted the image and the mask from the item.
Apologies for the non-issue.
@FlorinAndrei Thanks for taking the time to provide in depth feedback. We really appreciate it. Please keep it coming; perhaps the confusion caused indicates there are still rough edges on the API that we need to fix. Or perhaps we need to document the gotcha's better. If you have other ideas on the overall API (both public and developer) or on the naming conventions please let us know. :)
@maxoppelt
Btw. why is the new torchvision type system called features? In most cases torchvision transforms will work on the input and not latent representation (usually called features).
You are absolutely right. This is very confusing. This is a placeholder name until we find something better. We need a concept for the base tensor class that can be reused for Images, Videos, BoundingBoxes, Labels, Masks etc. Naming is NP-hard, any help on that front would be highly appreciated...
We need a concept for the base tensor class that can be reused for Images, Videos, BoundingBoxes, Labels, Masks etc.
Perhaps DataTensor
? It risks being a bit vague, but it does convey "these are things are all forms of training/testing data".
The naming issue is not new. See for example the thread in #5045. We used "feature" in the beginning since that is what tensorflow-datasets
used. X-posting some of the comments / suggestions from the other thread here
https://github.com/pytorch/vision/issues/5045#issuecomment-988603265
any name even more obscure will be better than "feature", even VisionFeature in one word, VisionTrait, VisionModality, anything is better than features.feature.
https://github.com/pytorch/vision/issues/5045#issuecomment-989794497
My thought was naming
Feature
asStructure
. In this way we match the detectron2 names. https://github.com/facebookresearch/detectron2/tree/main/detectron2/structures
https://github.com/pytorch/vision/issues/5045#issuecomment-1033642077
re the name "Feature," aside from being an overloaded term, is it also incorrect? Usually a label is not a feature -- I would think a Label subclassing Feature would imply self-supervision. I want to suggest "example," but aside from also being overloaded, I think that only works if it's a feature and label. Here I think we need a term that can refer to either a feature or label, but not both.
Hi. As a PhD student working on object detection and instance segmentation, I'm very happy to see this added to torchvision. I'm eager to use this new transforms API in my own code base.
When doing (box-based) instance segmentation, both target boxes and target masks need to be provided to the model. As written in the blog post, one would use
img, bboxes, masks, labels = trans(img, bboxes, masks, labels)
to apply the transforms to the various data structures.
By applying each of the transforms on the boxes and masks individually (as I believe is done in Transforms V2), one might expect the transformed bounding boxes to tightly fit the transformed object masks, as was the case before the applying the transforms.
However, some operations like the crop operation might result in inconsistencies between the cropped boxes and masks, with the cropped boxes no longer tightly fitting the cropped masks. See https://github.com/facebookresearch/detectron2/blob/main/detectron2/data/dataset_mapper.py#L135 for a toy example. I believe the same might happen to other transforms, such as e.g. rotation-based transforms.
I'm not sure how much these inconsistencies affect the performance (I expect not much), but I think this issue needs to be addressed. If not, some users might think there's some kind of bug as they would expect the transformed bounding boxes to always tightly fit the transformed masks.
Regardless of the precise solution to this problem, I think a get_bounding_boxes(bbox_format)
method should be added to the Mask structure.
Solution 1: Change nothing to the API. It's up to the users to decide on how they want to transform the boxes. If the user wants the transformed boxes to always tightly fit the transformed masks, then the user can proceed as follows:
img, masks, labels = trans(img, masks, labels)
bboxes = masks.get_bounding_boxes(bboxes.format)
Solution 2: Change the API. I'm not sure what the cleanest solution would be. Maybe something like
img, bboxes, masks, labels = trans(img, bboxes, masks, labels, get_bboxes_from_masks=True)
could be considered where internally something like
if get_bboxes_from_masks:
# TODO: Remove bboxes from pytree and apply transforms as usual
bboxes = masks.get_bounding_boxes(bboxes.format)
# TODO: Add bboxes to output
else:
# The usual
would be used.
Hey @CedricPicron and thanks for your feedback. I agree that we need to touch on this in the documentation to explain what is happening.
Regardless of the precise solution to this problem, I think a
get_bounding_boxes(bbox_format)
method should be added to the Mask structure.
There is already an operator for it: torchvision.ops.masks_to_boxes
. We haven't wrapped it inside a transform or a method on Mask
yet. This is still up for discussion.
As for the proposed solutions, with the op from above, you can easily add a custom transform that does what you want:
from typing import *
from torchvision.prototype import transforms, features
from torchvision.prototype.transforms import functional as F
from torchvision.ops import masks_to_boxes
# We are currently debating whether we should make this public
from torchvision.prototype.transforms._utils import has_all
# This is modeled after `query_bounding_box` located in `torchvision.prototype.transforms._utils`
def query_mask(flat_inputs: List[Any]) -> features.Mask:
masks = [inpt for inpt in flat_inputs if isinstance(inpt, features.Mask)]
if not masks:
raise TypeError("No mask was found in the sample")
elif len(masks) > 1:
raise ValueError("Found multiple masks in the sample")
return masks.pop()
class TightenBoundingBoxes(transforms.Transform):
_transformed_types = (features.BoundingBox,)
def _check_inputs(self, flat_inputs: List[Any]) -> None:
# Of course, we could also make this a no-op in case we don't find both
if not has_all(flat_inputs, features.Mask, features.BoundingBox):
raise TypeError("TightenBoundingBoxes needs masks and bounding boxes")
def _get_params(self, flat_inputs: List[Any]) -> Dict[str, Any]:
mask = query_mask(flat_inputs)
tight_bounding_box = masks_to_boxes(mask)
return dict(tight_bounding_box=tight_bounding_box)
def _transform(self, inpt: Any, params: Dict[str, Any]) -> Any:
return features.BoundingBox.wrap_like(
inpt,
F.convert_format_bounding_box(
params["tight_bounding_box"].to(inpt),
old_format=features.BoundingBoxFormat.XYXY,
new_format=inpt.format,
inplace=True,
),
)
Stealing the toy example from above:
import torch
mask1 = features.Mask(torch.triu(torch.ones(4, 4), diagonal=1).unsqueeze(0))
box1 = features.BoundingBox(
[[1, 0, 3, 2]], format=features.BoundingBoxFormat.XYXY, spatial_size=mask1.shape[-2:]
)
print(mask1)
print(box1)
# Emulating a fixed crop transform
mask2 = F.crop(mask1, top=0, left=0, height=4, width=3)
box2 = F.crop(box1, top=0, left=0, height=4, width=3)
print(mask2)
print(box2)
transform = TightenBoundingBoxes()
mask3, box3 = transform(mask2, box2)
print(mask3)
print(box3)
mask1 Mask([[[0., 1., 1., 1.],
[0., 0., 1., 1.],
[0., 0., 0., 1.],
[0., 0., 0., 0.]]])
box1 BoundingBox([[1, 0, 3, 2]], format=BoundingBoxFormat.XYXY, spatial_size=torch.Size([4, 4]))
mask2 Mask([[[0., 1., 1.],
[0., 0., 1.],
[0., 0., 0.],
[0., 0., 0.]]])
box2 BoundingBox([[1, 0, 3, 2]], format=BoundingBoxFormat.XYXY, spatial_size=(4, 3))
mask3 Mask([[[0., 1., 1.],
[0., 0., 1.],
[0., 0., 0.],
[0., 0., 0.]]])
box3 BoundingBox([[1, 0, 2, 1]], format=BoundingBoxFormat.XYXY, spatial_size=(4, 3))
Now you can drop this in wherever you need it in your pipeline. Given that this has a performance implication and we currently don't have evidence that this actually impacts the performance of the model, IMO it is best to keep this behavior manual. Plus, it is not as easy as the check you proposed in solution 2 above since these transforms need to work regardless of the task, but object detection may have no masks available.
If you have another look at the output from the example above, there is another case where some manual action is required by the user if we didn't have TightenBoundingBoxes
: after the crop, box2
has the right spatial_size
, but the second coordinate now lies outside of the image. This is intentional since we don't know if the user wants to clamp the bounding boxes or not. Plus, always doing so again has some performance implications.
We already have two builtin transforms that function similar to the one I proposed above:
ClampBoundingBoxes
: clamp bounding boxes to spatial size. This does not remove bounding boxes that are empty after the operation.RemoveSmallBoundingBoxes
: remove small (default empty) bounding boxes and do the same to corresponding masks and labels.Hi @pmeier. Thanks a lot for your detailed response!
I like the proposed solution based on the TightenBoundingBoxes
transform. I think it's a flexible solution, similar to the existing ClampBoundingBoxes
and RemoveSmallBoundingBoxes
transforms. It would be nice to see TightenBoundingBoxes
added to Transforms V2.
Some additional comments:
ClampBoundingBoxes
and TightenBoundingBoxes
can be grouped under bbox
transforms, and RemoveSmallBoundingBoxes
under filter
transforms.TightenBoundingBoxes
, the input bounding boxes do not matter and only the bbox format is used. Transforms prior to the TightenBoundingBoxes
transform hence do not need to be applied on the boxes input. However, currently they are. Avoiding this unnecessary computation could make the pipeline with TightenBoundingBoxes
possibly even faster than without. If desired, this could be mitigated by making _transformed_types
an instance attribute (instead of a class attribute) and allow the user to set a custom _transformed_types
attribute different from the default one.
- I think it's important that users can easily find these transforms. Currently, these are found/implemented in _meta.py and _misc.py. I was wondering if some of these transforms could not be grouped in a different way, such that they can more easily be found. Maybe the transforms
ClampBoundingBoxes
andTightenBoundingBoxes
can be grouped underbbox
transforms, andRemoveSmallBoundingBoxes
underfilter
transforms.
I agree the naming of the modules is not perfect here. On the flip side, I'd wager a guess regardless of what scheme you choose, you will always have outliers that don't fit anywhere.
Plus, you are looking at internal / private namespaces here. I'm fully aware that this is the only thing you can do right now due to the non-existent other documentation, but this will change before release. Meaning, you can discover everything there instead of going through the source. Since this bounding box behavior might indeed be unexpected, I think it would be good to add a small gallery to show the effects.
If you do want to look at the source, I suggest to have a look at the __init__
files for a better overview:
- When using
TightenBoundingBoxes
, the input bounding boxes do not matter and only the bbox format is used. Transforms prior to theTightenBoundingBoxes
transform hence do not need to be applied on the boxes input. However, currently they are. Avoiding this unnecessary computation could make the pipeline withTightenBoundingBoxes
possibly even faster than without. If desired, this could be mitigated by making_transformed_types
an instance attribute (instead of a class attribute) and allow the user to set a custom_transformed_types
attribute different from the default one.
I agree with the statement, but I think there is a far easier solution: don't pass the bounding box. Unless you have a transformation in your pipeline that requires a bounding box to be present, e.g. RandomIoUCrop
, you can simply not pass it and thus save all the computation. If you have transforms that need bounding boxes, I think you can write a thin transforms.Compose
wrapper to achieve this locally instead for the whole pipeline:
from torch.utils._pytree import tree_flatten, tree_unflatten
class NoBoundingBoxesContainer(transforms.Compose):
def forward(self, *inputs):
flat_inputs, spec = tree_flatten(inputs if len(inputs) > 1 else inputs[0])
indexed_mask = None
indexed_bounding_box = None
everything_else = []
for idx, inpt in enumerate(flat_inputs):
if isinstance(inpt, features.BoundingBox):
indexed_bounding_box = (idx, inpt)
elif isinstance(inpt, features.Mask):
indexed_mask = (idx, inpt)
else:
everything_else.append((idx, inpt))
# do a proper error checking here if a mask and box is available
transformed_indexed_mask, transformed_everything_else = super().forward(
indexed_mask, everything_else
)
transformed_indexed_bounding_box = (
indexed_bounding_box[0],
features.BoundingBox.wrap_like(
indexed_bounding_box[1],
F.convert_format_bounding_box(
masks_to_boxes(transformed_indexed_mask[1]).to(indexed_bounding_box[1]),
old_format=features.BoundingBoxFormat.XYXY,
new_format=indexed_bounding_box[1].format,
inplace=True,
),
),
)
flat_outputs = list(
zip(
*sorted(
[
transformed_indexed_bounding_box,
transformed_indexed_mask,
*transformed_everything_else,
],
key=lambda indexed_output: indexed_output[0],
)
)
)[1]
return tree_unflatten(flat_outputs, spec)
Although this looks quite daunting, it actually doesn't do anything complicated. Basically we fish out the mask and bounding box from the input, transform the mask as well as everything else, create a new bounding box from the transformed mask, and assemble everything back together. With this you can do
pipeline = transforms.Compose(
[
NoBoundingBoxesContainer(
[
transforms.RandomRotation(...),
transforms.RandomCrop(...),
]
),
transforms.RandomIoUCrop(...),
NoBoundingBoxesContainer(...),
]
)
Still, IMO we are going deep into specialized transforms here. Unless there is significant demand for something like this in the library, I think you are better off defining such a transform yourself.
Yes, using NoBoundingBoxesContainer
is a nice solution.
I guess the key for users will be to have good documentation and examples regarding the implementation of custom container
-like and transform
-like transforms, but I'm sure you guys already have things planned.
Thanks @pmeier for the quick responses and good luck finalizing this project. I hope the feedback was (somewhat) useful.
In #7002 we are switching the naming of the tensor subclasses from _Feature
and torchvision.features
to Datapoint
and torchvision.datapoints
. TL;DR We found this to be the best balance between a too generic or a too narrow name.
@rsokl Although Datapoint
has no leading underscore anymore, it will still not be public. We decided that your proposal of having it public and thus opening up the API for logic injection by users is a very worthwhile feature to have. Unfortunately, we are not confident that we can properly finish this for the first iteration. v2 brings quite a few improvements over v1 and we want to publish them soon.
That is not to say that we won't add this in the future. Nothing in the proposal is orthogonal to our design, so we can add this feature in a BC manner later. In fact, we intentionally removed the underscore from the name to make it easier for users to experiment with it and report back any issues. You can import it like from torchvision.prototype.datapoints._datapoint import Datapoint
. In the future, you only need to change the import to from torchvision.prototype.datapoints import Datapoint
.
@pmeier Thank you for letting me know. I am very happy to hear that you all think that eventually opening up the API for logic injection would be worthwhile. My colleagues and I think that this will be a very powerful capability. It makes perfect sense why this improvement should be deferred until after v2, in its current form, is published.
I appreciate you all taking the time to consider my proposals and for keeping me in the loop on this. Please let me know if I can be of assistance in any way β brainstorming design ideas, elaborating on aspects of the proposal, etc. β moving forward.
I'd be happy to open separate issues β for passing generators to transforms and for the functionality injection proposal β if that would be helpful at all.
I'd be happy to open separate issues β for passing generators to transforms and for the functionality injection proposal β if that would be helpful at all.
That would be awesome. Makes it easier to track than in this mega thread.
I'd like to know whats the state of support for batches of images etc. Right now it is still a little unclear to me, can someone clear that up?
Hey @twsl. Batches of images (and other features as well) works the same way as it worked before: they are supported, but any kind of random effect of the transform is the same for every image in the batch. So for example you call a RandomRotate
transform with a tensor of shape (B, C, H, W)
, the whole batch will be rotated with the same random angle rather than each image individually. In fact our API was extended to support arbitrary batch sizes, i.e. (*, C, H, W)
where ever possible.
There are a few exceptions though, namely the batch transforms, e.g.
They require the input to be (*, B, C, H, W)
since they operate along the batch dimension B
.
Hey everyone, we have some exciting news to share: after some intense weeks, we managed to bring the prototype transforms in a state that we will be publishing with the upcoming release of torchvision==0.15.0
in March 2023 :tada: You will be able to import the datapoints as torchvision.datapoints
and the transforms as torchvision.transforms.v2
. They will be released in a beta state, meaning, although we are confident about the API, we don't commit to BC just yet. There will be a runtime warning at import telling you as much; we will document the API surfaces where we're unsure about BC in #7319.
We already have some examples up if you want to take a sneak peek:
However, we made some changes to the API after the initial publication of the blog post. They are summarized in #7384 and the blog post was updated accordingly.
@pmeier Awesome to see this being released on v0.15. Makes sense we go for a Beta to collect input from the community.
In the docs, do we have a section showing what are the options with the transforms v2 API. Perhaps you can add it here unless its already there ?
In the docs, do we have a section showing what are the options with the transforms v2 API
Could you be more specific what you mean by that? Are you talking about the parameters passed to transform when initializing? If yes, v2 is a drop-in replacement for v1. Meaning, everything that is supported on v1 also works for v2 in the same way.
Perhaps you can add it here unless its already there ?
Assuming you mean general API documentation, it is already there. We have the banner up top for visibility and the actual documentation interleaved with the v1 documentation below. For example:
@pmeier what's the plan/roadmap for reintroducing the transforms (such as RandomCutMix, RandomMixUp, SimpleCopyPaste, FixedSizedCrop etc) you decided to hold off? These are pretty fundamental augmentations used for SoTA modelling for video and image classification, they have been used in released models and they have been requested numerous times by researchers, internal customers and the community.
We are aware those transforms are useful.
We detailed why we're keeping those transforms in the prototype area for now in https://github.com/pytorch/vision/issues/7384. Basically in their current form, they do not provide any UX improvement over the ones we have in the reference area, and their (currently) clumsy UX is at odds with the rest of the API which is otherwise very clean. We have decided to keep them in the prototype area until we can come up with a significant UX improvement. That may be when/if we release new datasets with native support for batch transforms, or with other solutions.
How can I use the v2 API on a custom dataset? I am following this guide, and wrap_dataset_for_transforms_v2
does not work for it.
Hey @praxxus11. For custom datasets, you'll have to write the wrapper yourself. There are two options here:
Wrap the objects directly in the dataset:
from torchvision import datapoints
from torchvision.transforms.v2 import functional as F
class PennFudanDataset(torch.utils.data.Dataset):
...
def __getitem__(self, item):
...
target["boxes"] = datapoints.BoundingBox(
boxes,
format=datapoints.BoundingBoxFormat.XYXY,
spatial_size=F.get_spatial_size(img),
)
target["labels"] = labels
target["masks"] = datapoints.Mask(masks)
...
if self.transforms is not None:
img, target = self.transforms(img, target)
...
Add a custom wrapping transform to the beginning of your pipeline.
from torchvision import datapoints
from torchvision.transforms.v2 import functional as F
class WrapPennFudanDataset:
def __call__(self, img, target):
target["boxes"] = datapoints.BoundingBox(
target["boxes"],
format=datapoints.BoundingBoxFormat.XYXY,
spatial_size=F.get_spatial_size(img),
)
target["masks"] = datapoints.Mask(target["masks"])
return img, target
...
def get_transform(train):
transforms = []
transforms.append(WrapPennFudanDataset())
transforms.append(T.PILToTensor())
...
After that you no longer need to copy-paste transforms from our reference implementation, but can import from torchvision.transforms.v2
directly. Let us know how it goes for you!
One note of advice though: if you are working on an objection detection task and don't need the masks, it is best to remove them from the sample. If you keep them, the v2 transforms will act on them as well just for them to not be used in the end. See #7494 for details.
Thanks, works fine now. Perhaps this should be made more clear in the documentation?
What parts of the documentation did you look at? I guess we can add a link to this example in case we encounter a "foreign" dataset:
Is this error what brought you here @praxxus11?
Yep, this is the error I got. I think the main issue is that the datapoints FAQ
page is a big sparse with relevant information for custom datasets. I did previously read over the "Do I have to wrap the output of the datasets myself" section, but it wasn't made clear how exactly I had to wrap my __getitem__
function in order to be compatible with the v2 transforms.
Hi ! I am trying to use any transform
function such as RandomErasing
in a semantic segmentation task but the only change appears only in the image (and not in the associated mask). I have the following code :
torchvision.disable_beta_transforms_warning()
import torchvision.transforms.v2 as transforms
...
Class CustomDataset(Dataset):
def __init__(self, ...):
self.transform_img = transforms.Compose(
[
...
transforms.ToTensor(),
]
)
self.transform_mask = transforms.Compose(
[
...
transforms.ToTensor(),
]
)
self.transform_both = transforms.Compose(
[
transforms.RandomErasing(p=1.0)
]
)
def __getitem__(self, idx):
...
image = self.transform_img(image)
mask = self.transform_mask(mask)
return self.transform_both(image, mask) # The random patch is applied only in "image"
From what I understood, we can use transforms.v2
as the previous one but maybe I misunderstood something.
@mlansar Did you wrap your mask in a datapoints.Mask
, i.e.
from torchvision import datapoints
mask = datapoints.Mask(mask)
?
Plus, I'm not sure if the above is just to showcase the issue, but if not, there is no need for a separate transform_img
and transform_img
. You can always use
self.transform_both = transforms.Compose(
[
transforms.ToTensor(),
transforms.RandomErasing(p=1.0),
]
)
@pmeier Yes ! I tried with and without datapoints.Mask
but the mask is still unchanged...
The two transform_*
variable are different in my code : for example in my transform_img
, I have a transforms.Normalize(...)
just after transforms.ToTensor()
so maybe I cannot do it differently
I am trying to use any
transform
function such asRandomErasing
Are you actually talking about "any" transform, or just RandomErasing
? Because the latter intentionally doesn't touch masks:
RandomErasing
is a regularization transform. It intentionally removes parts of the image (or video for that matter), but the model should still classify / segment / detect as if it was the original image. To train this kind of behavior, the mask, as target in a segmentation task, is untouched.
The two
transform_*
variable are different in my code : for example in mytransform_img
, I have atransforms.Normalize(...)
just aftertransforms.ToTensor()
so maybe I cannot do it differently
Its hard to tell from the little information I have here, but Normalize
also only applies to images and videos:
So you can just put it into a pipeline that includes masks and the will ignored correctly.
If you are unsure, please post the sample structure you are using, e.g. two-tuple of image and mask, as well as the pipeline you want to use. Otherwise there is little we can do to help.
Are you actually talking about "any" transform, or just RandomErasing? Because the latter intentionally doesn't touch masks:
Yes, using datapoints.Mask
it works well for other transformation functions... thanks !
RandomErasing is a regularization transform. It intentionally removes parts of the image (or video for that matter), but the model should still classify / segment / detect as if it was the original image. To train this kind of behavior, the mask, as target in a segmentation task, is untouched.
I completely agree with you, but in my use case I want to erase both image and mask with the same rectangle. Since it is not implemented, I tried to find the best solution to do so but everything that I found on GitHub does not work... do you think that the proper (and fastest) way is to change the RandomErasing
function ?
So you can just put it into a pipeline that includes masks and the will ignored correctly. If you are unsure, please post the sample structure you are using, e.g. two-tuple of image and mask, as well as the pipeline you want to use. Otherwise there is little we can do to help.
Thanks for the tip ! Let me show you (without RandomErasing
and datapoints.Mask
):
img_path = self.img_dir + "/image/" + self.image_list[idx]
mask_path = self.img_dir + "/mask/" + self.image_list[idx]
image = Image.open(img_path).convert("RGB")
mask = Image.open(mask_path)
transform_img = transforms.Compose(
[
transforms.Resize(img_size),
transforms.ToTensor(),
transforms.Normalize((0.2, 0.2, 0.2), (0.1, 0.1, 0.1)),
]
)
transform_mask = transforms.Compose(
[
transforms.Grayscale(num_output_channels=1),
transforms.Resize(img_size),
transforms.ToTensor(),
]
)
mask = torch.where(mask >= 0.5, 1, 0)
image = transform_img(image)
mask = transform_mask(mask)
I agree with the fact that PyTorch will ignore Normalize
for the mask but what about the Grayscale
function ?
do you think that the proper (and fastest) way is to change the RandomErasing function ?
If you want RandomErasing
for masks, you can create a thin subclass:
class RandomErasingWithMasks(transforms.RandomErasing):
_transformed_types = (*transforms.RandomErasing._transformed_types, datapoints.Mask)
Using this, this will also applied to masks:
what about the
Grayscale
function ?
Same. Color transforms are not applied to masks:
Your mask pipeline looks kinda sketchy though:
convert("RGB")
the image, but not the mask. I'm assuming the mask is already single channel?torch.where(mask >= 0.5, 1, 0)
looks like you don't have a binary mask to start with?Resize
by default uses bilinear interpolation, which is not suited for masks. However you seem to have non-binary masks so not sure if that is an issue here or not.In any case, unless any follow-up questions are related to transforms v2, let's move this discussion to the user forum to avoid cluttering this feedback thread.
Hello! I've got two questions:
@leonardpasi
any reason why, out of all the transforms, the SanitizeBoundingBox seem to convert BoundingBox to Tensor? Shouldn't this behavior at least be made explicit in the docs?
This is unfortunate bug, that we only detected after release. It was fixed in #7446. The patch is already available in our nightly releases and will be soon in 0.15.2
bug fix release.
wouldn't it be desirable for MeanAveragePrecision in torchmetrics to accept boxes as BoundingBox objects? Right now, a ValueError is raised if the boxes are not Tensor.
I'm not familiar with torchmetrics so I can't comment on that. However, unless they are checking for type equality, it should work:
>>> bounding_box = datapoints.BoundingBox(
... [0, 1, 2, 3],
... format=datapoints.BoundingBoxFormat.XYXY,
... spatial_size=(5, 5),
... )
>>> isinstance(bounding_box, torch.Tensor)
True
>>> type(bounding_box) is torch.Tensor
False
Could you post a MRE for the ValueError
you saw?
π The feature
This issue is dedicated for collecting community feedback on the Transforms V2 API. Please review the dedicated blogpost where we describe the API in detail and provide an overview of its features.
We would love to get your thoughts, comments and input in order to improve the API and graduate it from prototype on the near future.
Please also check out https://github.com/pytorch/vision/issues/7319 where we collect feedback on some specific design decision, and document as well which APIs may change in the future!
Code example using this image: