pytorch / vision

Datasets, Transforms and Models specific to Computer Vision
https://pytorch.org/vision
BSD 3-Clause "New" or "Revised" License

[RFC] Stereo Matching Datasets API #6259

Open TeodorPoncu opened 2 years ago

TeodorPoncu commented 2 years ago

🚀 The feature

The proposed feature aims to extend the current datasets API with datasets geared towards the task of Stereo Matching. Its main use case is to provide a unified way of consuming classic Stereo Matching datasets such as:

Other considered dataset additions are: Sintel, FallingThings, InStereo2K, ETH3D, Holopix50k. A high-level preview of the dataset interface would be:

class StereoMatchingDataset(Dataset):
    def __init__(self, ...):
        # constructor code / dataset specific code

    def __getitem__(self, index: int) -> Tuple[Tensor, Tensor, Tensor, Tensor]:
        # processing code
        # ...

        # imgs: Tuple[Tensor, Tensor] 
        # disparities: Tuple[Tensor, Tensor]
        # occlusion_masks: Tuple[Tensor, Tensor]

        if self.transforms is not None:
            imgs, disparities, occlusion_masks = self.transforms(imgs, disparities, occlusion_masks)

        img_left = imgs[0]
        img_right = imgs[1]
        disparity = disparities[0]
        occlusion_mask = occlusion_masks[0]

        return img_left, img_right, disparity, occlusion_mask

Motivation, pitch

This API addition would cut down the engineering time required by people who are developing, experimenting with, or evaluating Stereo Matching models, or who simply want easy access to stereo image data.

Throughout the literature, recent methods (1, 2) make use of multiple datasets that all have different formats or specifications. A unified dataset API would streamline interacting with several data sources at the same time.

Alternatives

The official repo for RAFT-Stereo provides similar functionality for the datasets on which the network proposed in the paper was trained and evaluated. The proposed StereoMatchingDataset API would be largely similar to it, whilst following idiomatic torchvision conventions.

Additional context

Stereo Matching task formulation.

Commonly throughout the literature, the task of stereo matching requires a reference image (traditionally the left image), its stereo pair (traditionally the right image), the disparity map between the two images (traditionally the left->right disparity), and an occlusion / validity mask for pixels in the reference image that do not have a correspondent in the stereo pair.
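For intuition, here is a minimal, hypothetical sketch of that convention (the helper name and shapes are made up): a pixel at column x in the reference image corresponds to column x - d in the stereo pair, and pixels whose correspondent falls outside the frame are the ones an occlusion / validity mask would flag.

import torch
from torch import Tensor

def right_view_columns(disparity: Tensor) -> Tensor:
    # disparity: (1, H, W) left->right disparity map. A pixel (y, x) in the
    # left image corresponds to (y, x - d) in the right image.
    _, h, w = disparity.shape
    xs = torch.arange(w).expand(h, w)
    return xs - disparity[0]

# toy example: constant disparity of 4 pixels on an 8x16 image
disparity = torch.full((1, 8, 16), 4.0)
cols = right_view_columns(disparity)
# columns landing outside the right image have no correspondent there
no_correspondent = (cols < 0) | (cols >= 16)

The proposed API would serve data to the user in the following manner: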

Proposal 1.

class StereoMatchingDataset(Dataset):
    def __init__(self, ...):
        # constructor code / dataset specific code

    def __getitem__(self, index: int) -> Tuple[Tensor, Tensor, Tensor, Tensor]:
        # processing code
        # ...

        # imgs: Tuple[Tensor, Tensor] 
        # disparities: Tuple[Tensor, Tensor]
        # occlusion_masks: Tuple[Tensor, Tensor]

        if self.transforms is not None:
            imgs, disparities, occlusion_masks = self.transforms(imgs, disparities, occlusion_masks)

        img_left = imgs[0]
        img_right = imgs[1]
        disparity = disparities[0]
        occlusion_mask = occlusion_masks[0]

        return img_left, img_right, disparity, occlusion_mask
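
As a usage sketch, Proposal 1 composes directly with a standard DataLoader; the toy dataset below is a stand-in for illustration, not a real torchvision dataset:

import torch
from torch.utils.data import DataLoader, Dataset

class ToyStereoDataset(Dataset):
    # stand-in for any StereoMatchingDataset implementing Proposal 1
    def __len__(self):
        return 8

    def __getitem__(self, index):
        img = torch.rand(3, 32, 64)
        disparity = torch.rand(1, 32, 64)
        mask = torch.ones(1, 32, 64, dtype=torch.bool)
        return img, img, disparity, mask

loader = DataLoader(ToyStereoDataset(), batch_size=4, shuffle=True)
for img_left, img_right, disparity, occlusion_mask in loader:
    # every element arrives already batched, e.g. img_left: (4, 3, 32, 64)
    pass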

The above interface for data consumption is more aligned with the larger dataset ecosystem in torchvision, where a dataset provides all the tensors required to perform training. However, this approach assumes that the user / algorithm does not require the right disparity map or the right occlusion mask. An alternative would be to modify the interface so that the user can also access the right-channel annotations:

Proposal 2.

def __getitem__(self, index: int) -> Tuple[Tuple, Tuple, Tuple]:
    # processing code
    # ...

    # imgs: Tuple[Tensor, Tensor]
    # disparities: Tuple[Tensor, Tensor]
    # occlusion_masks: Tuple[Tensor, Tensor]

    if self.transforms is not None:
        imgs, disparities, occlusion_masks = self.transforms(imgs, disparities, occlusion_masks)

    return imgs, disparities, occlusion_masks

# in user land, the API feeds all the available data to the user
for (imgs, disparities, occlusion_masks) in stereo_dataloader:
    # however, the user becomes responsible for deconstructing the batch
    # in order to recover the classic task definition data
    img_l, img_r, disparity, occlusion_mask = imgs[0], imgs[1], disparities[0], occlusion_masks[0]
    # ...

User feedback would be highly appreciated, as it is highly unlikely that any one person can be aware of all the use-cases / methods in Stereo Matching. Some preliminary pros and cons for each proposal:

Proposal 1

Pros:

Cons:

Proposal 2

Pros:

Cons:

cc @pmeier @YosuaMichael

NicolasHug commented 2 years ago

Thanks for starting this discussion @TeodorPoncu !

Do you have thoughts on how the Stereo Matching task compares with Optical Flow? We already have Optical Flow datasets like Kitti, HD1K, FlyingThings3D and they return something quite similar to what you described above:

https://github.com/pytorch/vision/blob/e75a333782bb5d5ffdf5355e766eb5937fc6697c/torchvision/datasets/_optical_flow.py#L65-L72

Another thing I'm wondering: what would Proposal 2 look like for datasets that do not have the right disparity map or the right occlusion mask? Would disparities and occlusion_masks be tuples with only one element?

Side note: all of this will hopefully become easier as we're revamping our datasets to return dictionaries with arbitrary keys, which would allow us to return whatever we want depending on availability. But there's no clear ETA on that yet.
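
For illustration only (these key names are invented, not the future API), such a dict-based sample could look like:

import torch

sample = {
    "img_left": torch.rand(3, 240, 320),
    "img_right": torch.rand(3, 240, 320),
    "disparity_left": torch.rand(1, 240, 320),
    "valid_mask_left": torch.ones(1, 240, 320, dtype=torch.bool),
}
# a dataset that ships right-view ground truth would simply add more keys
sample["disparity_right"] = torch.rand(1, 240, 320)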

TeodorPoncu commented 2 years ago

An approach similar to the one in the optical flow dataset would indeed lead to tuples with only one element, @NicolasHug, and would require the user to make checks on the sizes / shapes of the batch elements. Optical Flow generally tracks pixel shifts through time from a single view, whilst Stereo Matching tracks pixel shifts between two views captured at the same time by two separate cameras.

Some functionality that would be nice to have is an easy way of creating a DatasetWrapper that can sample from several StereoMatchingDatasets at the same time, as this is a technique used recently in the SotA: CREStereo. For that particular use case, Proposal 1 is better suited as it guarantees data uniformity.

Providing users a way to mix and match datasets under Proposal 2 would require some rather hacky solutions. Users would either have to provide their own custom collate_fn if they find themselves mixing datasets with different formats, or rely on placeholder Tensors, loss masking, or other such mechanisms.
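
To make that concrete, here is a rough sketch (an illustration, not a recommendation) of the kind of custom collate_fn Proposal 2 would push onto users who only need the left-view annotations:

import torch

def left_only_collate(batch):
    # each sample is (imgs, disparities, occlusion_masks); the inner tuples
    # may hold one or two elements depending on the dataset, so only the
    # left-view entries are kept to obtain uniform batches across datasets
    imgs_left = torch.stack([imgs[0] for imgs, _, _ in batch])
    imgs_right = torch.stack([imgs[1] for imgs, _, _ in batch])
    disparities = torch.stack([disps[0] for _, disps, _ in batch])
    masks = torch.stack([occ[0] for _, _, occ in batch])
    return imgs_left, imgs_right, disparities, masks

# loader = DataLoader(mixed_dataset, batch_size=4, collate_fn=left_only_collate)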

Whilst the multi-dataset use case is not necessarily functionality that must be provided to the end user, it is at least a higher-order functionality that recurs in several other vision tasks such as classification, segmentation, detection, etc. Should a method arise where both disparity maps are utilised after we had opted for Proposal 1, that would imply changing the API both internally and in terms of its interface.

NicolasHug commented 2 years ago

Thanks for the details @TeodorPoncu

Sounds like we don't yet have a clear idea about the usefulness of returning the right [disparity, mask]. If we ever do, hopefully by then our new API will be out, in which case we could add them rather easily, and without breaking anything.

Since proposal 1 is also consistent with what we have for optical flow datasets, it seems that option 1 is the strongest so far.

tpet commented 1 year ago

Thanks for the effort of making stereo datasets available. I recently started to use some of these, so I'd like to add my user-experience data point to the discussion in case it could be useful.

It seems a bit counter-intuitive to me that the transforms function has different inputs than the dataset items; I expected these to be the same. In that respect, the approach in the optical flow datasets seems more intuitive to me. Also, is there any reason to transform data which won't be returned anyway?

Another thing: wouldn't it be useful to allow the transforms to modify which outputs are returned, and to return that directly as the dataset item? The datasets I checked don't allow that now, but typically allowing it wouldn't break anything there either. If it were allowed, one could make datasets compatible simply by providing a transforms function suited to that use case.
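
As a hypothetical example of what I mean, a transforms function that is allowed to decide the output could align datasets to a common format on its own:

def to_left_only(imgs, disparities, occlusion_masks):
    # whatever this returns would become the dataset item, so datasets with
    # extra right-view annotations can be reduced to a common
    # (left, right, disparity, mask) form purely inside the transforms
    return imgs[0], imgs[1], disparities[0], occlusion_masks[0]

# dataset = SomeStereoDataset(..., transforms=to_left_only)  # hypothetical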

TeodorPoncu commented 1 year ago

Hey @tpet! Thank you for your input and feedback as it is very valuable!

At the time, we didn't have a very clear picture of how users would prefer to interact with this specific vision task. One of the cases we were trying to build around was training with batches containing samples from several datasets at the same time, similar to what is loosely described in the CREStereo paper.

Something we were taking into account was trying to have a unified way of calling the transforms irrespective of the dataset.

Because some augmentation techniques, namely horizontal flipping, require both a valid left and a valid right disparity map in order to be performed, we wanted to somehow unify the returns / behaviour of each dataset.
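
For intuition, a minimal sketch of that augmentation (not torchvision code): flipping both images horizontally swaps the roles of the two views, so the flipped right disparity becomes the ground truth for the new reference image.

import torch

def horizontal_flip(img_l, img_r, disp_l, disp_r):
    # after flipping, the old right view becomes the new left (reference)
    # view and vice versa, so the two disparity maps swap roles as well;
    # without a valid right disparity this augmentation cannot be applied
    def f(t):
        return torch.flip(t, dims=[-1])
    return f(img_r), f(img_l), f(disp_r), f(disp_l)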

Even though multiple inputs are necessary to perform that type of augmentation, a "classic" stereo-matching pipeline still requires only some of them in order to perform the optimization step: (left image, right image, disparity, Optional[mask]). As such, the behaviour we were looking for in terms of functionality should've looked something like:

for batch in stereo_dataset_loader:
    left, right, disp, mask = batch
    preds = model(left, right)
    # the loss compares predictions against the ground-truth disparity,
    # with the mask excluding occluded / invalid pixels
    loss = criterion(preds, disp, mask)
...

Here, stereo_dataset_loader was a simple torch.utils.data.DataLoader which received as input either a single StereoMatchingDataset or a torch.utils.data.ConcatDataset composed of multiple StereoMatchingDatasets that were all initialised with the same transform chain.
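
Concretely (the dataset variables below are placeholders for two such datasets sharing one transform chain), that composition might look like:

from torch.utils.data import ConcatDataset, DataLoader

# both datasets return the same 4-tuple, so the default collate function
# stacks their samples into uniform batches without any extra glue code
mixed = ConcatDataset([dataset_a, dataset_b])
stereo_dataset_loader = DataLoader(mixed, batch_size=8, shuffle=True)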

The choice at the time stemmed from the fact that, if each type of dataset were to return only the items it contains irrespective of the transform chain, users would have had to define independent transform chains for each dataset and a custom batch collate function for the DataLoader.

We would very much appreciate it if you could share what you believe to be a desired / ideal workflow when dealing with Stereo Matching tasks and training pipelines, as that would provide very valuable insight into what users might want when utilising these datasets.

tpet commented 1 year ago

I agree that it should be possible to easily create batches from multiple datasets. I was thinking that the transforms function could be used to make the datasets compatible. In cases where one needs to combine datasets providing different things (e.g. some are missing the masks, some have only the left disparity), it seems that one shared transform chain may not be enough anyway, and it may be good to expose this fact to the user.

Personally, I would prefer to keep the input of the transforms function identical to what the dataset provides, so that one can think of it simply as a conversion from dataset[i] to transforms(*dataset[i]) (perhaps transforms(**dataset[i]) for dict samples). If it is required that the transforms function have a common interface across all stereo datasets, then maybe all the datasets should return output in the same form.

Regarding the usage of the transforms function and its outputs, I would allow it to change what goes to the output. That would add a lot of flexibility and could help bring different datasets to common ground. If the compatibility effort is already done in the transforms function, I suppose one could then use the default collate function.