TeodorPoncu opened this issue 2 years ago (status: Open)
Thanks for starting this discussion @TeodorPoncu !
Do you have thoughts on how the Stereo Matching task compares with Optical Flow? We already have Optical Flow datasets like Kitti, HD1K, FlyingThings3D and they return something quite similar to what you described above:
Another thing I'm wondering: what would Proposal 2 look like for datasets that do not have the right disparity map or the right occlusion mask? Would `disparities` and `occlusion_masks` be tuples with only one element?
Side note: all this will hopefully become easier as we're revamping our datasets to return dictionaries with arbitrary keys, which would allow us to return whatever we want depending on availability. But there's no clear ETA on that yet.
An approach similar to the one in the optical flow datasets would indeed lead to tuples with only one element, @NicolasHug, and would require the user to make checks on the sizes / shapes of the batch elements. Optical Flow generally tracks pixel shifts through time from one view, whilst Stereo Matching tracks pixel shifts between two views coming from two separate cameras captured at the same time.
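To make the parallel concrete, here is a purely illustrative sketch of the two sample layouts; the field names are assumptions for illustration, not the actual torchvision return values:

```python
# Optical flow: two frames from ONE camera at times t and t+1,
# plus the per-pixel motion ("flow") between them.
optical_flow_sample = ("frame_t", "frame_t_plus_1", "flow", "valid_mask")

# Stereo matching: two frames from TWO cameras at the SAME time,
# plus the horizontal pixel shift ("disparity") between the views.
stereo_sample = ("left_image", "right_image", "disparity", "valid_mask")

# Both tasks share the same 4-tuple shape, which is why the optical
# flow dataset layout maps naturally onto stereo matching.
assert len(optical_flow_sample) == len(stereo_sample) == 4
```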
Some functionality that would be nice to have is an easy way of creating a `DatasetWrapper` that can sample from several `StereoMatchingDataset`s at the same time, as this technique was used recently in the SotA method CREStereo. For that particular use case, Proposal 1 is better suited as it guarantees data uniformity.
Providing users a way to mix and match datasets under Proposal 2 would require some rather hacky solutions. Users would either have to provide their own custom `collate_fn` if they find themselves mixing datasets of different formats, or rely on placeholder `Tensor`s, loss masking, or other such mechanisms.
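As a sketch of the kind of "hacky" `collate_fn` this would force on users, consider mixing one dataset that has both disparities with one that only has the left one. Everything here (function name, placeholder convention) is an illustrative assumption, not torchvision API:

```python
PLACEHOLDER = None  # stand-in for a placeholder tensor

def collate_mixed(samples):
    """Pad samples that lack right-channel annotations so every batch
    element ends up with the same number of fields."""
    padded = []
    for left, right, disparities, masks in samples:
        if len(disparities) == 1:          # only the left disparity present
            disparities = (disparities[0], PLACEHOLDER)
        if len(masks) == 1:                # only the left mask present
            masks = (masks[0], PLACEHOLDER)
        padded.append((left, right, disparities, masks))
    return padded

# One sample with both disparities, one with only the left one:
full = ("L", "R", ("dL", "dR"), ("mL", "mR"))
partial = ("L", "R", ("dL",), ("mL",))
batch = collate_mixed([full, partial])
assert batch[1][2] == ("dL", None)
```

Downstream code then has to special-case the placeholders, which is exactly the burden Proposal 1 avoids.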
Whilst the multi-dataset use case is not necessarily functionality that must be provided to the end user, it is at least a higher-order functionality that recurs in several other vision tasks such as classification, segmentation, detection, etc. Should a method arise where both disparity maps are utilised and we had opted for Proposal 1, that would imply changing the API both internally and in terms of its interface.
Thanks for the details @TeodorPoncu
Sounds like we don't yet have a clear idea about the usefulness of returning the right [disparity, mask]. If we ever do, hopefully by then our new API will be out, in which case we could add them rather easily, and without breaking anything.
Since proposal 1 is also consistent with what we have for optical flow datasets, it seems that option 1 is the strongest so far.
Thanks for the effort of making stereo datasets available. I recently started to use some of these so I'd like to add my user experience data point into the discussion in case it could be useful.
It seems a bit counter-intuitive to me that the transforms function has different inputs than the dataset items; I expected these to be the same. In that respect, the approach in the optical flow datasets seems more intuitive to me. Also, is there any reason to transform data which won't be returned anyway?
Another thing: wouldn't it be useful to allow the transforms to modify which outputs are returned, and to return that directly as the dataset item? The datasets I checked don't allow that now, but typically it wouldn't break anything there either if it were allowed. If it were, one could make datasets compatible simply by providing a transforms function suited to the use case.
Hey @tpet! Thank you for your input and feedback as it is very valuable!
At that time the picture wasn't very clear of how users would prefer to interact with this specific vision task. One of the cases we were trying to build around was that of training using batches containing samples from several datasets at the same time, similarly to what was loosely described in the CREStereo paper.
Something we were taking into account was trying to have a unified way of calling the transforms irrespective of the dataset.
Because some augmentation techniques, namely horizontal flipping, require having both a valid left and a right disparity map in order to be performed, we wanted to somehow unify the returns / behaviour of each dataset.
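The horizontal-flip constraint can be sketched as follows, using 1-D Python lists as stand-ins for image and disparity tensors; the function and its signature are illustrative assumptions, not torchvision code:

```python
def horizontal_flip(left_img, right_img, disp_left, disp_right):
    """After mirroring, the cameras swap roles: the mirrored right view
    becomes the new left view, and the mirrored right->left disparity
    becomes the new left->right disparity."""
    flip = lambda x: x[::-1]
    new_left, new_right = flip(right_img), flip(left_img)
    new_disp_left, new_disp_right = flip(disp_right), flip(disp_left)
    return new_left, new_right, new_disp_left, new_disp_right

# A dataset that only provides the left disparity cannot apply this
# augmentation, which is what motivated unifying the dataset returns.
out = horizontal_flip([1, 2], [3, 4], [5, 6], [7, 8])
assert out == ([4, 3], [2, 1], [8, 7], [6, 5])
```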
Even though multiple inputs are necessary to perform said type of augmentation, a "classic" stereo-matching pipeline still requires only some of them in order to perform the optimisation step: `(left image, right image, disparity, Optional[mask])`. As such, the behaviour we were looking for in terms of functionality should've looked something like:
```python
for batch in stereo_dataset_loader:
    left, right, disp, mask = batch
    preds = model(left, right)
    # the loss needs the ground-truth disparity as well as the validity mask
    loss = criterion(preds, disp, mask)
    ...
```
where `stereo_dataset_loader` was a simple `torch.utils.data.DataLoader` which received as input either one `StereoMatchingDataset` or a `torch.utils.data.ConcatDataset` composed of multiple `StereoMatchingDataset`s that were all initialised with the same transform chain.
The choice at the time stemmed from the fact that if each type of dataset were to return only the items it contains, irrespective of the transform chain, then users would've had to define independent transform chains for each dataset and a custom batch collate function for the `DataLoader`.
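The intended usage can be sketched with plain-Python stand-ins (so the example is self-contained); a real pipeline would subclass `torch.utils.data.Dataset` and use `torch.utils.data.ConcatDataset` and `DataLoader`, and all names below are assumptions:

```python
class TinyStereoDataset:
    """Stand-in for a StereoMatchingDataset with a transform chain."""
    def __init__(self, samples, transforms=None):
        self.samples, self.transforms = samples, transforms
    def __len__(self):
        return len(self.samples)
    def __getitem__(self, i):
        sample = self.samples[i]
        return self.transforms(sample) if self.transforms else sample

class TinyConcatDataset:
    """Stand-in for torch.utils.data.ConcatDataset."""
    def __init__(self, datasets):
        self.datasets = datasets
    def __len__(self):
        return sum(len(d) for d in self.datasets)
    def __getitem__(self, i):
        for d in self.datasets:
            if i < len(d):
                return d[i]
            i -= len(d)
        raise IndexError(i)

# One shared transform chain keeps the samples uniform across datasets:
shared = lambda s: tuple(str(x).upper() for x in s)
a = TinyStereoDataset([("l1", "r1", "d1", "m1")], transforms=shared)
b = TinyStereoDataset([("l2", "r2", "d2", "m2")], transforms=shared)
mixed = TinyConcatDataset([a, b])
assert mixed[1] == ("L2", "R2", "D2", "M2")
```

Because every constituent dataset returns the same 4-tuple, the default collate function suffices.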
It would be very much appreciated if you could share what you believe to be a desired / ideal workflow when dealing with Stereo Matching related tasks / training pipelines as that would provide very valuable insights into what users might want when utilising these datasets.
I agree that it should be possible to easily create batches from multiple datasets. I was thinking that the transforms function could be used to make the datasets compatible. In cases where one needs to combine datasets providing different things (e.g. some are missing the masks, some have only left disparity), it seems that one shared transform chain may not be enough anyway and it may be good to expose this fact to the user.
Personally, I would prefer to keep the input of the transforms function identical to what the dataset provides, so that one can think of it simply as a conversion from `dataset[i]` to `transforms(*dataset[i])` (perhaps `transforms(**dataset[i])` for dict examples). If it is required that the transforms function have a common interface for all stereo datasets, then maybe all the datasets should return output in the same form.
Regarding the usage of the transforms function and its outputs, I would allow it to change what goes into the output. That would add a lot of flexibility and may help bring different datasets to common ground. If the compatibility effort is already done in the transforms function, I suppose one could then use the default collate function.
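This suggestion can be sketched as a dataset that returns `transforms(*item)` directly, so the transforms function alone decides the output shape. The class and function names here are illustrative assumptions, not existing API:

```python
class PassThroughStereoDataset:
    """Dataset whose output is whatever its transforms function returns."""
    def __init__(self, items, transforms=None):
        self.items, self.transforms = items, transforms
    def __getitem__(self, i):
        item = self.items[i]
        if self.transforms is None:
            return item
        return self.transforms(*item)  # the transforms decide the output

# A dataset whose native items carry both disparities can be reduced to
# the "classic" 4-tuple purely by the transforms function:
def keep_left_only(left, right, disparities, masks):
    return left, right, disparities[0], masks[0]

ds = PassThroughStereoDataset(
    [("L", "R", ("dL", "dR"), ("mL", "mR"))], transforms=keep_left_only
)
assert ds[0] == ("L", "R", "dL", "mL")
```

Under this scheme, making heterogeneous datasets batch-compatible is just a matter of giving each one a transforms function with a common output form.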
🚀 The feature
The proposed feature aims to extend the current datasets API with datasets geared towards the task of Stereo Matching. Its main use case is providing a unified way of consuming classic Stereo Matching datasets. Other considered dataset additions are: Sintel, FallingThings, InStereo2K, ETH3D, Holopix50k. A high-level preview of the dataset interface would be:
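A hedged sketch of what such a preview might look like; the class name, constructor arguments, and return layout below are assumptions for illustration (a self-contained stand-in rather than the proposed torchvision API, which would subclass `torch.utils.data.Dataset`):

```python
class FakeStereoDataset:
    """Stand-in for a concrete StereoMatchingDataset subclass."""
    def __init__(self, root, split="train", transforms=None):
        self.root, self.split, self.transforms = root, split, transforms
        # a real dataset would index image / disparity files under `root`
        self._items = [("left_img", "right_img", "disparity", "valid_mask")]
    def __len__(self):
        return len(self._items)
    def __getitem__(self, i):
        item = self._items[i]
        return self.transforms(item) if self.transforms else item

dataset = FakeStereoDataset(root="datasets/", split="train")
left, right, disparity, valid_mask = dataset[0]
assert len(dataset) == 1
```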
Motivation, pitch
This API addition would cut down the engineering time required for people looking to work on, experiment with, or evaluate Stereo Matching models, or who want easy access to stereo image data.
Throughout the literature, recent methods (1, 2) make use of multiple datasets that all have different formatting or specifications. A unified dataset API would streamline interacting with different data sources at the same time.
Alternatives
The official repo for RAFT-Stereo provides similar functionality for the datasets on which the network proposed in the paper was trained / evaluated. The proposed `StereoMatchingDataset` API would be largely similar to it, whilst following idiomatic `torchvision`.

Additional context
Stereo Matching task formulation.
Commonly throughout the literature, the task of stereo matching requires:
- a reference image (traditionally the left image),
- its stereo pair (traditionally the right image),
- the disparity map between the two images (traditionally the left->right disparity), and
- an occlusion / validity mask for pixels from the reference image that do not have a correspondent in the stereo pair (traditionally left->right).

The proposed API would serve data to the user in the following manner:

Proposal 1.
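A sketch of what a Proposal 1 sample could look like: one flat tuple carrying only the left-channel annotations. The field names and toy values are assumptions for illustration:

```python
img_left = [[0.1, 0.2]]        # reference (left) image
img_right = [[0.2, 0.1]]       # stereo pair (right) image
disparity = [[1.0, 0.0]]       # left->right disparity map
valid_mask = [[True, False]]   # left->right occlusion / validity mask

proposal_1_sample = (img_left, img_right, disparity, valid_mask)

# Matches the "classic" pipeline unpacking directly:
left, right, disp, mask = proposal_1_sample
assert len(proposal_1_sample) == 4
```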
The above interface for data consumption is more aligned with the larger dataset ecosystem in `torchvision`, where a dataset provides all the required tensors to perform training. However, this approach assumes the user / algorithm does not require the right disparity map or the right occlusion mask. An alternative would be a modification of the interface such that the user may access the right-channel annotations:

Proposal 2.
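A sketch of a Proposal 2 sample: the disparities and masks become tuples holding both the left->right and right->left annotations (with a placeholder when a dataset lacks the right-channel ones). The names are assumptions for illustration:

```python
disparities = ("disp_left", "disp_right")   # or ("disp_left", None)
masks = ("mask_left", "mask_right")         # or ("mask_left", None)
proposal_2_sample = ("img_left", "img_right", disparities, masks)

# The "classic" pipeline then has to unpack the left channel itself:
left_disp = proposal_2_sample[2][0]
assert left_disp == "disp_left"
```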
User feedback would be highly appreciated, as it is highly unlikely one can be aware of all the use-cases / methods in Stereo Matching. Some preliminary pros and cons for each proposal:
Proposal 1
Pros:
- Consistent with the optical flow datasets and the rest of torchvision.
Cons:
- Assumes the user / algorithm does not need the right disparity map or the right occlusion mask; supporting them later would mean changing the API.

Proposal 2
Pros:
- Exposes the right-channel annotations where available.
Cons:
- Datasets lacking right-channel annotations would have to return placeholders (e.g. None for the right channel annotations).
- Users have to select which tensors are provided to models / losses.

cc @pmeier @YosuaMichael