openvinotoolkit / datumaro

Dataset Management Framework, a Python library and a CLI tool to build, analyze and manage Computer Vision datasets.
https://openvinotoolkit.github.io/datumaro/
MIT License

[Question / Feature Request] PyTorch dataset abstraction #1627

Open HalkScout opened 3 weeks ago

HalkScout commented 3 weeks ago

All of the documentation I have sifted through describes re-saving the data whenever the format changes, but is there a way to use this library without that? A good use case is a very large dataset: you read the data in a supported format, conversion happens on the fly, and the result is passed straight into your pipeline. It adds to the cost of data loading, but that can be worth it if it saves terabytes of disk space. (A quick sketch of what I mean follows the import example below.)

I am loading a dataset like this:

from datumaro.components.dataset import Dataset
dataset = Dataset.import_from("./data", "yolo")
print(dataset)
Dataset
    ...
subsets
    test: # of items=...
    train: # of items=...
    val: # of items=...
infos
    ...
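
To sketch what I mean by on-the-fly conversion (just an illustration — I am assuming here that item.media.data gives the decoded image as a numpy array; I haven't verified the exact attribute names):

# Import once, iterate lazily, never write a converted copy to disk
for item in dataset:
    image = item.media.data          # decoded on demand
    annotations = item.annotations   # Datumaro annotation objects
    ...                              # hand off to the training pipeline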

Where I would want to implement a PyTorch Lightning data module like:

import lightning as L
from torch.utils.data import DataLoader

from datumaro.components.dataset import Dataset

class MyDataModule(L.LightningDataModule):
    def __init__(self, batch_size: int = 32):
        super().__init__()
        self.batch_size = batch_size

    def setup(self, stage: str):
        # Being able to load only specific subsets would be nice here too,
        # but that sounds like a large undertaking
        dataset = Dataset.import_from("./data", "yolo")
        if stage == "fit":
            self.dataset_train = dataset.get_subset("train")
            self.dataset_val = dataset.get_subset("val")
        if stage == "test":
            self.dataset_test = dataset.get_subset("test")

    def train_dataloader(self):
        return DataLoader(self.dataset_train, batch_size=self.batch_size)

    # and so on for "test" and "val"
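
For the DataLoader to actually work, each Datumaro item would also need to be mapped to something batchable. This is roughly what I have in mind — a sketch only, where DatumaroTorchDataset is my own hypothetical adapter and item.media.data is my assumption about the API:

from torch.utils.data import Dataset as TorchDataset

class DatumaroTorchDataset(TorchDataset):
    # Hypothetical adapter: index into a Datumaro subset from a DataLoader
    def __init__(self, dm_subset, transform=None):
        self.items = list(dm_subset)  # subsets iterate fine; indexing is what's missing
        self.transform = transform

    def __len__(self):
        return len(self.items)

    def __getitem__(self, idx):
        item = self.items[idx]
        image = item.media.data       # assumed: decoded image as a numpy array
        if self.transform is not None:
            image = self.transform(image)
        return image, item.annotations

Then setup() could do something like self.dataset_train = DatumaroTorchDataset(dataset.get_subset("train")).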

Is this possible? Neither of the solutions I have tried works:

train = dataset.get_subset("train")
print(train.__getitem__(0))
AttributeError: 'DatasetSubset' object has no attribute '__getitem__'

Nor does an attempt at building a wrapper around it:

train = dataset.get_subset("train")
print(train.get(0))
---> [96]     assert (subset or DEFAULT_SUBSET_NAME) == (self.name or DEFAULT_SUBSET_NAME)
AssertionError: 

The problem I am running into is that the subsets cannot be separated from the main dataset and are not treated as datasets in their own right. Could I be doing anything differently?

This is the main thing stopping me from using this really useful library in my pipeline. I can see its potential, but it doesn't offer the specific data-loading features I am looking for (which might be by design). If anyone knows of a good method / tool for this, I would love to hear about it! Thank you 😄

HalkScout commented 3 weeks ago

As referenced in this feature request and PR: https://github.com/openvinotoolkit/datumaro/issues/1212, https://github.com/openvinotoolkit/datumaro/pull/1247

itrushkin commented 2 weeks ago

Hi @HalkScout. Thanks for your interest in Datumaro! 😊

The main purpose of changing the data format is to export it to disk; see our notebook on format conversion.

Regarding indexing, while the .get_subset() method's return type doesn't currently support direct indexing, you can easily convert it to a dm.Dataset using .as_dataset(). This will allow you to use standard indexing operations.

Here's a code example:

>>> train = dataset.get_subset("train").as_dataset()
>>> print(train[0])

In your second snippet, make sure you're passing the correct arguments to the .get() method: the first argument should be the desired item's id (usually the image file name), and the second should be the subset name, which in this case is 'train'.
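
For example (where "image_0001" is just a placeholder — substitute one of your actual item ids):

>>> item = dataset.get("image_0001", "train")
>>> print(item)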

If you're still encountering issues, please provide more details about your specific use case, and I'll be happy to assist further.