libffcv / ffcv

FFCV: Fast Forward Computer Vision (and other ML workloads!)
https://ffcv.io
Apache License 2.0

Indexing (to Subset) Loader Class without having to generate beton files again #315

Open meghbhalerao opened 1 year ago

meghbhalerao commented 1 year ago

Hi, does anyone know if there is a way to get a certain subset of images and corresponding labels from .beton files? For example, if I want to access a subset of a standard PyTorch Dataset, I can use the Subset class defined in torch.utils.data and do subset_data = Subset(whole_trainset, subset_idxs). If I have a Loader object in ffcv, is there a way to do the same?

The worst case would be to generate the .beton files again for the subset given by the indices, but I was wondering if there is a way to index the Loader object directly.

Thanks and please let me know if anything is unclear.

andrewilyas commented 1 year ago

Hi! The Loader object takes an indices argument that should do what you want.

meghbhalerao commented 1 year ago

Thanks @andrewilyas. When I create a dataloader this way, loader_1 = Loader(filepath, indices=list_of_subset_idxs), it works and I am able to index the subset. However, say I have an existing object of the Loader class, called loader_2, and I want to index it. I do the following: loader_2_subset = setattr(loader_2, 'indices', list_of_subset_idxs). This does not work: when iterating through the dataloader, it still iterates through the whole dataset. Am I doing something wrong? Please let me know, and thanks for your time.
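Two separate things may be at play in the setattr attempt above. In Python, setattr always returns None, so loader_2_subset would be None either way. Separately, if a class derives state from an argument inside __init__, changing the attribute afterwards leaves that derived state stale. The sketch below illustrates both with a hypothetical ToyLoader; it is a stand-in written for this example, not FFCV's Loader, whose internals are not shown in this thread.

```python
class ToyLoader:
    """Hypothetical stand-in for a loader that precomputes its order in __init__."""

    def __init__(self, num_samples, indices=None):
        self.indices = list(range(num_samples)) if indices is None else list(indices)
        # Work done once at construction time, loosely analogous to pre-loading.
        self._order = list(self.indices)

    def __iter__(self):
        return iter(self._order)


loader = ToyLoader(num_samples=5)

# setattr mutates the object in place and returns None.
result = setattr(loader, "indices", [0, 2])

print(result)          # None
print(loader.indices)  # [0, 2]
print(list(loader))    # [0, 1, 2, 3, 4]  (stale: _order was built in __init__)

# Re-instantiating with the indices argument rebuilds the precomputed state.
subset_loader = ToyLoader(num_samples=5, indices=[0, 2])
print(list(subset_loader))  # [0, 2]
```

Whether FFCV's Loader behaves like this toy class is an assumption; the sketch only shows why setting an attribute after construction can silently have no effect on iteration.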

andrewilyas commented 1 year ago

I think it's a bit tough to do that, since a lot of pre-loading happens inside the initialization of the Loader class. I can't think of a use case where one can't just re-initialize the Loader class, though. Is there a specific use case where that's necessary?

meghbhalerao commented 1 year ago

My use case is as follows. I have already defined and instantiated an object of the Loader class (called obj), and then I do some processing using obj, which returns a set of indices. I now want to use the same obj, but iterate only through this subset. Of course I could just re-instantiate it, using indices = subset_indices along with all the parameters I passed to it initially, or I could just setattr the indices variable, which would result in cleaner code. The workaround I am using is mentioned in this issue - https://github.com/libffcv/ffcv/issues/316 - but as I have mentioned there, it seems to have some problems. Being able to set the indices directly would just make my codebase more convenient and easier to use for my purposes.
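One way to keep re-instantiation from cluttering the code is to bind the shared constructor arguments once and vary only indices. The following is a minimal sketch using functools.partial from the standard library. ToyLoader, the file name train.beton, and batch_size=32 are all hypothetical placeholders chosen for illustration; with FFCV, the Loader class and its real arguments would presumably take their place.

```python
from functools import partial


class ToyLoader:
    """Hypothetical stand-in; with FFCV, the real Loader class would go here."""

    def __init__(self, fname, batch_size, indices=None):
        self.fname = fname
        self.batch_size = batch_size
        self.indices = indices


# Bind the shared arguments once; only `indices` varies between instances.
# "train.beton" and batch_size=32 are placeholder values.
make_loader = partial(ToyLoader, "train.beton", batch_size=32)

full_loader = make_loader()

# Placeholder for the indices produced by processing done with full_loader.
subset_indices = [0, 3, 7]
subset_loader = make_loader(indices=subset_indices)
```

This does not avoid the cost of constructing a second loader; it only keeps the repeated arguments in one place.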