Open sarmientoj24 opened 3 years ago

How do I use it on a custom dataset? For example, I only want to use one class in OpenImages, say the car class, and I would want to use the InceptionV3 embedding.
It should be fairly easy to apply to custom datasets. You need to create a dataset object that contains your data and then pass it to the select_instances function. See https://github.com/uoguelph-mlrg/instance_selection_for_gans#applying-instance-selection-to-your-own-dataset or https://github.com/uoguelph-mlrg/instance_selection_for_gans/blob/master/instance_selection.py#L91. To use the InceptionV3 embedding, pass 'inceptionv3' as the embedding arg.
If you only want a single class from a larger dataset (like OpenImages) you may need to create a custom dataset object that only loads images from that class. If all car images are in a single folder you could maybe use an ImageFolder dataset (https://pytorch.org/vision/stable/datasets.html#torchvision.datasets.ImageFolder).
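For reference, a minimal sketch of what that could look like for a single-class car folder is below. The keyword names and values are based on the README and instance_selection.py linked above, but the paths and the retention ratio are placeholders, so double-check them against the actual function signature:

```python
import torchvision.transforms as T
from torchvision.datasets import ImageFolder
from instance_selection import select_instances  # from this repo

# ImageFolder expects class sub-folders, e.g. data/cars/car/000001.jpg,
# so a single-class dataset just has one sub-folder.
transform = T.Compose([T.Resize(256), T.CenterCrop(256), T.ToTensor()])
dataset = ImageFolder('data/cars', transform=transform)

# Keep the densest portion of the images, scoring density in InceptionV3
# feature space. The retention ratio here is only an example value.
selected_dataset = select_instances(dataset, retention_ratio=50, embedding='inceptionv3')
```

The returned dataset behaves like a regular PyTorch dataset, so it can be wrapped in a DataLoader and used for training as usual.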
For the custom dataset, is it fine if I only have one class that I have already loaded from, say, a folder?
Yes, that should work fine.
Is there a place where I can access this subset of selected instances? I would like to observe these images and pass them on to StyleGAN2 as my training dataset. I have an unlabelled custom dataset in a single folder and I have applied the select_instances function to it with InceptionV3 embeddings, but I don't see any output or the original folder being reduced. Thanks!
The select_instances function returns a Subset dataset object, so if you want to view the images it selected you could iterate through it to generate some sample sheets, or even save each image separately to file, if that's what you are looking for. It works by keeping track of the indices from the original dataset that it needs for the reduced dataset, so your original folder won't be changed, and it doesn't save any new files.
If you want to save the indices that it selected you can pass the function a file path ending in .pkl (https://github.com/uoguelph-mlrg/instance_selection_for_gans/blob/master/instance_selection.py#L95), and that will save a pickle file of indices. However, that might be a bit tricky to interpret, since the ordering is with reference to the file paths in the original dataset object, which may not line up with how files are listed in your directory. You might be able to dig into the dataset attributes to find a list of file paths though. For example if your original dataset is a DatasetFolder or ImageFolder object then it should have a self.samples attribute containing file paths which lines up with the selected indices.
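To make that concrete, here is a rough sketch of saving copies of the selected images and mapping the pickled indices back to file paths. The keyword name used for the .pkl path is a guess, so check the signature at the line linked above before relying on it:

```python
import os
import pickle

import torchvision.transforms as T
from torchvision.datasets import ImageFolder
from instance_selection import select_instances  # from this repo

transform = T.Compose([T.Resize(256), T.CenterCrop(256), T.ToTensor()])
dataset = ImageFolder('data/cars', transform=transform)

# The kwarg name for the .pkl path is an assumption -- see
# instance_selection.py#L95 for the actual argument name.
subset = select_instances(dataset, retention_ratio=50, embedding='inceptionv3',
                          output_path='selected_indices.pkl')

# The Subset only stores indices into the original dataset, so nothing on
# disk changes; save copies of the selected images to inspect them.
os.makedirs('selected_images', exist_ok=True)
for i, (img, _) in enumerate(subset):
    T.ToPILImage()(img).save(f'selected_images/{i:06d}.png')

# Map the pickled indices back to file paths via ImageFolder's samples
# attribute (a list of (path, class_index) tuples in the original ordering).
with open('selected_indices.pkl', 'rb') as f:
    indices = pickle.load(f)
selected_paths = [dataset.samples[i][0] for i in indices]
```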
Thank you for the prompt response, this was very helpful. I am applying your code to solve a generation problem in the low-data regime and it has proven somewhat useful in bringing about convergence. I have a few questions to discuss around identifying clusters within the data, to more precisely retain high-density data belonging to a single homogeneous cluster. Currently, the retention ratio seems somewhat arbitrary and can only be validated after the generation task. If this piques your interest, are you available for a chat sometime? :) Thanks in advance!
Also, would it help to extract embeddings from more recent, deeper architectures that perform better than InceptionV3 and the other options?