tensorflow / similarity

TensorFlow Similarity is a Python package focused on making similarity learning quick and easy.

[REQUEST] TFDatasetMultiShotMemorySampler for custom datasets #300

Open Lunatik00 opened 2 years ago

Lunatik00 commented 2 years ago

hi, I am testing different dataflows for training. I have tried both the sampler and a plain dataset (built with tf.keras.utils.image_dataset_from_directory), and loading the same data through each ends up with a very different maximum batch size on the same GPU: about 20 with the sampler versus over 30 with the dataset. The dataset can fit the most examples per batch, but it does not divide the data well per batch (the class balance the sampler provides is lost). So I want to try using the dataset as the input to the memory sampler, but the current function is written to only download a non-custom (catalog) dataset. I will try modifications to make it work, but I don't think my code will be generic, and I haven't used overloaded functions before, so I am leaving this as a request that should be simple to implement.
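For reference, the catalog path mentioned above looks roughly like this. This is a minimal sketch based on the documented TFDatasetMultiShotMemorySampler usage; the dataset name and batch settings are illustrative:

```python
from tensorflow_similarity.samplers import TFDatasetMultiShotMemorySampler

# The sampler downloads a dataset from the TensorFlow Datasets catalog by
# name; there is no argument that accepts an already-built tf.data.Dataset,
# which is what this request is about.
sampler = TFDatasetMultiShotMemorySampler(
    dataset_name="mnist",  # must be a TFDS catalog name
    classes_per_batch=10,
)
```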

owenvallis commented 1 year ago

Hi Lunatik00, apologies for the slow response. We currently support loading custom data using the MultiShotMemorySampler. The data is loaded into memory and properly sampled over the classes to ensure that the batches are created correctly. However, some datasets can be too large to hold in memory, e.g., larger image datasets.
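A minimal sketch of that in-memory path (the array shapes and batch settings here are illustrative, not from the original thread):

```python
import numpy as np
from tensorflow_similarity.samplers import MultiShotMemorySampler

# Custom data held fully in memory: x holds the examples, y the class labels.
x = np.random.rand(1000, 32, 32, 3).astype("float32")  # illustrative images
y = np.random.randint(0, 10, size=1000)                # illustrative labels

# Each batch is sampled so the requested class balance is maintained.
sampler = MultiShotMemorySampler(
    x,
    y,
    classes_per_batch=10,
    examples_per_class_per_batch=4,
)

# The sampler can then be passed to model.fit() like any Keras data source.
```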

Fortunately, we just had a recent PR that adds support for loading examples from disk, see here. You'll need to pass the paths to your examples as the x input, and the load function will then take each path and load the example from disk when constructing the batches.
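A sketch of how that could look. The `load_example_fn` argument name and its signature are assumptions on my part, so check the linked PR for the exact interface; the paths and image size are illustrative:

```python
import numpy as np
import tensorflow as tf
from tensorflow_similarity.samplers import MultiShotMemorySampler

# x holds file paths instead of decoded examples, so only the (small)
# paths need to fit in memory.
x = np.array(["data/cls0/img_001.jpg", "data/cls1/img_002.jpg"])  # illustrative
y = np.array([0, 1])

def load_from_disk(path):
    # Assumed loader signature: called with a path while a batch is being
    # constructed, returns the decoded example.
    img = tf.io.decode_jpeg(tf.io.read_file(path), channels=3)
    return tf.image.resize(img, (224, 224)) / 255.0

sampler = MultiShotMemorySampler(
    x,
    y,
    classes_per_batch=2,
    load_example_fn=load_from_disk,  # assumed parameter name; see the PR
)
```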

Hopefully this helps, but let me know if you run into issues.