google-research / mixmatch

Apache License 2.0

How to use MixMatch with custom dataset? #3

Closed by varunnair18 5 years ago

varunnair18 commented 5 years ago

Do you have plans to make this repository compatible with a custom dataset, and if not, which files would need to be modified to do so?

Additionally, were the tfrecords generated using the standard generate scripts present in the tensorflow library (e.g., generate_cifar10_tfrecords.py from tensorflow/models/blob/master/tutorials/image/cifar10_estimator/)?

Generally speaking, I noticed that the mechanisms for loading the dataset (libml/data_pair.py, libml/data.py), augmenting the dataset, and performing evaluations would have to be changed to accommodate a different dataset.

david-berthelot commented 5 years ago

No, I don't have plans to change this repository; it's designed to reproduce the results from the paper. The reason it is on GitHub, beyond making our results easily reproducible, is so that anyone can fork it, make it better, etc.

Adding datasets should be relatively easy. The datasets are created in https://github.com/google-research/mixmatch/blob/master/scripts/create_datasets.py. The data is stored in a tfrecord file, and the records are tf.Example entries (see the save functions):

image: <byte string of png encoded image>
label: <int64 the label id>
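As a rough illustration of that record layout, here is a minimal sketch of writing (PNG bytes, label) pairs into a tfrecord file. It is not the code from create_datasets.py: the helper names `encode_png_rgb` and `write_tfrecord` are hypothetical, the PNG encoder is a small standard-library stand-in for whatever encoder you prefer, and `tf.io.TFRecordWriter` assumes a reasonably recent TensorFlow. Only the `image` and `label` keys come from the description above.

```python
import struct
import zlib


def _png_chunk(tag: bytes, data: bytes) -> bytes:
    """Frame one PNG chunk: length, type, data, CRC-32 over type + data."""
    crc = zlib.crc32(tag + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + tag + data + struct.pack(">I", crc)


def encode_png_rgb(rows):
    """Encode rows of (r, g, b) tuples (0-255) as an 8-bit RGB PNG byte string."""
    height, width = len(rows), len(rows[0])
    # Each scanline is prefixed with filter type 0 (no filtering).
    raw = b"".join(b"\x00" + bytes(c for px in row for c in px) for row in rows)
    # IHDR: width, height, bit depth 8, color type 2 (RGB), no interlace.
    ihdr = struct.pack(">IIBBBBB", width, height, 8, 2, 0, 0, 0)
    return (b"\x89PNG\r\n\x1a\n"
            + _png_chunk(b"IHDR", ihdr)
            + _png_chunk(b"IDAT", zlib.compress(raw))
            + _png_chunk(b"IEND", b""))


def write_tfrecord(path, samples):
    """Write (png_bytes, label) pairs as tf.Example records (hypothetical helper)."""
    import tensorflow as tf  # imported lazily; only needed for the actual write

    with tf.io.TFRecordWriter(path) as writer:
        for png_bytes, label in samples:
            feature = {
                # image: byte string of the PNG-encoded image
                "image": tf.train.Feature(
                    bytes_list=tf.train.BytesList(value=[png_bytes])),
                # label: int64 label id
                "label": tf.train.Feature(
                    int64_list=tf.train.Int64List(value=[int(label)])),
            }
            example = tf.train.Example(features=tf.train.Features(feature=feature))
            writer.write(example.SerializeToString())
```

For example, `write_tfrecord("mydata-train.tfrecord", [(encode_png_rgb(rows), 3)])` would write a single record with label 3; the file name here is only a placeholder, so check create_datasets.py for the naming the loaders actually expect.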

Then, as you already discovered, just hook the dataset to data.py and data_pair.py. There is nothing really special here except specifying the image size and the number of classes. You can reuse one of the basic augmentations or customize one to your needs.

varunnair18 commented 5 years ago

Thanks for that info! I'll give it a try and let you know if I'm able to successfully apply it to my dataset.

Regarding the fully supervised baselines presented in the paper, are there ways to replicate those as well? If that functionality exists in the repo, I would be interested in using it to find a supervised baseline for my data as well.

david-berthelot commented 5 years ago

Yes, the fully supervised baselines are in the fully_supervised folder. To reproduce the tables in the paper, check the script(s) in fully_supervised/runs.