google-research / simclr

SimCLRv2 - Big Self-Supervised Models are Strong Semi-Supervised Learners
https://arxiv.org/abs/2006.10029
Apache License 2.0
4.06k stars · 622 forks

Custom dataset usage #50

Closed ambekarsameer96 closed 4 years ago

ambekarsameer96 commented 4 years ago

Hi @chentingpc, can you please post instructions/guidelines for using the code on a custom dataset? Any tips on specific usage of the code would also be helpful. Thanks.

chentingpc commented 4 years ago

One option is to use a custom TensorFlow Datasets (tfds) dataset. The other is to replace tfds with your own tf.data pipeline, e.g. replace https://github.com/google-research/simclr/blob/01ddaf0bd692ee945dad7ff5fb07b26df1b9edbe/data.py#L133 with something like the following:

dataset = tf.data.Dataset.list_files(pattern)  # glob matching your TFRecord shards
dataset = dataset.interleave(
    tf.data.TFRecordDataset, cycle_length=num_readers, block_length=1)
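A fuller sketch of such a custom pipeline, under the assumption that your images are stored as TFRecords: the feature names `image/encoded` and `image/class/label` below are hypothetical placeholders, so adjust them to whatever schema your files actually use.

```python
import tensorflow as tf

# Hypothetical feature spec; change the names/types to match your TFRecords.
FEATURE_SPEC = {
    'image/encoded': tf.io.FixedLenFeature([], tf.string),
    'image/class/label': tf.io.FixedLenFeature([], tf.int64),
}


def parse_example(serialized):
  """Decode one serialized tf.train.Example into (image, label)."""
  parsed = tf.io.parse_single_example(serialized, FEATURE_SPEC)
  image = tf.io.decode_jpeg(parsed['image/encoded'], channels=3)
  return image, parsed['image/class/label']


def make_dataset(pattern, num_readers=4):
  """Build a tf.data pipeline over TFRecord shards, read in parallel."""
  dataset = tf.data.Dataset.list_files(pattern)
  dataset = dataset.interleave(
      tf.data.TFRecordDataset, cycle_length=num_readers, block_length=1)
  return dataset.map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)
```

The resulting dataset yields (image, label) pairs, which you would then feed through SimCLR's own preprocessing/augmentation as in data.py.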
ambekarsameer96 commented 4 years ago

Thanks. Can we run this code on Google Colab so that we can use the TPU on Colab?

chentingpc commented 4 years ago

It should be able to run on Colab with some modification. You're welcome to try, and I will link your Colab if you make it work.

There are some Colab examples in the https://github.com/google-research/simclr/tree/master/colabs folder for fine-tuning (not pretraining).

ambekarsameer96 commented 4 years ago

Thanks. I will try it for pre-training.

ambekarsameer96 commented 4 years ago

Hi, I was able to run pretraining on Colab using a TPU for the CIFAR dataset, with model_dir and data_dir pointing to a Google Cloud Storage bucket (output in this link: https://pastebin.com/i1608kRp).

But is there a way to make pretraining work on Colab without using a GCS bucket? I am aware that tpu_estimator expects model_dir to be a GCS path and not a local path. Is there another function we can use as a replacement for tpu_estimator?

One way would be to convert the tpu_estimator code to a Keras model and use keras.Model.fit (inside with strategy.scope():, where strategy = tf.distribute.experimental.TPUStrategy(resolver)). But the pretraining stage doesn't use the usual training process; it uses a custom training process, so I am not sure the tpu_estimator code can be converted to a Keras model.

Please let me know if there is a way to do this that doesn't use storage buckets.
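For reference, the Keras/TPUStrategy idea mentioned above can also be done with a custom training loop via strategy.run, which sidesteps keras.Model.fit entirely. The sketch below is not the repo's actual training code: the model and loss are placeholders, and it falls back to the default strategy when no TPU is reachable. (Note that even with a custom loop, Colab TPU workers generally still need GCS-readable input data.)

```python
import tensorflow as tf

# Pick a strategy: TPUStrategy on a Colab TPU, default strategy elsewhere.
try:
    resolver = tf.distribute.cluster_resolver.TPUClusterResolver()
    tf.config.experimental_connect_to_cluster(resolver)
    tf.tpu.experimental.initialize_tpu_system(resolver)
    strategy = tf.distribute.TPUStrategy(resolver)
except (ValueError, tf.errors.NotFoundError):
    strategy = tf.distribute.get_strategy()  # no TPU: single-device fallback

with strategy.scope():
    # Placeholder model/optimizer; SimCLR's model and contrastive loss
    # would go here instead.
    model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
    optimizer = tf.keras.optimizers.SGD(0.01)


@tf.function
def train_step(dist_inputs):
    def step_fn(inputs):
        x, y = inputs
        with tf.GradientTape() as tape:
            logits = model(x, training=True)
            loss = tf.reduce_mean(
                tf.keras.losses.sparse_categorical_crossentropy(
                    y, logits, from_logits=True))
        grads = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
        return loss

    per_replica_loss = strategy.run(step_fn, args=(dist_inputs,))
    return strategy.reduce(
        tf.distribute.ReduceOp.MEAN, per_replica_loss, axis=None)
```

With a loop like this, checkpoints can be written wherever tf.train.Checkpoint can reach, rather than where TPUEstimator insists model_dir must live.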

chentingpc commented 4 years ago

I am also not sure if there's another way here. Sorry.

ambekarsameer96 commented 4 years ago

Ok. Thank you for your quick response!

tarunn2799 commented 4 years ago

Hey @chentingpc, I'm trying to do the same, and you'd mentioned changes in data.py. But won't we have to make changes in run.py first, since we're loading tfds there? Specifically, here in run.py (line 341):

builder = tfds.builder(FLAGS.dataset, data_dir=FLAGS.data_dir)

chentingpc commented 4 years ago

If you're not using tfds, you can ignore/remove tfds.builder. Otherwise you can simply change the name of the dataset by selecting one from https://www.tensorflow.org/datasets/catalog/overview
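If you do drop tfds.builder, the main things run.py still derives from it are the example counts (for computing train/eval steps), which you can supply yourself. A minimal sketch of that replacement, with a hypothetical file pattern and flag names, counting records directly from your TFRecord shards:

```python
import tensorflow as tf


def count_examples(file_pattern):
  """Count records across TFRecord shards, standing in for
  builder.info.splits['train'].num_examples."""
  files = tf.data.Dataset.list_files(file_pattern, shuffle=False)
  dataset = files.interleave(tf.data.TFRecordDataset, cycle_length=4)
  return int(dataset.reduce(0, lambda count, _: count + 1).numpy())


# The quantities run.py normally reads off the builder, e.g. (hypothetical):
# num_train_examples = count_examples('gs://my-bucket/train-*')
# train_steps = FLAGS.train_epochs * num_train_examples // FLAGS.train_batch_size
```

Counting by scanning the files once is slow for large datasets, so in practice you'd cache the count or hard-code it once known.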

ramdhan1989 commented 3 years ago

> Hi, I was able to run pretraining on Colab using a TPU for the CIFAR dataset, with model_dir and data_dir pointing to a Google Cloud Storage bucket (output in this link: https://pastebin.com/i1608kRp).
>
> But is there a way to make pretraining work on Colab without using a GCS bucket? I am aware that tpu_estimator expects model_dir to be a GCS path and not a local path. Is there another function we can use as a replacement for tpu_estimator?
>
> One way would be to convert the tpu_estimator code to a Keras model and use keras.Model.fit (inside with strategy.scope():, where strategy = tf.distribute.experimental.TPUStrategy(resolver)). But the pretraining stage doesn't use the usual training process; it uses a custom training process, so I am not sure the tpu_estimator code can be converted to a Keras model.
>
> Please let me know if there is a way to do this that doesn't use storage buckets.

Can you share your code? I have not been successful in using a custom dataset.

thanks