
Coltran: Need help on Flow for training and evaluation on custom dataset #896

Open ketan-lambat opened 2 years ago

ketan-lambat commented 2 years ago

I am using this notebook for training on the Cityscapes dataset. Link to Colab Notebook

I trained the model on Colab (until the GPU quota ran out) and got around 23 checkpoints.

I1115 14:34:04.833563 140478266419072 run.py:286] Saved checkpoint to /content/drive/MyDrive/Colab_Work/HONORS/ColTran-v2/google-research/coltran/logs/cityscapes_ckpt/model-20
I1115 14:37:47.372999 140478266419072 run.py:282] Loss: 0.549 bits/dim, Speed: 0.449 steps/second
I1115 14:41:29.732624 140478266419072 run.py:282] Loss: 0.507 bits/dim, Speed: 0.450 steps/second
I1115 14:45:12.382166 140478266419072 run.py:282] Loss: 0.507 bits/dim, Speed: 0.449 steps/second
I1115 14:48:54.532032 140478266419072 run.py:282] Loss: 0.497 bits/dim, Speed: 0.450 steps/second
I1115 14:52:36.791089 140478266419072 run.py:282] Loss: 0.529 bits/dim, Speed: 0.450 steps/second
I1115 14:52:40.168719 140478266419072 run.py:286] Saved checkpoint to /content/drive/MyDrive/Colab_Work/HONORS/ColTran-v2/google-research/coltran/logs/cityscapes_ckpt/model-21
I1115 14:56:22.615087 140478266419072 run.py:282] Loss: 0.530 bits/dim, Speed: 0.450 steps/second
I1115 15:00:05.021047 140478266419072 run.py:282] Loss: 0.522 bits/dim, Speed: 0.450 steps/second
I1115 15:03:47.351305 140478266419072 run.py:282] Loss: 0.489 bits/dim, Speed: 0.450 steps/second
I1115 15:07:29.647610 140478266419072 run.py:282] Loss: 0.528 bits/dim, Speed: 0.450 steps/second
I1115 15:11:11.850597 140478266419072 run.py:282] Loss: 0.550 bits/dim, Speed: 0.450 steps/second
I1115 15:11:15.348102 140478266419072 run.py:286] Saved checkpoint to /content/drive/MyDrive/Colab_Work/HONORS/ColTran-v2/google-research/coltran/logs/cityscapes_ckpt/model-22
I1115 15:14:58.002671 140478266419072 run.py:282] Loss: 0.542 bits/dim, Speed: 0.449 steps/second
I1115 15:18:40.301140 140478266419072 run.py:282] Loss: 0.501 bits/dim, Speed: 0.450 steps/second
I1115 15:22:22.593688 140478266419072 run.py:282] Loss: 0.528 bits/dim, Speed: 0.450 steps/second
I1115 15:26:05.144472 140478266419072 run.py:282] Loss: 0.499 bits/dim, Speed: 0.449 steps/second

Next, after training on a custom dataset, how do I evaluate the model or obtain the colorized/recolorized output images?

I used this command, but I guess it works only for the ImageNet dataset.

python -m coltran.run --config=coltran/configs/colorizer.py --mode=eval_valid --logdir=$LOGDIR --dataset=custom --data_dir=$EVAL_DATA_DIR 

Now I am trying to use this (see the notebook for the next 2 steps):

python -m coltran.custom_colorize --config=coltran/configs/colorizer.py --logdir=$LOGDIR --img_dir=$IMG_DIR --store_dir=$STORE_DIR --mode=$MODE

Can someone please tell me if I am following the correct commands for getting the output? A step-by-step guide would be appreciated. I am getting confused about which flow to follow.

Also, the paper mentions 3 different coloured outputs for one input B&W image. How do I get such results?

gezhaoDL commented 2 years ago

For FID, you can refer to this repo: https://github.com/bioinf-jku/TTUR

levtelyatnikov commented 2 years ago

Hello, have you found a solution to your questions? Can you share the notebook for the custom dataset?

ketan-lambat commented 2 years ago

Hello, have you found a solution to your questions? Can you share the notebook for the custom dataset?

I have provided a link to the notebook above. Posting again: Link to Colab Notebook

MechCoder commented 2 years ago

Also, the paper mentions 3 different coloured outputs for one input B&W image. How do I get such results?

You can just run the sampling 3 times, and it will give you 3 different results.

MechCoder commented 2 years ago

The sampling is stochastic by default, so each run should give you different results. See https://github.com/google-research/google-research/blob/master/coltran/models/colorizer.py#L282
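
In code, the idea is roughly this (illustrative only; the real entry point is coltran/custom_colorize.py, and colorizer and gray_64 here stand in for a built colorizer model and a batch of grayscale conditioning inputs):

# Illustrative sketch: draw 3 stochastic colorizations of the same input.
# With mode='sample', per-pixel colors are drawn from the predicted
# distribution instead of taking the argmax, so every call differs.
samples = []
for _ in range(3):
  out = colorizer.sample(gray_cond=gray_64, mode='sample')
  samples.append(out)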

MechCoder commented 2 years ago

How long do I need to train on a custom dataset?

The longer you train, the better the results will be. I would use the maximum batch size that fits in memory and train for around 500K steps. There should be a train_summaries subdirectory in /content/drive/MyDrive/Colab_Work/HONORS/ColTran-v2/google-research/coltran/logs/cityscapes_ckpt. As a sanity check, you could point TensorBoard at this directory to see whether the train loss goes down.
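
In Colab, for example (a sketch; %tensorboard comes from the TensorBoard notebook extension, and the path is your logdir):

%load_ext tensorboard
%tensorboard --logdir /content/drive/MyDrive/Colab_Work/HONORS/ColTran-v2/google-research/coltran/logs/cityscapes_ckpt/train_summaries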

How to get the result/output for a custom dataset?

For a custom dataset, as long as the images in your dataset directory are in a format supported by tf.io.decode_image (https://www.tensorflow.org/api_docs/python/tf/io/decode_image), it should work.
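
If you want a quick sanity check up front, something like this (a sketch; img_dir is a placeholder for your dataset directory) flags any file that does not decode:

import glob
import os

import tensorflow as tf

img_dir = '/path/to/your/images'  # placeholder
for path in sorted(glob.glob(os.path.join(img_dir, '*'))):
  try:
    # decode_image handles BMP, GIF, JPEG and PNG.
    image = tf.io.decode_image(tf.io.read_file(path), channels=3)
    print(path, image.shape)
  except tf.errors.InvalidArgumentError as e:
    print('Cannot decode:', path, e)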

till GPU resource got exhausted

Btw, the GPU should not OOM during training. That seems a bit weird.

MechCoder commented 2 years ago

Here is the FID script that I used for ImageNet. Hope you can adapt it to your dataset. I used the TFGAN implementation. (https://github.com/tensorflow/gan/blob/99bb93042520040dac401237616c10e54ab80a9f/tensorflow_gan/python/eval/inception_metrics.py#L130)

# Note: this snippet assumes its original surroundings: `datasets` is
# coltran.datasets, and FLAGS, config, num_epochs and skip_samples are
# defined elsewhere in the script.
import tensorflow.compat.v1 as tf
import tensorflow_gan as tfgan
from absl import logging

tf.disable_v2_behavior()  # The code below is graph/session based.

def normalize(x):
  # Inception checkpoints expect inputs in [-1, 1].
  # https://codesearch.corp.google.com/piper///depot/google3/third_party/py/tensorflow_gan/examples/cifar/eval_lib.py?dr=CSs&g=0&l=34.
  # https://codesearch.corp.google.com/piper///depot/google3/third_party/py/tensorflow_gan/examples/cifar/data_provider.py?dr=CSs&g=0&l=40
  x = tf.squeeze(x['image'], axis=0)
  logging.info(x.shape)
  x = tf.cast(x, tf.float32)
  # Normalize from [0, 255] to [-1.0, 1.0].
  x = (x / 128.0) - 1.0
  return x

# Real dataset: skipping ahead keeps this stream disjoint from the
# "generated" stream below, which reads the head of the same split.
real_dataset = datasets.get_dataset(
    name=FLAGS.dataset, subset='test', config=config, batch_size=1)
real_dataset = real_dataset.map(normalize, num_parallel_calls=100)
real_dataset = real_dataset.skip(FLAGS.samples)
real_dataset = real_dataset.batch(batch_size=FLAGS.batch_size)
real_dataset = real_dataset.skip(skip_samples // FLAGS.batch_size)
real_iterator = tf.compat.v1.data.make_initializable_iterator(real_dataset)
real_dataset = real_iterator.get_next()

# "Generated" dataset: here just the head of the same test split, since this
# snippet computes a baseline FID between two sets of ground-truth images.
gen_dataset = datasets.get_dataset(
    name=FLAGS.dataset, subset='test', config=config, batch_size=1)
gen_dataset = gen_dataset.map(normalize, num_parallel_calls=100)
gen_dataset = gen_dataset.batch(batch_size=FLAGS.batch_size)
gen_iterator = tf.compat.v1.data.make_initializable_iterator(gen_dataset)
gen_dataset = gen_iterator.get_next()

fid_stream = tfgan.eval.frechet_inception_distance_streaming
distance, update_op = fid_stream(real_dataset, gen_dataset)
logging.info(distance)
logging.info(update_op)

batch_size = FLAGS.batch_size
with tf.Session() as sess:
  init_ops = ([real_iterator.initializer, gen_iterator.initializer,
               tf.local_variables_initializer()])
  sess.run(init_ops)

  for epoch in range(1, num_epochs + 1):
    sess.run(update_op)

    if epoch % 10 == 0:
      dist_np = sess.run(distance)
      fid_str = f'Number of samples: {epoch * batch_size}, fid: {dist_np}'
      logging.info(fid_str)
  distance_np = sess.run(distance)
  logging.info(distance_np)
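
(For context: frechet_inception_distance_streaming follows the standard TF streaming-metric pattern. It returns a (value, update_op) pair; each sess.run(update_op) feeds one more batch through Inception and accumulates activation statistics, and sess.run(distance) evaluates the FID over everything seen so far. That is why the loop above can log intermediate FID estimates every 10 iterations.)
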
ketan-lambat commented 2 years ago

@MechCoder Thanks a lot for this. I am about halfway through adapting it for my dataset, but I am running into errors related to TF versions. Hopefully I should be able to resolve these with some more effort.

Sorry to bother you with some more questions.

real_dataset = datasets.get_dataset( name=FLAGS.dataset, subset='test', config=config, batch_size=1)

What is the config file that is passed as input to this and to gen_dataset as well?

Also, is it okay to use these FID implementations? https://github.com/mseitzer/pytorch-fid https://github.com/toshas/torch-fidelity I have used them to get results for some other models, and they were pretty straightforward to use.
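
For reference, this is roughly how I used pytorch-fid (a sketch; the directories are placeholders and the exact API may vary by version):

from pytorch_fid.fid_score import calculate_fid_given_paths

# Placeholder directories of ground-truth and generated images.
fid = calculate_fid_given_paths(
    ['/path/to/ground_truth', '/path/to/generated'],
    batch_size=50, device='cuda', dims=2048)
print('FID:', fid)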

MechCoder commented 2 years ago

No problem, happy to help.

Also, Is it okay to use these FID implementations?

I think it should be okay as long as you apply the same type of cropping to the images you generate and to the images you evaluate against. We use central cropping to convert the high-res images into 256x256 (https://github.com/google-research/google-research/blob/master/coltran/datasets.py#L37).
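
Something along these lines, as a sketch (not the exact code in coltran/datasets.py; see the link above for the real preprocessing):

import tensorflow as tf

def central_crop_and_resize(image, resolution=256):
  """Centrally crops the largest square, then resizes to the target size."""
  shape = tf.shape(image)
  height, width = shape[0], shape[1]
  side = tf.minimum(height, width)
  top = (height - side) // 2
  left = (width - side) // 2
  image = tf.image.crop_to_bounding_box(image, top, left, side, side)
  return tf.image.resize(image, [resolution, resolution])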

What is the config file that is passed as input to this and to gen_dataset as well?

The config in my script is just config = {'resolution': [FLAGS.resolution, FLAGS.resolution]}. The above code snippet computes the baseline FID between two sets of ground-truth images.
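
Spelled out, that is just (illustrative; the flag is whatever your own script defines):

from absl import flags

flags.DEFINE_integer('resolution', 256, 'Resolution to evaluate at.')
FLAGS = flags.FLAGS

config = {'resolution': [FLAGS.resolution, FLAGS.resolution]}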

ketan-lambat commented 2 years ago

I custom-trained the 3 models (Colorizer, Color Upsampler, and Spatial Upsampler) on a custom dataset, then used the custom_colorize script to get the results. The first 2 stages went smoothly, but I got the following error at the 3rd (Spatial Upsampler) step.

I used this command

!python -m coltran.custom_colorize --config=coltran/configs/spatial_upsampler.py \
--logdir=$SPATIAL_UPSMPLR_LOGDIR --img_dir=$IMG_DIR --store_dir=$STORE_DIR \
--gen_data_dir=$STORE_DIR/stage2 --mode=$MODE

I am using Google Colab. Is this because of Colab's GPU limits? Any help on how to solve this issue would be appreciated.

2022-02-16 08:20:45.489115: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:39] Overriding allow_growth setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
WARNING:tensorflow:From /usr/local/lib/python3.7/dist-packages/tensorflow/python/training/moving_averages.py:548: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
W0216 08:20:51.389240 140616530913152 deprecation.py:343] From /usr/local/lib/python3.7/dist-packages/tensorflow/python/training/moving_averages.py:548: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
I0216 08:20:52.797949 140616530913152 train_utils.py:91] Built with exponential moving average.
I0216 08:20:52.813167 140616530913152 train_utils.py:185] Restoring from /content/drive/MyDrive/Colab_Work/HONORS/coltran-v3/coltran-cityscapes-v2-finetune-3/google-research/coltran/logs/cityscapes_ft_spatial_upsampler.
I0216 08:20:56.561913 140616530913152 custom_colorize.py:207] Producing sample after 37600 training steps.
I0216 08:20:56.562508 140616530913152 custom_colorize.py:210] 100
2022-02-16 08:21:08.242997: W tensorflow/core/common_runtime/bfc_allocator.cc:462] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.25GiB (rounded to 1342177280)requested by op Softmax
If the cause is memory fragmentation maybe the environment variable 'TF_GPU_ALLOCATOR=cuda_malloc_async' will improve the situation. 
Current allocation summary follows.
2022-02-16 08:21:08.243377: W tensorflow/core/common_runtime/bfc_allocator.cc:474] *__**_****_*******________********************************************_______*************__________
2022-02-16 08:21:08.246940: W tensorflow/core/framework/op_kernel.cc:1745] OP_REQUIRES failed at softmax_op_gpu.cu.cc:219 : RESOURCE_EXHAUSTED: OOM when allocating tensor with shape[5,256,256,4,256] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
Traceback (most recent call last):
  File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/content/drive/MyDrive/Colab_Work/HONORS/coltran-v3/coltran-cityscapes-v2-finetune-3/google-research/coltran/custom_colorize.py", line 244, in <module>
    app.run(main)
  File "/usr/local/lib/python3.7/dist-packages/absl/app.py", line 300, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.7/dist-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "/content/drive/MyDrive/Colab_Work/HONORS/coltran-v3/coltran-cityscapes-v2-finetune-3/google-research/coltran/custom_colorize.py", line 227, in main
    out = model.sample(gray_cond=gray, inputs=prev_gen, mode='argmax')
  File "/content/drive/MyDrive/Colab_Work/HONORS/coltran-v3/coltran-cityscapes-v2-finetune-3/google-research/coltran/models/upsampler.py", line 254, in sample
    logits = self.upsampler(inputs, gray_cond, training=False)
  File "/content/drive/MyDrive/Colab_Work/HONORS/coltran-v3/coltran-cityscapes-v2-finetune-3/google-research/coltran/models/upsampler.py", line 245, in upsampler
    context = self.encoder(channel, training=training)
  File "/usr/local/lib/python3.7/dist-packages/keras/utils/traceback_utils.py", line 67, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/content/drive/MyDrive/Colab_Work/HONORS/coltran-v3/coltran-cityscapes-v2-finetune-3/google-research/coltran/models/layers.py", line 668, in call
    output = layer(inputs)
  File "/content/drive/MyDrive/Colab_Work/HONORS/coltran-v3/coltran-cityscapes-v2-finetune-3/google-research/coltran/models/layers.py", line 611, in call
    weights = tf.nn.softmax(alphas)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: Exception encountered when calling layer "self_attention_nd" (type SelfAttentionND).

OOM when allocating tensor with shape[5,256,256,4,256] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc [Op:Softmax]

Call arguments received:
  • inputs=tf.Tensor(shape=(5, 256, 256, 512), dtype=float32)
CPU times: user 294 ms, sys: 59.3 ms, total: 353 ms
Wall time: 47.6 s
MechCoder commented 2 years ago

Try setting batch_size=1 in the spatial upsampler config?

ketan-lambat commented 2 years ago

This is my coltran/configs/spatial_upsampler.py. It seems the batch_size is already 1.

# coding=utf-8
# Copyright 2021 The Google Research Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""Test configurations for color upsampler."""
from ml_collections import ConfigDict

def get_config():
  """Experiment configuration."""
  config = ConfigDict()

  # Data.
  config.dataset = 'imagenet'
  config.downsample = True
  config.downsample_res = 64
  config.resolution = [256, 256]
  config.random_channel = True

  # Training.
  config.batch_size = 1
  config.max_train_steps = 300000
  config.save_checkpoint_secs = 900
  config.num_epochs = -1
  config.polyak_decay = 0.999
  config.eval_num_examples = 20000
  config.eval_batch_size = 16
  config.eval_checkpoint_wait_secs = -1

  config.optimizer = ConfigDict()
  config.optimizer.type = 'rmsprop'
  config.optimizer.learning_rate = 3e-4

  # Model.
  config.model = ConfigDict()
  config.model.hidden_size = 512
  config.model.ff_size = 512
  config.model.num_heads = 4
  config.model.num_encoder_layers = 3
  config.model.resolution = [64, 64]
  config.model.name = 'spatial_upsampler'

  config.sample = ConfigDict()
  config.sample.gen_data_dir = ''
  config.sample.log_dir = 'samples_sweep'
  config.sample.batch_size = 1
  config.sample.mode = 'argmax'
  config.sample.num_samples = 1
  config.sample.num_outputs = 1
  config.sample.skip_batches = 0
  config.sample.gen_file = 'gen0'

  return config
MechCoder commented 2 years ago

Does this comment fix your issue? (https://github.com/google-research/google-research/issues/838#issuecomment-930699980)

ketan-lambat commented 2 years ago

Yes, thanks. I feel stupid for not checking that before. 🤦‍♂️😂

ketan-lambat commented 2 years ago

ImageNet FID scores not matching the paper

What I did:

  1. Used the pretrained checkpoints provided.
  2. Fine-tuned them on a custom ImageNet dataset (a smaller version with 1000 training images) for around 10 epochs, for all 3 stages, using the coltran.run script.
  3. Generated output using the coltran.custom_colorize script.
  4. Calculated FID using the pytorch-fid package.

I got an FID score of around 59, while the FID score mentioned in the paper is around 19.

While calculating FID, both the ground-truth and generated images (436 of them) are at resolution 256x256.

GroundTruth images: [image]

Generated images: [image]