tensorflow / benchmarks

A benchmark framework for Tensorflow

Resnet56 with CIFAR-10 produces huge log file #128

Open tjingrant opened 6 years ago

tjingrant commented 6 years ago

Hi,

I'm trying to train resnet56 on CIFAR-10 with the following parameters. However, each time I start a run, it creates a log file of 1.2GB or 2.4GB. I have to restart this training frequently, so the logs quickly grow unmanageable. In contrast, each log file Resnet50 creates on the ImageNet dataset is around 30-40MB, which is much more manageable...

Also, when I view them with TensorBoard, I can only see the projector, not scalars like the loss.

Do you have any idea what's going on?

        --model=resnet56 \
        --batch_size=128 \
        --num_epochs=$i \
        --num_gpus=1 \
        --data_dir=/mnt/nfs/cifar-10/cifar-10-batches-py \
        --data_name=cifar10 \
        --variable_update="replicated" \
        --train_dir=resnet56/ \
        --all_reduce_spec=nccl \
        --print_training_accuracy=True \
        --optimizer="momentum" \
        --piecewise_learning_rate_schedule="0.1;250;0.01;375;0.001" \
        --momentum=0.9 \
        --weight_decay=0.0001 \
        --summary_verbosity=1 \
        --save_summaries_steps=200 \
        --save_model_secs=600 \
tjingrant commented 6 years ago

As you can see, the verbosity is set to 1, which is the same value I used when training Resnet50 on ImageNet.

reedwm commented 6 years ago

I believe the issue is that in preprocessing.py, we have a tf.constant node holding all the images, which takes up more than a gigabyte of space in the graph.pbtxt that's written out to disk.

This is not a priority for us, so I'm marking it as contributions welcome if anyone wants to work on it. To solve this, we should store the images in a variable and initialize that variable with the images without storing them in the graph. One way of doing this is with a feed dict; another is with tf.data.
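
A minimal sketch of the feed-based tf.data route, assuming TF 1.x; all_images stands in for the array built in preprocessing.py, and the other names are illustrative:

    import numpy as np
    import tensorflow as tf

    # Stand-in for the CIFAR-10 array assembled in preprocessing.py.
    all_images = np.zeros((50000, 32, 32, 3), dtype=np.float32)

    # Defining the dataset over a placeholder keeps the pixel data out of
    # the GraphDef; it is fed once when the iterator is initialized.
    images_ph = tf.placeholder(all_images.dtype, all_images.shape)
    dataset = tf.data.Dataset.from_tensor_slices(images_ph).repeat().batch(128)
    iterator = dataset.make_initializable_iterator()
    next_batch = iterator.get_next()

    with tf.Session() as sess:
        sess.run(iterator.initializer, feed_dict={images_ph: all_images})
        batch = sess.run(next_batch)  # shape (128, 32, 32, 3)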

tjingrant commented 6 years ago

@reedwm thanks, I could look into this one. So is the graph definition also written to the tfevents logs?

My concern is whether this solves the huge tfevents file problem...

reedwm commented 6 years ago

You're right, the events.out.tfevents file is also huge. I'm not sure whether the graph definition is written to it. The issue goes away if I omit --data_dir (which makes the benchmark run on synthetic data).
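
One way to check which records dominate the events file is to walk it with tf.train.summary_iterator (a sketch, assuming TF 1.x; the path is illustrative):

    import tensorflow as tf

    # Substitute the actual events file written under --train_dir.
    path = "resnet56/events.out.tfevents.example"

    for event in tf.train.summary_iterator(path):
        if event.ByteSize() > 1 << 20:  # records larger than 1 MiB
            # The oneof field name reveals what the record holds,
            # e.g. 'graph_def' or 'summary'.
            print(event.WhichOneof("what"), event.ByteSize())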

Also, in TensorBoard, I was able to see the scalars by clicking "Inactive", then "Scalars."

tjingrant commented 6 years ago

@jsimsa can you shed some light on this situation? Your name appears in the Cifar10ImagePreprocessor class...

jsimsa commented 6 years ago

Doing what @reedwm suggests makes sense.

tjingrant commented 6 years ago

@jsimsa, I wonder if doing what @reedwm suggests would also solve the huge tfevents file problem?

reedwm commented 6 years ago

I have confirmed that the tf.constant node is also what makes the tfevents file large: after changing the line all_images = tf.constant(all_images) to all_images = tf.constant(all_images[:200, ...]) in preprocessing.py, the tfevents file was only a few megabytes.

IMO, changing the constant to a variable and initializing it using a feed dict is the easiest way to solve this problem.
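
A sketch of that pattern, assuming TF 1.x; the names are illustrative stand-ins for what preprocessing.py would use in place of the tf.constant call:

    import numpy as np
    import tensorflow as tf

    # Stand-in for the full CIFAR-10 array.
    all_images = np.zeros((50000, 32, 32, 3), dtype=np.float32)

    # The placeholder carries the pixels only at initialization time, so
    # the GraphDef records the variable's shape instead of ~1GB of data.
    images_ph = tf.placeholder(all_images.dtype, all_images.shape)
    images_var = tf.Variable(images_ph, trainable=False, collections=[])

    with tf.Session() as sess:
        # collections=[] keeps the variable out of GLOBAL_VARIABLES, so it
        # must be initialized explicitly with the feed.
        sess.run(images_var.initializer, feed_dict={images_ph: all_images})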

tjingrant commented 6 years ago

@reedwm thanks a lot for looking into this issue, I'll see what I can do.

tfboyd commented 6 years ago

@tjingrant If you are only using one GPU, I suggest checking out the official ResNet CIFAR-10 example. It is better maintained and uses more standard TensorFlow concepts.

https://github.com/tensorflow/models/tree/master/official/resnet

There will be a multi-GPU example using Estimator very soon. You are welcome to use the benchmark example, but I want you to know there is another example in the model garden that may be easier to follow and more fun to use.