Open tjingrant opened 6 years ago
As you can see the verbosity is set to 1, it's the same value when I trained Resnet50 on ImageNet.
I believe the issue is that in preprocessing.py we have a tf.constant node with all the images, which takes up more than a gigabyte of space in the graph.pbtxt that's written out to disk.
This is not a priority for us, so I'm marking it as contributions welcome if anyone wants to work on it. To solve this, we should store the images in a variable and initialize that variable without storing the images in the graph. One way of doing this is using a feed dict; another is using tf.data.
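A minimal sketch of the tf.data route (TF1-style Session API, which this repo used at the time; the shapes and random stand-in arrays are illustrative, not the actual CIFAR-10 loading code). Feeding the arrays through placeholders at iterator-initialization time keeps them out of the serialized GraphDef, whereas passing NumPy arrays straight to `from_tensor_slices` would embed them as constants:

```python
import numpy as np
import tensorflow.compat.v1 as tf

tf.disable_eager_execution()

# Stand-in for the CIFAR-10 arrays (illustrative shapes only).
all_images = np.random.rand(200, 32, 32, 3).astype(np.float32)
all_labels = np.random.randint(0, 10, size=200).astype(np.int64)

# Placeholders keep the big arrays out of the GraphDef; the data is
# supplied via feed_dict when the iterator is initialized.
images_ph = tf.placeholder(tf.float32, all_images.shape)
labels_ph = tf.placeholder(tf.int64, all_labels.shape)

dataset = (tf.data.Dataset.from_tensor_slices((images_ph, labels_ph))
           .shuffle(200)
           .batch(32))
iterator = tf.data.make_initializable_iterator(dataset)
next_batch = iterator.get_next()

with tf.Session() as sess:
    sess.run(iterator.initializer,
             feed_dict={images_ph: all_images, labels_ph: all_labels})
    batch_images, batch_labels = sess.run(next_batch)
    print(batch_images.shape)
```

With this pattern the graph only contains the placeholder and dataset ops, so graph.pbtxt and the tfevents files stay small regardless of dataset size.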
@reedwm thanks, I could look into this one. So the graph definition is also written to the tfevents logs? My concern is whether this solves the huge tfevents file problem...
You're right, the events.out.tfevents file is also huge. I'm not sure whether the graph definition is written to it. The issue goes away if I omit --data_dir.
Also, in TensorBoard, I was able to see the scalar by clicking "Inactive", then "Scalars."
@jsimsa can you shed some light on this situation? Your name appears in the Cifar10ImagePreprocessor class...
Doing what @reedwm suggests makes sense.
@jsimsa, I wonder if doing what @reedwm suggests could also solve the huge tfevents file problem?
I have confirmed the tf.constant node is also what causes the large tfevents file: changing the line all_images = tf.constant(all_images) to all_images = tf.constant(all_images[:200, ...]) in preprocessing.py shrinks the tfevents file to only a few megabytes.
IMO, changing the constant to a variable and initializing it using a feed dict is the easiest way to solve this problem.
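The constant-to-variable approach can be sketched as follows (TF1-style, with an illustrative stand-in array; variable and placeholder names are made up for the example). The variable is created with a cheap zeros initializer and the real data is assigned via feed_dict, so the image bytes never enter the GraphDef:

```python
import numpy as np
import tensorflow.compat.v1 as tf

tf.disable_eager_execution()

# Stand-in for the CIFAR-10 image array (illustrative shape only).
all_images = np.random.rand(200, 32, 32, 3).astype(np.float32)

# A placeholder plus a non-trainable variable: the array is fed in at
# run time instead of being baked into the graph as a constant.
images_ph = tf.placeholder(tf.float32, shape=all_images.shape)
images_var = tf.get_variable(
    'all_images', shape=all_images.shape, dtype=tf.float32,
    trainable=False, initializer=tf.zeros_initializer())
assign_images = images_var.assign(images_ph)

with tf.Session() as sess:
    sess.run(assign_images, feed_dict={images_ph: all_images})
    # The serialized graph stays small because it holds no image data.
    graph_bytes = tf.get_default_graph().as_graph_def().ByteSize()
    print(graph_bytes)
```

Since the GraphDef no longer embeds the dataset, both graph.pbtxt and the tfevents file (which serializes the graph for TensorBoard) should shrink to a few megabytes.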
@reedwm thanks a lot for looking into this issue, I'll see what I can do.
@tjingrant If you are only using one GPU, I suggest checking out the official ResNet CIFAR-10 example. It is better maintained and uses more standard TensorFlow concepts.
https://github.com/tensorflow/models/tree/master/official/resnet
There will be a multi-GPU example using Estimator very soon. You are welcome to use the benchmark example, but I want you to know there is another example in the model garden that may be easier to follow and more fun to use.
Hi,
I'm trying to train resnet56 on CIFAR-10 with the following params. However, each time I start a run, it creates a log file of 1.2 GB or 2.4 GB. I have to restart this training constantly, so it quickly grows unmanageable. In contrast, each log file Resnet50 creates on the ImageNet dataset is around 30-40 MB, which is much more manageable...
Also, when I view the logs in TensorBoard, I can only see the projector, not scalars like loss.
Do you have any idea what's going on?