tensorflow / tensorboard

TensorFlow's Visualization Toolkit
Apache License 2.0

event files too large when TF graph contains constant data #1252

Open martin-gorner opened 6 years ago

martin-gorner commented 6 years ago

TB: 1.8.0; OS: any; Python version: any

The TensorBoard event file is abnormally large when the TF graph contains data defined with tf.constant. For example, for MNIST the event file will be almost 1 GB if the MNIST dataset is loaded with tf.constant, even though the dataset itself is only 50 MB uncompressed.

Storing this data in the event files is never useful. In the TensorBoard interface, only small constants are visible, and there are no plans to display large constant arrays, as that would not be a useful feature. Large constant arrays should be dramatically truncated when stored in TensorBoard event files, or not stored at all.
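An illustrative repro of the pattern (stand-in array, not the exact tutorial code): an MNIST-sized constant ends up byte-for-byte inside the serialized GraphDef, which the FileWriter then copies into the events file.

```python
import numpy as np
import tensorflow as tf

# Stand-in for the MNIST images: ~170 MB of float32 data.
images = np.random.rand(55000, 784).astype(np.float32)

# The entire array is embedded in the GraphDef as a Const node.
images_const = tf.constant(images)

writer = tf.summary.FileWriter("checkpoints", tf.get_default_graph())
writer.close()  # the events file now contains the full constant
```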

nfelt commented 6 years ago

@martin-gorner Could you provide a sample event file or ideally sample code that produces the event file with the large graph constants?

martin-gorner commented 6 years ago

Sure, here you go. (I changed my MNIST sample to load the data from file properly, but here is an old version that works with TF 1.8 and shows the problem):

```
git clone https://github.com/martin-gorner/tensorflow-mnist-tutorial.git
cd tensorflow-mnist-tutorial
git reset --hard a443df25ef779c54c6c03397dcda9cae0ac8f2c2
cd mlengine/
python3 trainer/task.py
```

A 700MB events file will appear in a directory called "checkpoints" a couple of seconds later.

*after "git reset" you should see the following message: "HEAD is now at a443df2 added Datasets API" **python3 but the code works with python 2 as well

martin-gorner commented 6 years ago

And if you reset to a later version that loads the data from file progressively, the event file becomes less than 1 MB:

```
git pull
git reset --hard 9af70c1dda1184e1803498c2a4b882577cfe2528
python3 trainer/task.py
```

A 1MB events file will appear in a directory called "checkpoints" a couple of seconds later.

nfelt commented 6 years ago

Thanks for the examples, this was very helpful and I can reproduce the issue. In this case it looks like both the GraphDef and MetaGraphDef are about 375 MB each. I expect (but have not confirmed) that the 50MB of uncompressed MNIST data expands to 375MB due to padding or other overhead in the representation.

FWIW, this is somewhat orthogonal to the large-constant issue, but I observed that it gets exacerbated as training continues: in addition to the GraphDef/MetaGraphDef being logged when the FileWriter is opened, CheckpointSaverHook logs them every time a new session is created. The local run option for Estimator.train_and_evaluate() apparently was creating sessions for new checkpoints as often as once per second, so extra GraphDef/MetaGraphDef copies were being appended to the same events file needlessly. Luckily, it looks like this should be fixed by cl/201085814 once that gets pushed out (it's not yet in tf-nightly, so I haven't tested it yet).
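(To check how much of a given events file is graph data, you can tally the graph fields with tf.train.summary_iterator; a sketch, with a placeholder path:)

```python
import tensorflow as tf

graph_bytes = meta_bytes = num_events = 0
# Placeholder path: point this at one of the oversized events files.
for event in tf.train.summary_iterator("checkpoints/events.out.tfevents.EXAMPLE"):
    num_events += 1
    graph_bytes += len(event.graph_def)       # serialized GraphDef, if present
    meta_bytes += len(event.meta_graph_def)   # serialized MetaGraphDef, if present

print("events: {}, GraphDef bytes: {}, MetaGraphDef bytes: {}".format(
    num_events, graph_bytes, meta_bytes))
```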

jmoraleda commented 6 years ago

I am not saving any images, only a few scalar values, and I am still getting huge tfevent files (on the order of hundreds of gigabytes per file).

I am using a canned DNNClassifier estimator in version 1.8 invoked using the train_and_evaluate function.

I gather from the above comments that training data is saved to event files, so I suspect this has something to do with it.

I have narrowed the problem down to the point where the only thing I am doing differently from the tutorials is in my input_fn: I am using from_tensor_slices, since my current dataset is very small (300,000 points of dimension 200), and I thought randomizing it all at once every time input_fn is invoked (every few epochs) would save time.

This is my actual input_fn, which is returned from a member function of a dataset factory class. It is invoked with start and end equal to 0 and 0.8 for training and 0.8 and 0.9 for evaluation, so the first 80% of the arrays are used for training and the next 10% for evaluation (saving the final 10% for final testing). self.inputData is a dict with a single entry mapping featureName to a numpy array of shape (300000, 200), and self.zeroIndexedLabels is a numpy array of shape (300000, 1).

```python
def actual_input_fn():
    numData = len(self.zeroIndexedLabels)
    r = np.arange(int(start * numData), int(end * numData))

    if mode == tf.estimator.ModeKeys.TRAIN:
        LOGGER.info("Randomizing dataset with state {} ...".format(np.random.get_state()[1][0:9]))
        np.random.shuffle(r)

    # Slicing the full arrays here means from_tensor_slices embeds them
    # in the graph as constants.
    tensorSlices = ({n: v[r, :] for n, v in self.inputData.items()},
                    self.zeroIndexedLabels[r, :])

    dataset = tf.data.Dataset.from_tensor_slices(tensorSlices)

    if mode == tf.estimator.ModeKeys.TRAIN:
        # Note: repeat() returns a new dataset; the original code discarded the result.
        dataset = dataset.repeat()

    dataset = dataset.batch(batchSize)

    return dataset
```

Everything works fine with the above except for the enormous tfevent files, so I am wondering if this is a bug in tensorflow or if I am doing something conceptually wrong.

I would be happy to provide one or more of these large event files, if it would help.

nfelt commented 6 years ago

@jmoraleda Yes, I think you're also affected by this issue. Using tf.data.Dataset.from_tensor_slices() results in the input data being stored as a constant in your graph (which is then written to the events file), and as the docs describe, this works best for datasets that are quite small. Even though your dataset is modest by some standards, 300,000 points of dimension 200, assuming these are int32 values, comes to about 250 MB, and in practice could be several GB due to padding.
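(The tf.data guide's suggested alternative for large in-memory arrays is to define the dataset in terms of tf.placeholder() tensors and feed the arrays when initializing the iterator, so they never become graph constants; a sketch with stand-in data:)

```python
import numpy as np
import tensorflow as tf

# Stand-in arrays matching the shapes described above.
features = np.random.rand(300000, 200).astype(np.float32)
labels = np.random.randint(0, 10, size=(300000, 1))

features_ph = tf.placeholder(features.dtype, features.shape)
labels_ph = tf.placeholder(labels.dtype, labels.shape)

# The graph contains only placeholders; the big arrays are fed at
# iterator-initialization time instead of being baked in as constants.
dataset = tf.data.Dataset.from_tensor_slices((features_ph, labels_ph))
dataset = dataset.shuffle(buffer_size=300000).repeat().batch(128)

iterator = dataset.make_initializable_iterator()
next_batch = iterator.get_next()

with tf.Session() as sess:
    sess.run(iterator.initializer,
             feed_dict={features_ph: features, labels_ph: labels})
    batch = sess.run(next_batch)
```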

Are you running train_and_evaluate() locally? If so, then as mentioned above, that logic exacerbates the issue by appending the graph to the events file once per checkpoint and checkpointing very aggressively by default. This issue should be addressed in TF 1.10 when it comes out, and is fixed in current tf-nightly, so switching to tf-nightly should reduce your event files to a few GB or less.
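(If upgrading isn't an option, checkpointing less aggressively also slows the growth, since the graph is appended once per checkpoint; a sketch where the estimator arguments are illustrative:)

```python
import tensorflow as tf

# Checkpoint every 30 minutes and summarize every 500 steps instead of the
# aggressive local defaults, so graph copies are appended less often.
config = tf.estimator.RunConfig(
    save_checkpoints_secs=1800,
    save_summary_steps=500)

# feature_columns is assumed to be defined elsewhere; the architecture is illustrative.
estimator = tf.estimator.DNNClassifier(
    hidden_units=[256, 128],
    feature_columns=feature_columns,
    n_classes=10,
    config=config)
```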

As described in the original comment above, truncating large constants when writing the graph to the event file would further reduce the file size, and that's something we'll look into.

jmoraleda commented 6 years ago

@nfelt Thank you! Following your pointer and suggestion I ended up refactoring my code to use tf.estimator.inputs.numpy_input_fn instead of directly invoking from_tensor_slices. (I also explored writing my own implementation to use placeholders and then feed the arrays into tensors one batch at a time, but I felt I was just writing a basic implementation of the numpy_input_fn code.)
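For reference, a minimal sketch of that numpy_input_fn setup, with stand-in arrays and illustrative arguments (my actual code differs):

```python
import numpy as np
import tensorflow as tf

# Stand-in arrays; the real ones come from the dataset factory described above.
features = {"featureName": np.random.rand(300000, 200).astype(np.float32)}
labels = np.random.randint(0, 10, size=(300000, 1)).astype(np.int32)

# numpy_input_fn feeds the arrays batch-by-batch through placeholders,
# so nothing large is embedded in the graph or the events file.
train_input_fn = tf.estimator.inputs.numpy_input_fn(
    x=features,
    y=labels,
    batch_size=128,
    num_epochs=None,   # repeat indefinitely for training
    shuffle=True)
```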

This completely solved the issue with the size of tfevent files, plus my code is noticeably faster (presumably because there is a lot less copying going on, which I did not realize was happening before).

Thank you again.

SystemErrorWang commented 5 years ago

I have the same issue without calling the tf.data.Dataset.from_tensor_slices() API. I guess it may be because I loaded a pretrained VGG-19 model. I would like to know if there's any way to reduce the file size.
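(One way to see which constants are responsible for the bloat is to sort the Const nodes in the GraphDef by serialized size; a sketch, where the graph-loading step is illustrative:)

```python
import tensorflow as tf

# Illustrative: build or import the suspect graph into the default graph first,
# e.g. by constructing the model or calling tf.import_graph_def(...).
graph_def = tf.get_default_graph().as_graph_def()

# Rank Const nodes by how many bytes they occupy in the serialized graph.
const_sizes = sorted(
    ((node.name, node.ByteSize()) for node in graph_def.node if node.op == "Const"),
    key=lambda item: item[1], reverse=True)

for name, size in const_sizes[:10]:
    print("{}: {} bytes".format(name, size))
```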

mhajiaghayi commented 5 years ago

I have a similar problem, but I'm using a custom estimator with an input_fn that reads data from a generator. Everything works fine except for this huge TensorBoard file, which I think also reduces the speed. Increasing the save_summary_steps parameter doesn't make the problem go away. This is my input_fn:

```python
def inputFn(self, tasks, mode, params=None):
    # Generate a dataset from a generator function for each task.
    types = ((tf.string, tf.string, tf.float32), tf.float32)
    shapes = ((None, None, None), None)
    datasets = []
    for task in tasks:
        dataset = tf.data.Dataset.from_generator(
            functools.partial(task.relevanceDb.batchGen, self.args, mode),
            output_shapes=shapes, output_types=types)
        datasets.append(dataset)
    if tf.__version__ == "1.8.0":
        multiDataset = datasets[0]
    else:
        choice = tf.data.Dataset.range(len(datasets)).flat_map(self.getBatch).repeat(self.args.maxSteps)
        multiDataset = tf.data.experimental.choose_from_datasets(datasets, choice)
    multiDataset = multiDataset.batch(self.args.batchSize).prefetch(1)
    # Note: the original snippet omitted the return statement an input_fn needs.
    return multiDataset
```

Slyne commented 4 years ago

Any update?

ritou11 commented 3 years ago

> @nfelt Thank you! Following your pointer and suggestion I ended up refactoring my code to use tf.estimator.inputs.numpy_input_fn instead of directly invoking from_tensor_slices. (I also explored writing my own implementation to use placeholders and then feed the arrays into tensors one batch at a time, but I felt I was just writing a basic implementation of the numpy_input_fn code.)
>
> This completely solved the issue with the size of tfevent files, plus my code is noticeably faster (presumably because there is a lot less copying going on, which I did not realize was happening before).
>
> Thank you again.

Thank you. This is working for me. It also resolves my memory leak problem. Everything seems fine now.

pindinagesh commented 2 years ago

@martin-gorner

Could you please respond to @ritou11's comment above? Thanks.