wchargin opened this issue 5 years ago
cc @stephanwlee
I’ve triaged this as high priority because it makes logdirs quadratically sized. I have a logdir that should be 50MB but is 2GB.
A few things I noticed:

1. Keras uses the default graph's ops for defining the graph.
2. All model calls are basically adding onto the same graph.
3. `FuncGraph('keras_graph')`: I do lack knowledge of how this is actually used. Specifically, how are the nodes in this graph added when instantiating a Keras layer? In any case, this should have the same problem as (2), as it is sharing the same FuncGraph.

As a quick check, I reset the default graph (`tf.compat.v1.reset_default_graph()`) before calling the model func, and it resulted in the below:
run 0: 75408 bytes
run 1: 75408 bytes
run 2: 75408 bytes
run 3: 75408 bytes
run 4: 75408 bytes
run 5: 75408 bytes
run 6: 75408 bytes
run 7: 75408 bytes
run 8: 75408 bytes
run 9: 75408 bytes
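For reference, the hack looks roughly like this (a sketch; `train_one_model` is a hypothetical stand-in for the per-run model-building-and-training code, and the run count is illustrative):

```python
import tensorflow as tf

def train_one_model(run):
    # Hypothetical stand-in: build, compile, and fit one Keras model,
    # logging to logs/<run> with a TensorBoard callback.
    ...

for run in range(10):
    # The hack: wipe the global default graph so this run's model does
    # not carry along the ops accumulated by all previous runs.
    tf.compat.v1.reset_default_graph()
    train_one_model(run)
```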
Funnily enough, this has implications for the execution time too :)
# BEFORE
1000/1000 [==============================] - 0s 191us/sample - loss: 2.3172
1000/1000 [==============================] - 0s 154us/sample - loss: 2.3278
1000/1000 [==============================] - 0s 168us/sample - loss: 2.3309
1000/1000 [==============================] - 0s 174us/sample - loss: 2.3373
1000/1000 [==============================] - 0s 207us/sample - loss: 2.3350
1000/1000 [==============================] - 0s 235us/sample - loss: 2.3242
1000/1000 [==============================] - 0s 237us/sample - loss: 2.3371
1000/1000 [==============================] - 0s 247us/sample - loss: 2.3092
1000/1000 [==============================] - 0s 251us/sample - loss: 2.3562
1000/1000 [==============================] - 0s 268us/sample - loss: 2.3335
# AFTER
1000/1000 [==============================] - 0s 207us/sample - loss: 2.3126
1000/1000 [==============================] - 0s 141us/sample - loss: 2.3214
1000/1000 [==============================] - 0s 138us/sample - loss: 2.3141
1000/1000 [==============================] - 0s 154us/sample - loss: 2.3143
1000/1000 [==============================] - 0s 140us/sample - loss: 2.3242
1000/1000 [==============================] - 0s 147us/sample - loss: 2.3290
1000/1000 [==============================] - 0s 144us/sample - loss: 2.3246
1000/1000 [==============================] - 0s 134us/sample - loss: 2.3250
1000/1000 [==============================] - 0s 148us/sample - loss: 2.3380
1000/1000 [==============================] - 0s 137us/sample - loss: 2.3268
WARNING: this is not a replacement for a benchmark, but I think I see the trend :)
@omalleyt12, is my assessment correct? Also, can you shed some light on (3)? Thanks!
Taylor / Tom, can you take a look and update?
You are correct; all Keras models share a single graph. (Unless called under an explicit graph scope, in which case they will use that instead.) The primary reason for this is so that you can mix and match models, which means the ops have to live on the same graph. However, this does indeed introduce issues as you accrue more and more orphaned stuff. For now, the solution is to call `tf.keras.backend.clear_session()`. (Despite the name, it works in both 1.x and 2.0.) This just nukes everything and starts over.
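In loop form, the workaround looks like this (a sketch; `train_one_model` is again a hypothetical stand-in for the per-run build-and-fit code):

```python
import tensorflow as tf

def train_one_model(run):
    # Hypothetical stand-in: build, compile, and fit one Keras model.
    ...

for run in range(10):
    # Discard the shared global Keras graph (and all other Keras state)
    # so each run starts from scratch.
    tf.keras.backend.clear_session()
    train_one_model(run)
```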
I am currently working on a prototype to break up the Keras global graph. One of the key motivating factors is our current inability to garbage-collect without clearing everything. I don't have an estimate for the timeline, but I'll keep you apprised.
Running many Keras models in one Python script leads to steadily increasing event file sizes, even when the models and callbacks are unrelated to each other.
I noticed this because I ran 200 runs overnight, and wondered why the event file for the first one was 220KB while the event file for the last run was 20MB.
Run the following script with `tf-nightly-2.0-preview` under Python 3, in a directory where you don't mind the `logs` directory being erased.
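A minimal sketch of such a script follows (the exact original script is not shown here, so the model shape, run count, and synthetic data are illustrative):

```python
# Minimal reproduction sketch: train several unrelated Keras models in one
# process, each with its own TensorBoard callback and its own logdir.
import shutil

import numpy as np
import tensorflow as tf

shutil.rmtree("logs", ignore_errors=True)  # start from a clean logdir

data = np.random.uniform(size=(1000, 10)).astype(np.float32)
labels = np.random.randint(10, size=(1000,))

for run in range(10):
    # Each run builds a fresh, unrelated model, yet the graph written by
    # the TensorBoard callback grows with every iteration.
    model = tf.keras.models.Sequential([
        tf.keras.layers.Dense(16, activation="relu", input_shape=(10,)),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    model.fit(
        data, labels, epochs=1,
        callbacks=[tf.keras.callbacks.TensorBoard(log_dir="logs/%d" % run)],
    )
```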
Then run `wc -c logs/**/train/*` to see each run's event file size. In each case, the graph makes up the vast majority of the event file (all but a kilobyte or so).