tensorflow / models

Models and examples built with TensorFlow

textsum memory leak with distributed running #593

Closed caothanhha9 closed 8 years ago

caothanhha9 commented 8 years ago

textsum

When I re-configure the textsum module (seq2seq_attention.py) to run with several servers, memory usage on the chief machine exceeds 12GB (8GB RAM + 4GB swap). The main change looks like this:

  if FLAGS.job_name == "ps":
    server.join()
  elif FLAGS.job_name == "worker":
    # replica_device_setter places variables on the ps tasks and
    # assigns ops to the local worker by default.
    with tf.device(tf.train.replica_device_setter(
        worker_device="/job:worker/task:%d" % FLAGS.task_index,
        cluster=cluster)):
      # with tf.device('/cpu:0'):
      model.build_graph()
      saver = tf.train.Saver()
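
For reference, the snippet above assumes the usual distributed-TensorFlow boilerplate (the cluster, server, and job_name/task_index flags) defined earlier in the script. A minimal sketch of that setup; the flag names ps_hosts and worker_hosts are illustrative and may differ from the actual script:

import tensorflow as tf

# Illustrative flags; the real script may name or parse these differently.
tf.app.flags.DEFINE_string("ps_hosts", "host0:2222", "Comma-separated ps hosts")
tf.app.flags.DEFINE_string("worker_hosts", "host1:2222,host2:2222", "Comma-separated worker hosts")
tf.app.flags.DEFINE_string("job_name", "worker", "'ps' or 'worker'")
tf.app.flags.DEFINE_integer("task_index", 0, "Index of the task within its job")
FLAGS = tf.app.flags.FLAGS

# Every process builds the same ClusterSpec and starts its own server.
cluster = tf.train.ClusterSpec({
    "ps": FLAGS.ps_hosts.split(","),
    "worker": FLAGS.worker_hosts.split(","),
})
server = tf.train.Server(cluster,
                         job_name=FLAGS.job_name,
                         task_index=FLAGS.task_index)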
drpngx commented 8 years ago

Not sure what the question is. Are you expecting less usage on the master?

drpngx commented 8 years ago

@panyx0718 @peterjliu question about textsum

panyx0718 commented 8 years ago

12GB is not a very large amount of memory for such a model.


drpngx commented 8 years ago

So it works as expected.

caothanhha9 commented 8 years ago

@drpngx @panyx0718 Thank you very much!

caothanhha9 commented 8 years ago

@panyx0718 How much RAM should be sufficient? @drpngx Given @panyx0718's explanation, could you suggest a way to reduce the master's (chief worker's) memory usage? Would in-graph parallelism help?

P.S. I'm sorry if these questions relate more to TensorFlow's distributed mechanism than to the textsum model itself, but I really want to run this model in parallel, so I'm asking here directly. Thank you very much!
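
For what it's worth, here is a hedged sketch of one place the chief's extra memory can come from in between-graph replication: only the chief builds and runs the Saver/summary machinery on top of its own replica, and that overhead can be dialed down. This is illustrative, not a confirmed fix for this issue; model, saver, and server are the objects from the snippet earlier in the thread, and run_one_step is a hypothetical stand-in for textsum's per-step training call.

import tensorflow as tf

def run_one_step(sess, model):
  # Hypothetical placeholder for seq2seq_attention's per-step training call.
  pass

def train_between_graph(model, saver, server, is_chief, log_root):
  # Only the chief checkpoints and writes summaries; non-chief workers
  # pass saver=None so they skip that work entirely.
  sv = tf.train.Supervisor(
      is_chief=is_chief,
      logdir=log_root,
      saver=saver if is_chief else None,
      summary_op=None,          # write summaries manually, less often
      save_model_secs=600,      # checkpoint at most every 10 minutes
      global_step=model.global_step)
  with sv.managed_session(server.target) as sess:
    while not sv.should_stop():
      run_one_step(sess, model)

Passing a tf.ConfigProto(log_device_placement=True) to managed_session can also confirm that the model's variables really end up on the ps tasks rather than on the chief.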