tyliupku / wiki2bio

Code for AAAI2018 paper "Table-to-text Generation by Structure-aware Seq2seq Learning"
https://arxiv.org/abs/1711.09724

OOM Error while running Main.py #1

Open learningneo opened 6 years ago

learningneo commented 6 years ago

Hello, thanks for providing the code. The requirement.txt file is missing, but I was largely able to set up the code and run it. I am hitting a runtime OOM error:

W tensorflow/core/framework/op_kernel.cc:993] Resource exhausted: OOM when allocating tensor with shape[32,1226,20003]
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:247] PoolAllocator: After 0 get requests, put_count=11679 evicted_count=9000 eviction_rate=0.770614 and unsatisfied allocation rate=0
Traceback (most recent call last):
  File "Main.py", line 230, in <module>
    main()
  File "Main.py", line 223, in main
    train(sess, dataloader, model)
  File "Main.py", line 91, in train
    loss += model(x, sess)
  File "/path/to/program/wiki2bio/SeqUnit.py", line 410, in __call__
    self.decoder_len: x['dec_len'], self.decoder_output: x['dec_out']})
  File "/opt/.conda/envs/tabtotext/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 767, in run
    run_metadata_ptr)
  File "/opt/.conda/envs/tabtotext/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 965, in _run
    feed_dict_string, options, run_metadata)
  File "/opt/.conda/envs/tabtotext/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1015, in _do_run
    target_list, options, run_metadata)
  File "/opt/.conda/envs/tabtotext/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1035, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[32,1226,20003]
         [[Node: transpose_4 = Transpose[T=DT_FLOAT, Tperm=DT_INT32, _device="/job:localhost/replica:0/task:0/gpu:0"](TensorArrayStack_1/TensorArrayGatherV3/_447, transpose_4/perm)]]
         [[Node: clip_by_global_norm/clip_by_global_norm/_8/_647 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_3467_clip_by_global_norm/clip_by_global_norm/_8", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]

Caused by op u'transpose_4', defined at:
  File "Main.py", line 230, in <module>
    main()
  File "Main.py", line 217, in main
    encoder_add_pos=FLAGS.encoder_pos, learning_rate=FLAGS.learning_rate)
  File "/path/to/program/wiki2bio/SeqUnit.py", line 130, in __init__
    de_outputs, de_state = self.decoder_t(en_state, self.decoder_embed, self.decoder_len)
  File "/path/to/program/wiki2bio/SeqUnit.py", line 237, in decoder_t
    outputs = tf.transpose(emit_ta.stack(), [1,0,2])
  File "/opt/.conda/envs/tabtotext/lib/python2.7/site-packages/tensorflow/python/ops/array_ops.py", line 1288, in transpose
    ret = gen_array_ops.transpose(a, perm, name=name)
  File "/opt/.conda/envs/tabtotext/lib/python2.7/site-packages/tensorflow/python/ops/gen_array_ops.py", line 3841, in transpose
    result = _op_def_lib.apply_op("Transpose", x=x, perm=perm, name=name)
  File "/opt/.conda/envs/tabtotext/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 763, in apply_op
    op_def=op_def)
  File "/opt/.conda/envs/tabtotext/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2395, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/opt/.conda/envs/tabtotext/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1264, in __init__
    self._traceback = _extract_stack()

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[32,1226,20003]
         [[Node: transpose_4 = Transpose[T=DT_FLOAT, Tperm=DT_INT32, _device="/job:localhost/replica:0/task:0/gpu:0"](TensorArrayStack_1/TensorArrayGatherV3/_447, transpose_4/perm)]]
         [[Node: clip_by_global_norm/clip_by_global_norm/_8/_647 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_3467_clip_by_global_norm/clip_by_global_norm/_8", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]

I am using a TITAN X GPU with 12 GiB of memory. Which parameters should be decreased or tuned to get the prototype running? Any help will be much appreciated.
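For context, a back-of-the-envelope estimate (just my own arithmetic, not anything from the repository) shows that this single tensor is already close to 3 GiB; judging from the traceback, the three dimensions appear to be batch size, decoder length, and target vocabulary size, which is why those look like the natural knobs to turn:

# Rough size of the tensor that fails to allocate.
# Assumption: shape [32, 1226, 20003] = batch_size x decoder steps x target vocab,
# as suggested by the traceback (transpose of the stacked decoder outputs).
batch_size, dec_len, vocab_size = 32, 1226, 20003
bytes_per_float32 = 4
tensor_bytes = batch_size * dec_len * vocab_size * bytes_per_float32
print("single logits-sized tensor: %.2f GiB" % (tensor_bytes / 2.0 ** 30))
# -> about 2.92 GiB, before gradients and other activations are counted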

learningneo commented 6 years ago

Hi, so even after changing target_vocab to 10000, emb_size to 200, and batch_size to 24, I am still facing the OOM error. Please let me know if I am doing something wrong here, or if there are other parameters or parameter values I should try, given my infrastructure.

berzentine commented 6 years ago

Were you able to solve this? We too are not able to run the released code end to end.

learningneo commented 6 years ago

Nah.. I tried to run it on a huge CPU server; even there I got an OOM error. So my only guess is that the authors are running on large multi-GPU machines, but I can't be sure at all. Waiting for the author/team to comment further, whenever possible.

tyliupku commented 6 years ago

Sorry for the late response!

It's very confusing to me why TensorFlow tries to allocate a tensor with shape [32,1226,20003], because we limit the length of input sequences to 100 in DataLoader.py (lines 93-98):

if max_text_len > self.man_text_len:  # self.man_text_len = 100
    text = text[:self.man_text_len]
    field = field[:self.man_text_len]
    pos = pos[:self.man_text_len]
    rpos = rpos[:self.man_text_len]
    text_len = min(text_len, self.man_text_len)

So an input sequence of length 1226 seems impossible (it should be less than 100) under our configuration.
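That said, one possible reading of the traceback (a guess, not something verified): the failing op transpose_4 is created in decoder_t (SeqUnit.py line 237), i.e. it is the transpose of the stacked decoder outputs, so the 1226 could be a decoder/summary length rather than the encoder text length that the snippet above caps. If that is the case, capping the summary side of each example in the same way would bound that dimension. A purely hypothetical sketch (MAN_SUMMARY_LEN and the argument names are illustrative only, not code from the repository):

# Hypothetical: cap the decoder-side (summary) length the same way the
# encoder text is capped, so the stacked decoder logits stay bounded.
MAN_SUMMARY_LEN = 100

def truncate_summary(dec_in, dec_out, dec_len, max_len=MAN_SUMMARY_LEN):
    # Clip one summary example to at most max_len decoder steps.
    dec_in = dec_in[:max_len]
    dec_out = dec_out[:max_len]
    dec_len = min(dec_len, max_len)
    return dec_in, dec_out, dec_len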

Actually, we never hit the OOM problem while running on a single GTX 1080 Ti GPU (11 GiB RAM). It takes about 4.4 GiB of GPU memory to run our code with batch size 32.

erwin-d-austria commented 6 years ago

Any update on the error?

erwin-d-austria commented 6 years ago

What is the equivalent of print 'stty' on Windows?

littlefis commented 5 years ago

I also encountered this problem, but when I changed the batch_size to 16, it no longer appeared.
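That is consistent with the rough arithmetic above: halving the batch roughly halves the largest decoder-output tensor (again just illustrative numbers, assuming the same interpretation of the shape):

# Same estimate as before, with batch_size 16 instead of 32.
batch_size, dec_len, vocab_size = 16, 1226, 20003
tensor_gib = batch_size * dec_len * vocab_size * 4 / 2.0 ** 30
print("single logits-sized tensor at batch 16: %.2f GiB" % tensor_gib)
# -> about 1.46 GiB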