lukalabs / cakechat

CakeChat: Emotional Generative Dialog System
Apache License 2.0

Quick start for GPU version: #67

Closed: xandyxor closed this issue 5 years ago

xandyxor commented 5 years ago

I got an error after running this command:

docker pull lukalabs/cakechat-gpu:latest && \
nvidia-docker run --name cakechat-gpu-server -p 127.0.0.1:8080:8080 -it lukalabs/cakechat-gpu:latest bash -c "CUDA_VISIBLE_DEVICES=0 python bin/cakechat_server.py"

Output:

2019-08-18 14:09:11.496399: W tensorflow/core/common_runtime/bfc_allocator.cc:271] ***************************************************************************************xxxxxxxxxxxxx
2019-08-18 14:09:11.496449: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at random_op.cc:202 : Resource exhausted: OOM when allocating tensor with shape[50000,128] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1334, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[768,2304] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
     [[{{node decoder_scope/decoder_1/random_uniform/RandomUniform}} = RandomUniform[T=DT_INT32, dtype=DT_FLOAT, seed=87654321, seed2=5561963, _device="/job:localhost/replica:0/task:0/device:GPU:0"](decoder_scope/decoder_1/random_uniform/shape)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "bin/cakechat_server.py", line 11, in <module>
    from cakechat.api.v1.server import app
  File "/root/cakechat/cakechat/api/v1/server.py", line 3, in <module>
    from cakechat.api.response import get_response
  File "/root/cakechat/cakechat/api/response.py", line 14, in <module>
    _cakechat_model = get_trained_model(reverse_model=get_reverse_model(PREDICTION_MODE))
  File "/usr/local/lib/python3.5/dist-packages/cachetools/__init__.py", line 46, in wrapper
    v = func(*args, **kwargs)
  File "/root/cakechat/cakechat/dialog_model/factory.py", line 76, in get_trained_model
    model.init_model()
  File "/root/cakechat/cakechat/dialog_model/keras_model.py", line 30, in wrapper
    return func(*args, **kwargs)
  File "/root/cakechat/cakechat/dialog_model/keras_model.py", line 279, in init_model
    self.print_weights_summary()
  File "/root/cakechat/cakechat/dialog_model/keras_model.py", line 30, in wrapper
    return func(*args, **kwargs)
  File "/root/cakechat/cakechat/dialog_model/keras_model.py", line 263, in print_weights_summary
    weights = self._model.get_weights()
  File "/usr/local/lib/python3.5/dist-packages/keras/engine/network.py", line 492, in get_weights
    return K.batch_get_value(weights)
  File "/usr/local/lib/python3.5/dist-packages/keras/backend/tensorflow_backend.py", line 2420, in batch_get_value
    return get_session().run(ops)
  File "/usr/local/lib/python3.5/dist-packages/keras/backend/tensorflow_backend.py", line 206, in get_session
    session.run(tf.variables_initializer(uninitialized_vars))
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1328, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[768,2304] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
     [[node decoder_scope/decoder_1/random_uniform/RandomUniform (defined at /usr/local/lib/python3.5/dist-packages/keras/backend/tensorflow_backend.py:4139)  = RandomUniform[T=DT_INT32, dtype=DT_FLOAT, seed=87654321, seed2=5561963, _device="/job:localhost/replica:0/task:0/device:GPU:0"](decoder_scope/decoder_1/random_uniform/shape)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

Caused by op 'decoder_scope/decoder_1/random_uniform/RandomUniform', defined at:
  File "bin/cakechat_server.py", line 11, in <module>
    from cakechat.api.v1.server import app
  File "<frozen importlib._bootstrap>", line 969, in _find_and_load
  File "<frozen importlib._bootstrap>", line 958, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 673, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 665, in exec_module
  File "<frozen importlib._bootstrap>", line 222, in _call_with_frames_removed
  File "/root/cakechat/cakechat/api/v1/server.py", line 3, in <module>
    from cakechat.api.response import get_response
  File "<frozen importlib._bootstrap>", line 969, in _find_and_load
  File "<frozen importlib._bootstrap>", line 958, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 673, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 665, in exec_module
  File "<frozen importlib._bootstrap>", line 222, in _call_with_frames_removed
  File "/root/cakechat/cakechat/api/response.py", line 14, in <module>
    _cakechat_model = get_trained_model(reverse_model=get_reverse_model(PREDICTION_MODE))
  File "/usr/local/lib/python3.5/dist-packages/cachetools/__init__.py", line 46, in wrapper
    v = func(*args, **kwargs)
  File "/root/cakechat/cakechat/dialog_model/factory.py", line 76, in get_trained_model
    model.init_model()
  File "/root/cakechat/cakechat/dialog_model/keras_model.py", line 30, in wrapper
    return func(*args, **kwargs)
  File "/root/cakechat/cakechat/dialog_model/keras_model.py", line 277, in init_model
    self._model = self._build_model()
  File "/root/cakechat/cakechat/dialog_model/model.py", line 253, in _build_model
    decoder_training_model, decoder_model = self._decoder(y_tokens_emb_model, condition_emb_model)
  File "/root/cakechat/cakechat/dialog_model/model.py", line 412, in _decoder
    (outputs_seq_0, initial_state=dec_hs_1)
  File "/usr/local/lib/python3.5/dist-packages/keras/layers/recurrent.py", line 570, in __call__
    output = super(RNN, self).__call__(full_input, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/keras/engine/base_layer.py", line 431, in __call__
    self.build(unpack_singleton(input_shapes))
  File "/usr/local/lib/python3.5/dist-packages/keras/layers/cudnn_recurrent.py", line 237, in build
    constraint=self.kernel_constraint)
  File "/usr/local/lib/python3.5/dist-packages/keras/legacy/interfaces.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/keras/engine/base_layer.py", line 249, in add_weight
    weight = K.variable(initializer(shape),
  File "/usr/local/lib/python3.5/dist-packages/keras/initializers.py", line 218, in __call__
    dtype=dtype, seed=self.seed)
  File "/usr/local/lib/python3.5/dist-packages/keras/backend/tensorflow_backend.py", line 4139, in random_uniform
    dtype=dtype, seed=seed)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/random_ops.py", line 243, in random_uniform
    rnd = gen_random_ops.random_uniform(shape, dtype, seed=seed1, seed2=seed2)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/gen_random_ops.py", line 733, in random_uniform
    name=name)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 3274, in create_op
    op_def=op_def)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
    self._traceback = tf_stack.extract_stack()

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[768,2304] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
     [[node decoder_scope/decoder_1/random_uniform/RandomUniform (defined at /usr/local/lib/python3.5/dist-packages/keras/backend/tensorflow_backend.py:4139)  = RandomUniform[T=DT_INT32, dtype=DT_FLOAT, seed=87654321, seed2=5561963, _device="/job:localhost/replica:0/task:0/device:GPU:0"](decoder_scope/decoder_1/random_uniform/shape)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

The CPU version works fine, though.

stale[bot] commented 5 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

nicolas-ivanov commented 5 years ago

@xandyxor The first line of output reads "Resource exhausted: OOM when allocating tensor with shape[50000,128] and type float".

OOM stands for "out of memory". How much memory does your GPU have?
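For a sense of scale, here is a rough sketch of how much memory the two tensors named in the log would need. It assumes "type float" means 32-bit floats (4 bytes per element), which is what TensorFlow's DT_FLOAT denotes; the helper names `tensor_bytes` and `to_mib` are made up for this illustration.

```python
BYTES_PER_FLOAT32 = 4  # TensorFlow's DT_FLOAT is a 32-bit float


def tensor_bytes(shape, bytes_per_element=BYTES_PER_FLOAT32):
    """Return the bytes needed for a dense tensor of the given shape."""
    count = 1
    for dim in shape:
        count *= dim
    return count * bytes_per_element


def to_mib(num_bytes):
    """Convert a byte count to mebibytes."""
    return num_bytes / (1024 ** 2)


# Shapes copied from the OOM messages in the log above.
for shape in [(50000, 128), (768, 2304)]:
    print(shape, "{:.2f} MiB".format(to_mib(tensor_bytes(shape))))
```

Both tensors come out at well under 100 MiB, so the failure probably does not mean these particular tensors are too big. More likely the GPU memory was already almost full when they were requested, for example because of earlier allocations or another process using the same GPU.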

To reduce memory usage, you can lower the OUTPUT_SEQUENCE_LENGTH and SAMPLES_NUM_FOR_RERANKING parameters in the config file.
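As a sketch of what that change might look like, assuming the config file is a Python module with plain assignments. The two parameter names are the ones mentioned above; the values are hypothetical and only illustrate "smaller than before", so check the defaults in your copy of the config first.

```python
# Hypothetical values for illustration only, not the project's defaults.
# Lower numbers should mean shorter generated responses and fewer
# candidate samples held in memory during reranking.
OUTPUT_SEQUENCE_LENGTH = 16
SAMPLES_NUM_FOR_RERANKING = 5
```

If the server still runs out of memory after this, the cause is more likely to be the GPU itself (too little memory, or memory held by another process) than these settings.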

See a similar issue: https://github.com/lukalabs/cakechat/issues/66