lvapeab / nmt-keras

Neural Machine Translation with Keras
http://nmt-keras.readthedocs.io
MIT License

Detecting multiple GPUs #138

Closed VP007-py closed 3 years ago

VP007-py commented 3 years ago

I'm trying to run this on a cluster with N_GPUS=3 in config.py. I tried to fix it from here, but the error persists. Any feedback?

Using TensorFlow backend.
[11/08/2020 19:35:24] Limited tf.compat.v2.summary API due to missing TensorBoard installation.
[11/08/2020 19:35:26] Running training.
[11/08/2020 19:35:26] Building mydata55_hien dataset
[11/08/2020 19:35:26]   Applying tokenization function: "tokenize_none".
[11/08/2020 19:35:27] Creating vocabulary for data with data_id 'target_text'.
[11/08/2020 19:35:27]    Total: 34428 unique words in 80000 sentences with a total of 401199 words.
[11/08/2020 19:35:27] Creating dictionary of all words
[11/08/2020 19:35:27] Loaded "train" set outputs of data_type "text-features" with data_id "target_text" and length 80000.
[11/08/2020 19:35:27]   Applying tokenization function: "tokenize_none".
[11/08/2020 19:35:27] Loaded "val" set outputs of data_type "text" with data_id "target_text" and length 1800.
[11/08/2020 19:35:27]   Applying tokenization function: "tokenize_none".
[11/08/2020 19:35:27] Loaded "test" set outputs of data_type "text" with data_id "target_text" and length 1000.
[11/08/2020 19:35:28]   Applying tokenization function: "tokenize_none".
[11/08/2020 19:35:28] Creating vocabulary for data with data_id 'source_text'.
[11/08/2020 19:35:28]    Total: 42963 unique words in 80000 sentences with a total of 466800 words.
[11/08/2020 19:35:28] Creating dictionary of all words
[11/08/2020 19:35:28] Loaded "train" set inputs of data_type "text-features" with data_id "source_text" and length 80000.
[11/08/2020 19:35:29]   Applying tokenization function: "tokenize_none".
[11/08/2020 19:35:29] Creating vocabulary for data with data_id 'state_below'.
[11/08/2020 19:35:29]    Total: 34428 unique words in 80000 sentences with a total of 401199 words.
[11/08/2020 19:35:29] Creating dictionary of all words
[11/08/2020 19:35:30] Loaded "train" set inputs of data_type "text-features" with data_id "state_below" and length 80000.
[11/08/2020 19:35:30] Loaded "train" set inputs of type "file-name" with id "raw_source_text".
[11/08/2020 19:35:30]   Applying tokenization function: "tokenize_none".
[11/08/2020 19:35:30] Loaded "test" set inputs of data_type "text-features" with data_id "source_text" and length 1000.
[11/08/2020 19:35:30] Loaded "test" set inputs of data_type "ghost" with data_id "state_below" and length 1000.
[11/08/2020 19:35:30] Loaded "test" set inputs of type "file-name" with id "raw_source_text".
[11/08/2020 19:35:30]   Applying tokenization function: "tokenize_none".
[11/08/2020 19:35:30] Loaded "val" set inputs of data_type "text-features" with data_id "source_text" and length 1800.
[11/08/2020 19:35:30] Loaded "val" set inputs of data_type "ghost" with data_id "state_below" and length 1800.
[11/08/2020 19:35:30] Loaded "val" set inputs of type "file-name" with id "raw_source_text".
[11/08/2020 19:35:30] Keeping 1 captions per input on the val set.
[11/08/2020 19:35:30] Samples reduced to 1800 in val set.
[11/08/2020 19:35:30] <<< Saving Dataset instance to datasets/Dataset_mydata55_hien.pkl ... >>>
[11/08/2020 19:35:31] <<< Dataset instance saved >>>
[11/08/2020 19:35:31] <<< Building AttentionRNNEncoderDecoder Translation_Model >>>
[11/08/2020 19:35:31] From /home/pandramish.vinay/.local/lib/python3.5/site-packages/keras/backend/tensorflow_backend.py:650: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.

[11/08/2020 19:35:31] From /home/pandramish.vinay/.local/lib/python3.5/site-packages/keras/backend/tensorflow_backend.py:4786: The name tf.random_uniform is deprecated. Please use tf.random.uniform instead.

[11/08/2020 19:35:31] From /home/pandramish.vinay/.local/lib/python3.5/site-packages/keras/backend/tensorflow_backend.py:157: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

[11/08/2020 19:35:37] From /home/pandramish.vinay/.local/lib/python3.5/site-packages/keras/backend/tensorflow_backend.py:3561: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
2020-08-11 19:35:39.189366: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-08-11 19:35:39.216265: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2400000000 Hz
2020-08-11 19:35:39.217710: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0xb6fd630 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-08-11 19:35:39.217731: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
Traceback (most recent call last):
  File "main.py", line 51, in <module>
    train_model(parameters, args.dataset)
  File "/home/pandramish.vinay/nmt-keras/nmt_keras/training.py", line 93, in train_model
    clear_dirs=clear_dirs)
  File "/home/pandramish.vinay/nmt-keras/nmt_keras/model_zoo.py", line 155, in __init__
    eval('self.' + model_type + '(params)')
  File "<string>", line 1, in <module>
  File "/home/pandramish.vinay/nmt-keras/nmt_keras/model_zoo.py", line 762, in AttentionRNNEncoderDecoder
    self.multi_gpu_model = multi_gpu_model(self.model, gpus=params['N_GPUS'])
  File "/home/pandramish.vinay/.local/lib/python3.5/site-packages/keras/utils/multi_gpu_utils.py", line 189, in multi_gpu_model
    available_devices))
ValueError: To call `multi_gpu_model` with `gpus=3`, we expect the following devices to be available: ['/cpu:0', '/gpu:0', '/gpu:1', '/gpu:2']. However this machine only has: ['/cpu:0']. Try reducing `gpus`.
lvapeab commented 3 years ago

It looks like TensorFlow doesn't detect any GPUs. Is tensorflow-gpu installed? Is the environment variable CUDA_VISIBLE_DEVICES properly set?
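For a quick sanity check that TensorFlow actually sees the GPUs, something like this (a TF 1.x sketch) should print the three GPU devices:

```python
# Sanity check (TF 1.x): list the devices TensorFlow can see.
# On a 3-GPU node this should include GPU:0, GPU:1 and GPU:2.
from tensorflow.python.client import device_lib

print(device_lib.list_local_devices())
```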

VP007-py commented 3 years ago

`echo $CUDA_VISIBLE_DEVICES` gives me 0,1,2, and after installing tensorflow-gpu I get:

Using TensorFlow backend.
2020-08-11 20:15:00.601205: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
[11/08/2020 20:15:07] Running training.
[11/08/2020 20:15:07] Building mydata55_hien dataset
[11/08/2020 20:15:08]   Applying tokenization function: "tokenize_none".
[11/08/2020 20:15:08] Creating vocabulary for data with data_id 'target_text'.
[11/08/2020 20:15:08]    Total: 34428 unique words in 80000 sentences with a total of 401199 words.
[11/08/2020 20:15:08] Creating dictionary of all words
[11/08/2020 20:15:08] Loaded "train" set outputs of data_type "text-features" with data_id "target_text" and length 80000.
[11/08/2020 20:15:08]   Applying tokenization function: "tokenize_none".
[11/08/2020 20:15:08] Loaded "val" set outputs of data_type "text" with data_id "target_text" and length 1800.
[11/08/2020 20:15:08]   Applying tokenization function: "tokenize_none".
[11/08/2020 20:15:08] Loaded "test" set outputs of data_type "text" with data_id "target_text" and length 1000.
[11/08/2020 20:15:09]   Applying tokenization function: "tokenize_none".
[11/08/2020 20:15:09] Creating vocabulary for data with data_id 'source_text'.
[11/08/2020 20:15:09]    Total: 42963 unique words in 80000 sentences with a total of 466800 words.
[11/08/2020 20:15:09] Creating dictionary of all words
[11/08/2020 20:15:10] Loaded "train" set inputs of data_type "text-features" with data_id "source_text" and length 80000.
[11/08/2020 20:15:10]   Applying tokenization function: "tokenize_none".
[11/08/2020 20:15:10] Creating vocabulary for data with data_id 'state_below'.
[11/08/2020 20:15:10]    Total: 34428 unique words in 80000 sentences with a total of 401199 words.
[11/08/2020 20:15:10] Creating dictionary of all words
[11/08/2020 20:15:11] Loaded "train" set inputs of data_type "text-features" with data_id "state_below" and length 80000.
[11/08/2020 20:15:11] Loaded "train" set inputs of type "file-name" with id "raw_source_text".
[11/08/2020 20:15:11]   Applying tokenization function: "tokenize_none".
[11/08/2020 20:15:11] Loaded "val" set inputs of data_type "text-features" with data_id "source_text" and length 1800.
[11/08/2020 20:15:11] Loaded "val" set inputs of data_type "ghost" with data_id "state_below" and length 1800.
[11/08/2020 20:15:11] Loaded "val" set inputs of type "file-name" with id "raw_source_text".
[11/08/2020 20:15:11]   Applying tokenization function: "tokenize_none".
[11/08/2020 20:15:11] Loaded "test" set inputs of data_type "text-features" with data_id "source_text" and length 1000.
[11/08/2020 20:15:11] Loaded "test" set inputs of data_type "ghost" with data_id "state_below" and length 1000.
[11/08/2020 20:15:11] Loaded "test" set inputs of type "file-name" with id "raw_source_text".
[11/08/2020 20:15:11] Keeping 1 captions per input on the val set.
[11/08/2020 20:15:11] Samples reduced to 1800 in val set.
[11/08/2020 20:15:11] <<< Saving Dataset instance to datasets/Dataset_mydata55_hien.pkl ... >>>
[11/08/2020 20:15:12] <<< Dataset instance saved >>>
[11/08/2020 20:15:12] <<< Building AttentionRNNEncoderDecoder Translation_Model >>>
Traceback (most recent call last):
  File "main.py", line 51, in <module>
    train_model(parameters, args.dataset)
  File "/home/pandramish.vinay/nmt-keras/nmt_keras/training.py", line 93, in train_model
    clear_dirs=clear_dirs)
  File "/home/pandramish.vinay/nmt-keras/nmt_keras/model_zoo.py", line 155, in __init__
    eval('self.' + model_type + '(params)')
  File "<string>", line 1, in <module>
  File "/home/pandramish.vinay/nmt-keras/nmt_keras/model_zoo.py", line 457, in AttentionRNNEncoderDecoder
    src_text = Input(name=self.ids_inputs[0], batch_shape=tuple([None, None]), dtype='int32')
  File "/home/pandramish.vinay/.local/lib/python3.5/site-packages/keras/engine/input_layer.py", line 178, in Input
    input_tensor=tensor)
  File "/home/pandramish.vinay/.local/lib/python3.5/site-packages/keras/legacy/interfaces.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "/home/pandramish.vinay/.local/lib/python3.5/site-packages/keras/engine/input_layer.py", line 87, in __init__
    name=self.name)
  File "/home/pandramish.vinay/.local/lib/python3.5/site-packages/keras/backend/tensorflow_backend.py", line 650, in placeholder
    x = tf.placeholder(dtype, shape=shape, name=name)
AttributeError: module 'tensorflow' has no attribute 'placeholder'
lvapeab commented 3 years ago

I think you installed TensorFlow 2.x. You should install 1.x (e.g. `pip install tensorflow-gpu==1.15`).
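After reinstalling, a quick check (a sketch for TF 1.x) that both the version and GPU visibility are right:

```python
# Confirm the TensorFlow version and GPU visibility after reinstalling.
import tensorflow as tf

print(tf.__version__)              # expect something like 1.15.x
print(tf.test.is_gpu_available())  # expect True if CUDA/cuDNN are set up
```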

VP007-py commented 3 years ago

This step resolves the errors, but now I'm running out of memory, as shown below.

python3 main.py 
Using TensorFlow backend.
[11/08/2020 20:50:13] Running training.
[11/08/2020 20:50:13] Building mydata55_hien dataset
[11/08/2020 20:50:13]   Applying tokenization function: "tokenize_none".
[11/08/2020 20:50:14] Creating vocabulary for data with data_id 'target_text'.
[11/08/2020 20:50:14]    Total: 34428 unique words in 80000 sentences with a total of 401199 words.
[11/08/2020 20:50:14] Creating dictionary of all words
[11/08/2020 20:50:14] Loaded "train" set outputs of data_type "text-features" with data_id "target_text" and length 80000.
[11/08/2020 20:50:14]   Applying tokenization function: "tokenize_none".
[11/08/2020 20:50:14] Loaded "val" set outputs of data_type "text" with data_id "target_text" and length 1800.
[11/08/2020 20:50:14]   Applying tokenization function: "tokenize_none".
[11/08/2020 20:50:14] Loaded "test" set outputs of data_type "text" with data_id "target_text" and length 1000.
[11/08/2020 20:50:15]   Applying tokenization function: "tokenize_none".
[11/08/2020 20:50:15] Creating vocabulary for data with data_id 'source_text'.
[11/08/2020 20:50:15]    Total: 42963 unique words in 80000 sentences with a total of 466800 words.
[11/08/2020 20:50:15] Creating dictionary of all words
[11/08/2020 20:50:15] Loaded "train" set inputs of data_type "text-features" with data_id "source_text" and length 80000.
[11/08/2020 20:50:16]   Applying tokenization function: "tokenize_none".
[11/08/2020 20:50:16] Creating vocabulary for data with data_id 'state_below'.
[11/08/2020 20:50:16]    Total: 34428 unique words in 80000 sentences with a total of 401199 words.
[11/08/2020 20:50:16] Creating dictionary of all words
[11/08/2020 20:50:17] Loaded "train" set inputs of data_type "text-features" with data_id "state_below" and length 80000.
[11/08/2020 20:50:17] Loaded "train" set inputs of type "file-name" with id "raw_source_text".
[11/08/2020 20:50:17]   Applying tokenization function: "tokenize_none".
[11/08/2020 20:50:17] Loaded "val" set inputs of data_type "text-features" with data_id "source_text" and length 1800.
[11/08/2020 20:50:17] Loaded "val" set inputs of data_type "ghost" with data_id "state_below" and length 1800.
[11/08/2020 20:50:17] Loaded "val" set inputs of type "file-name" with id "raw_source_text".
[11/08/2020 20:50:17]   Applying tokenization function: "tokenize_none".
[11/08/2020 20:50:17] Loaded "test" set inputs of data_type "text-features" with data_id "source_text" and length 1000.
[11/08/2020 20:50:17] Loaded "test" set inputs of data_type "ghost" with data_id "state_below" and length 1000.
[11/08/2020 20:50:17] Loaded "test" set inputs of type "file-name" with id "raw_source_text".
[11/08/2020 20:50:17] Keeping 1 captions per input on the val set.
[11/08/2020 20:50:17] Samples reduced to 1800 in val set.
[11/08/2020 20:50:17] <<< Saving Dataset instance to datasets/Dataset_mydata55_hien.pkl ... >>>
[11/08/2020 20:50:18] <<< Dataset instance saved >>>
[11/08/2020 20:50:18] <<< Building AttentionRNNEncoderDecoder Translation_Model >>>
[11/08/2020 20:50:18] From /home/pandramish.vinay/.local/lib/python3.5/site-packages/keras/backend/tensorflow_backend.py:650: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.

[11/08/2020 20:50:18] From /home/pandramish.vinay/.local/lib/python3.5/site-packages/keras/backend/tensorflow_backend.py:4786: The name tf.random_uniform is deprecated. Please use tf.random.uniform instead.

[11/08/2020 20:50:18] From /home/pandramish.vinay/.local/lib/python3.5/site-packages/keras/backend/tensorflow_backend.py:157: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

[11/08/2020 20:50:24] From /home/pandramish.vinay/.local/lib/python3.5/site-packages/keras/backend/tensorflow_backend.py:3561: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
[11/08/2020 20:50:26] From /home/pandramish.vinay/.local/lib/python3.5/site-packages/keras/backend/tensorflow_backend.py:292: The name tf.get_default_session is deprecated. Please use tf.compat.v1.get_default_session instead.

[11/08/2020 20:50:26] From /home/pandramish.vinay/.local/lib/python3.5/site-packages/keras/backend/tensorflow_backend.py:299: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.

[11/08/2020 20:50:26] From /home/pandramish.vinay/.local/lib/python3.5/site-packages/keras/backend/tensorflow_backend.py:308: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.

2020-08-11 20:50:26.283764: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-08-11 20:50:26.311783: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2399890000 Hz
2020-08-11 20:50:26.313223: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x9548a50 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-08-11 20:50:26.313241: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-08-11 20:50:26.314794: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-08-11 20:50:27.979868: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x9618b90 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-08-11 20:50:27.979910: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): GeForce GTX 1080 Ti, Compute Capability 6.1
2020-08-11 20:50:27.979920: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (1): GeForce GTX 1080 Ti, Compute Capability 6.1
2020-08-11 20:50:27.979927: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (2): GeForce GTX 1080 Ti, Compute Capability 6.1
2020-08-11 20:50:27.982474: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: 
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:02:00.0
2020-08-11 20:50:27.983341: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 1 with properties: 
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:03:00.0
2020-08-11 20:50:27.985994: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 2 with properties: 
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:83:00.0
2020-08-11 20:50:27.991316: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-08-11 20:50:28.058570: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2020-08-11 20:50:28.091144: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2020-08-11 20:50:28.102065: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2020-08-11 20:50:28.178217: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2020-08-11 20:50:28.227447: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2020-08-11 20:50:28.227515: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-08-11 20:50:28.233741: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0, 1, 2
2020-08-11 20:50:28.233800: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-08-11 20:50:28.238665: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-08-11 20:50:28.238686: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165]      0 1 2 
2020-08-11 20:50:28.238712: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0:   N Y N 
2020-08-11 20:50:28.238721: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 1:   Y N N 
2020-08-11 20:50:28.238732: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 2:   N N N 
2020-08-11 20:50:28.243290: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10481 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:02:00.0, compute capability: 6.1)
2020-08-11 20:50:28.244710: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 10481 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:03:00.0, compute capability: 6.1)
2020-08-11 20:50:28.245980: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 10481 MB memory) -> physical GPU (device: 2, name: GeForce GTX 1080 Ti, pci bus id: 0000:83:00.0, compute capability: 6.1)
[11/08/2020 20:50:28] From /home/pandramish.vinay/.local/lib/python3.5/site-packages/keras/backend/tensorflow_backend.py:312: The name tf.global_variables is deprecated. Please use tf.compat.v1.global_variables instead.

[11/08/2020 20:50:28] From /home/pandramish.vinay/.local/lib/python3.5/site-packages/keras/backend/tensorflow_backend.py:321: The name tf.is_variable_initialized is deprecated. Please use tf.compat.v1.is_variable_initialized instead.

[11/08/2020 20:50:28] From /home/pandramish.vinay/.local/lib/python3.5/site-packages/keras/backend/tensorflow_backend.py:328: The name tf.variables_initializer is deprecated. Please use tf.compat.v1.variables_initializer instead.

-----------------------------------------------------------------------------------
        TranslationModel instance
-----------------------------------------------------------------------------------
_model_type: AttentionRNNEncoderDecoder
name: mydata55_hien_AttentionRNNEncoderDecoder_src_emb_500_bidir_True_enc_LSTM_756_dec_ConditionalLSTM_756_deepout_linear_trg_emb_500_Adam_0.001
model_path: /scratch/trained_models/mydata55_hien_AttentionRNNEncoderDecoder_src_emb_500_bidir_True_enc_LSTM_756_dec_ConditionalLSTM_756_deepout_linear_trg_emb_500_Adam_0.001/
verbose: 1

Params:
    ACCUMULATE_GRADIENTS: 1
    ADDITIONAL_OUTPUT_MERGE_MODE: Add
    ALIGN_FROM_RAW: True
    ALPHA_FACTOR: 0.6
    AMSGRAD: False
    APPLY_DETOKENIZATION: False
    ATTENTION_DROPOUT_P: 0.0
    ATTENTION_MODE: add
    ATTENTION_SIZE: 756
    BATCH_NORMALIZATION_MODE: 1
    BATCH_SIZE: 412
    BEAM_SEARCH: True
    BEAM_SIZE: 3
    BETA_1: 0.9
    BETA_2: 0.999
    BIDIRECTIONAL_DEEP_ENCODER: True
    BIDIRECTIONAL_ENCODER: True
    BIDIRECTIONAL_MERGE_MODE: concat
    BPE_CODES_PATH: examples/mydata55//training_codes.joint
    CLASSIFIER_ACTIVATION: softmax
    CLIP_C: 5.0
    CLIP_V: 0.0
    COVERAGE_NORM_FACTOR: 0.2
    COVERAGE_PENALTY: False
    DATASET_NAME: mydata55
    DATASET_STORE_PATH: datasets/
    DATA_AUGMENTATION: False
    DATA_ROOT_PATH: examples/mydata55/
    DECODER_HIDDEN_SIZE: 756
    DECODER_RNN_TYPE: ConditionalLSTM
    DEEP_OUTPUT_LAYERS: [('linear', 500)]
    DETOKENIZATION_METHOD: detokenize_none
    DOUBLE_STOCHASTIC_ATTENTION_REG: 0.0
    DROPOUT_P: 0.0
    EARLY_STOP: True
    EMBEDDINGS_FREQ: 1
    ENCODER_HIDDEN_SIZE: 756
    ENCODER_RNN_TYPE: LSTM
    EPOCHS_FOR_SAVE: 1
    EPSILON: 1e-08
    EVAL_EACH: 1
    EVAL_EACH_EPOCHS: True
    EVAL_ON_SETS: ['val']
    EXTRA_NAME: 
    FF_SIZE: 128
    FILL: end
    FORCE_RELOAD_VOCABULARY: False
    GLOSSARY: None
    GRU_RESET_AFTER: True
    HEURISTIC: 0
    HOMOGENEOUS_BATCHES: False
    INIT_ATT: glorot_uniform
    INIT_FUNCTION: glorot_uniform
    INIT_LAYERS: ['tanh']
    INNER_INIT: orthogonal
    INPUTS_IDS_DATASET: ['source_text', 'state_below']
    INPUTS_IDS_MODEL: ['source_text', 'state_below']
    INPUTS_TYPES_DATASET: ['text-features', 'text-features']
    INPUT_VOCABULARY_SIZE: 42966
    JOINT_BATCHES: 4
    KERAS_METRICS: ['perplexity']
    LABEL_SMOOTHING: 0.0
    LENGTH_NORM_FACTOR: 0.2
    LENGTH_PENALTY: False
    LOG_DIR: tensorboard_logs
    LOSS: categorical_crossentropy
    LR: 0.001
    LR_DECAY: None
    LR_GAMMA: 0.8
    LR_HALF_LIFE: 100
    LR_REDUCER_EXP_BASE: -0.5
    LR_REDUCER_TYPE: exponential
    LR_REDUCE_EACH_EPOCHS: False
    LR_START_REDUCTION_ON_EPOCH: 0
    MAPPING: examples/mydata55//mapping.hi_en.pkl
    MAXLEN_GIVEN_X: True
    MAXLEN_GIVEN_X_FACTOR: 2
    MAX_EPOCH: 15
    MAX_INPUT_TEXT_LEN: 150
    MAX_OUTPUT_TEXT_LEN: 150
    MAX_OUTPUT_TEXT_LEN_TEST: 450
    MAX_PLOT_Y: 100.0
    METRICS: ['perplexity']
    MINLEN_GIVEN_X: True
    MINLEN_GIVEN_X_FACTOR: 3
    MIN_DELTA: 0.0
    MIN_LR: 1e-09
    MIN_OCCURRENCES_INPUT_VOCAB: 0
    MIN_OCCURRENCES_OUTPUT_VOCAB: 0
    MODE: training
    MODEL_NAME: mydata55_hien_AttentionRNNEncoderDecoder_src_emb_500_bidir_True_enc_LSTM_756_dec_ConditionalLSTM_756_deepout_linear_trg_emb_500_Adam_0.001
    MODEL_SIZE: 32
    MODEL_TYPE: AttentionRNNEncoderDecoder
    MOMENTUM: 0.0
    MULTIHEAD_ATTENTION_ACTIVATION: linear
    NESTEROV_MOMENTUM: False
    NOISE_AMOUNT: 0.01
    NORMALIZE_SAMPLING: False
    N_GPUS: 3
    N_HEADS: 8
    N_LAYERS_DECODER: 2
    N_LAYERS_ENCODER: 2
    N_SAMPLES: 5
    OPTIMIZED_SEARCH: True
    OPTIMIZER: Adam
    OUTPUTS_IDS_DATASET: ['target_text']
    OUTPUTS_IDS_MODEL: ['target_text']
    OUTPUTS_TYPES_DATASET: ['text-features']
    OUTPUT_VOCABULARY_SIZE: 34431
    PAD_ON_BATCH: True
    PARALLEL_LOADERS: 1
    PATIENCE: 5
    PLOT_EVALUATION: False
    POS_UNK: True
    REBUILD_DATASET: True
    RECURRENT_DROPOUT_P: 0.0
    RECURRENT_INPUT_DROPOUT_P: 0.0
    RECURRENT_WEIGHT_DECAY: 0.0
    REGULARIZATION_FN: L2
    RELOAD: 0
    RELOAD_EPOCH: False
    RHO: 0.9
    SAMPLE_EACH_UPDATES: 300
    SAMPLE_ON_SETS: ['train', 'val']
    SAMPLE_WEIGHTS: True
    SAMPLING: max_likelihood
    SAMPLING_SAVE_MODE: list
    SAVE_EACH_EVALUATION: True
    SCALE_SOURCE_WORD_EMBEDDINGS: False
    SCALE_TARGET_WORD_EMBEDDINGS: False
    SEARCH_PRUNING: False
    SKIP_VECTORS_HIDDEN_SIZE: 500
    SKIP_VECTORS_SHARED_ACTIVATION: tanh
    SOURCE_TEXT_EMBEDDING_SIZE: 500
    SRC_LAN: hi
    SRC_PRETRAINED_VECTORS: None
    SRC_PRETRAINED_VECTORS_TRAINABLE: True
    START_EVAL_ON_EPOCH: 1
    START_SAMPLING_ON_EPOCH: 1
    STOP_METRIC: Bleu_4
    STORE_PATH: /scratch/trained_models/mydata55_hien_AttentionRNNEncoderDecoder_src_emb_500_bidir_True_enc_LSTM_756_dec_ConditionalLSTM_756_deepout_linear_trg_emb_500_Adam_0.001/
    TARGET_TEXT_EMBEDDING_SIZE: 500
    TASK_NAME: mydata55
    TEMPERATURE: 1
    TENSORBOARD: True
    TEXT_FILES: {'train': 'train.', 'val': 'val.', 'test': 'TEST.'}
    TIE_EMBEDDINGS: False
    TOKENIZATION_METHOD: tokenize_none
    TOKENIZE_HYPOTHESES: True
    TOKENIZE_REFERENCES: True
    TRAINABLE_DECODER: True
    TRAINABLE_ENCODER: True
    TRAIN_ON_TRAINVAL: False
    TRG_LAN: en
    TRG_PRETRAINED_VECTORS: None
    TRG_PRETRAINED_VECTORS_TRAINABLE: True
    USE_BATCH_NORMALIZATION: True
    USE_CUDNN: False
    USE_L1: False
    USE_L2: False
    USE_NOISE: False
    USE_PRELU: False
    USE_TF_OPTIMIZER: True
    VERBOSE: 1
    WARMUP_EXP: -1.5
    WEIGHT_DECAY: 0.0001
    WRITE_VALID_SAMPLES: True
-----------------------------------------------------------------------------------
Model: "mydata55_hien_AttentionRNNEncoderDecoder_src_emb_500_bidir_True_enc_LSTM_756_dec_ConditionalLSTM_756_deepout_linear_trg_emb_500_Adam_0.001_training"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
source_text (InputLayer)        (None, None)         0                                            
__________________________________________________________________________________________________
source_word_embedding (Embeddin (None, None, 500)    21483000    source_text[0][0]                
__________________________________________________________________________________________________
src_embedding_batch_normalizati (None, None, 500)    2000        source_word_embedding[0][0]      
__________________________________________________________________________________________________
remove_mask_1 (RemoveMask)      (None, None, 500)    0           src_embedding_batch_normalization
__________________________________________________________________________________________________
bidirectional_encoder_LSTM (Bid (None, None, 1512)   7602336     remove_mask_1[0][0]              
__________________________________________________________________________________________________
annotations_batch_normalization (None, None, 1512)   6048        bidirectional_encoder_LSTM[0][0] 
__________________________________________________________________________________________________
bidirectional_encoder_1 (Bidire (None, None, 1512)   13722912    annotations_batch_normalization[0
__________________________________________________________________________________________________
annotations_1_batch_normalizati (None, None, 1512)   6048        bidirectional_encoder_1[0][0]    
__________________________________________________________________________________________________
add_1 (Add)                     (None, None, 1512)   0           annotations_batch_normalization[0
                                                                 annotations_1_batch_normalization
__________________________________________________________________________________________________
source_text_mask (GetMask)      (None, None, 500)    0           src_embedding_batch_normalization
__________________________________________________________________________________________________
annotations (ApplyMask)         (None, None, 1512)   0           add_1[0][0]                      
                                                                 source_text_mask[0][0]           
__________________________________________________________________________________________________
state_below (InputLayer)        (None, None)         0                                            
__________________________________________________________________________________________________
ctx_mean (MaskedMean)           (None, 1512)         0           annotations[0][0]                
__________________________________________________________________________________________________
target_word_embedding (Embeddin (None, None, 500)    17215500    state_below[0][0]                
__________________________________________________________________________________________________
initial_state (Dense)           (None, 756)          1143828     ctx_mean[0][0]                   
__________________________________________________________________________________________________
initial_memory (Dense)          (None, 756)          1143828     ctx_mean[0][0]                   
__________________________________________________________________________________________________
state_below_batch_normalization (None, None, 500)    2000        target_word_embedding[0][0]      
__________________________________________________________________________________________________
initial_state_batch_normalizati (None, 756)          3024        initial_state[0][0]              
__________________________________________________________________________________________________
initial_memory_batch_normalizat (None, 756)          3024        initial_memory[0][0]             
__________________________________________________________________________________________________
decoder_AttConditionalLSTMCond  [(None, None, 756),  12378745    state_below_batch_normalization[0
                                                                 annotations[0][0]                
                                                                 initial_state_batch_normalization
                                                                 initial_memory_batch_normalizatio
__________________________________________________________________________________________________
proj_h0_batch_normalization (Ba (None, None, 756)    3024        decoder_AttConditionalLSTMCond[0]
__________________________________________________________________________________________________
permute_general_1 (PermuteGener multiple             0           decoder_AttConditionalLSTMCond[0]
                                                                 logit_ctx[0][0]                  
__________________________________________________________________________________________________
decoder_LSTMCond1 (LSTMCond)    [(None, None, 756),  9147600     proj_h0_batch_normalization[0][0]
                                                                 permute_general_1[0][0]          
                                                                 initial_state_batch_normalization
                                                                 initial_memory_batch_normalizatio
__________________________________________________________________________________________________
proj_h1_batch_normalization (Ba (None, None, 756)    3024        decoder_LSTMCond1[0][0]          
__________________________________________________________________________________________________
add_2 (Add)                     (None, None, 756)    0           proj_h0_batch_normalization[0][0]
                                                                 proj_h1_batch_normalization[0][0]
__________________________________________________________________________________________________
logit_ctx (TimeDistributed)     (None, None, 500)    756500      decoder_AttConditionalLSTMCond[0]
__________________________________________________________________________________________________
logit_lstm (TimeDistributed)    (None, None, 500)    378500      add_2[0][0]                      
__________________________________________________________________________________________________
logit_emb (TimeDistributed)     (None, None, 500)    250500      state_below_batch_normalization[0
__________________________________________________________________________________________________
out_layer_mlp_batch_normalizati (None, None, 500)    2000        logit_lstm[0][0]                 
__________________________________________________________________________________________________
out_layer_ctx_batch_normalizati (None, None, 500)    2000        permute_general_1[1][0]          
__________________________________________________________________________________________________
out_layer_emb_batch_normalizati (None, None, 500)    2000        logit_emb[0][0]                  
__________________________________________________________________________________________________
additional_input (Add)          (None, None, 500)    0           out_layer_mlp_batch_normalization
                                                                 out_layer_ctx_batch_normalization
                                                                 out_layer_emb_batch_normalization
__________________________________________________________________________________________________
activation_1 (Activation)       (None, None, 500)    0           additional_input[0][0]           
__________________________________________________________________________________________________
linear_0 (TimeDistributed)      (None, None, 500)    250500      activation_1[0][0]               
__________________________________________________________________________________________________
out_layer_linear_0_batch_normal (None, None, 500)    2000        linear_0[0][0]                   
__________________________________________________________________________________________________
target_text (TimeDistributed)   (None, None, 34431)  17249931    out_layer_linear_0_batch_normaliz
==================================================================================================
Total params: 102,759,872
Trainable params: 102,741,776
Non-trainable params: 18,096
__________________________________________________________________________________________________
[11/08/2020 20:50:40] From /home/pandramish.vinay/nmt-keras/nmt_keras/model_zoo.py:213: The name tf.train.AdamOptimizer is deprecated. Please use tf.compat.v1.train.AdamOptimizer instead.

[11/08/2020 20:50:40] Preparing optimizer and compiling. Optimizer configuration: 
     LR: 0.001
     LOSS: categorical_crossentropy
     BETA_1: 0.9
     BETA_2: 0.999
     EPSILON: 1e-08
[11/08/2020 20:50:40] From /home/pandramish.vinay/.local/lib/python3.5/site-packages/keras/backend/tensorflow_backend.py:1192: The name tf.assign_add is deprecated. Please use tf.compat.v1.assign_add instead.

[11/08/2020 20:50:40] <<< Training model >>>
[11/08/2020 20:50:40] Training parameters: { 
    batch_size: 412
    class_weights: None
    da_enhance_list: []
    da_patch_type: resize_and_rndcrop
    data_augmentation: False
    each_n_epochs: 1
    epoch_offset: 0
    epochs_for_save: 1
    eval_on_epochs: True
    eval_on_sets: None
    extra_callbacks: [<keras_wrapper.extra.callbacks.EvalPerformance object at 0x14d0400de320>, <keras_wrapper.extra.callbacks.Sample object at 0x14d0400e6860>]
    homogeneous_batches: False
    initial_lr: 0.001
    joint_batches: 4
    lr_decay: None
    lr_gamma: 0.8
    lr_half_life: 100
    lr_reducer_exp_base: -0.5
    lr_reducer_type: exponential
    lr_warmup_exp: -1.5
    maxlen: 150
    mean_substraction: False
    metric_check: Bleu_4
    min_delta: 0.0
    min_lr: 1e-09
    n_epochs: 15
    n_gpus: 3
    n_parallel_loaders: 1
    normalization_type: None
    normalize: False
    num_iterations_val: None
    patience: 5
    patience_check_split: val
    reduce_each_epochs: False
    reload_epoch: 0
    shuffle: True
    start_eval_on_epoch: 1
    start_reduction_on_epoch: 0
    tensorboard: True
    tensorboard_params: {'write_grads': False, 'batch_size': 412, 'update_freq': 'epoch', 'embeddings_layer_names': None, 'histogram_freq': 0, 'write_images': False, 'word_embeddings_labels': None, 'write_graph': True, 'log_dir': 'tensorboard_logs', 'embeddings_freq': None, 'embeddings_metadata': None}
    verbose: 1
    wo_da_patch_type: whole
}
[11/08/2020 20:50:40] <<< creating directory /scratch/trained_models/mydata55_hien_AttentionRNNEncoderDecoder_src_emb_500_bidir_True_enc_LSTM_756_dec_ConditionalLSTM_756_deepout_linear_trg_emb_500_Adam_0.001/tensorboard_logs ... >>>
[11/08/2020 20:50:44] From /home/pandramish.vinay/.local/lib/python3.5/site-packages/keras/callbacks/tensorboard_v1.py:200: The name tf.summary.merge_all is deprecated. Please use tf.compat.v1.summary.merge_all instead.

[11/08/2020 20:50:44] From /home/pandramish.vinay/.local/lib/python3.5/site-packages/keras/callbacks/tensorboard_v1.py:203: The name tf.summary.FileWriter is deprecated. Please use tf.compat.v1.summary.FileWriter instead.

Epoch 1/15
2020-08-11 20:51:21.244817: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2020-08-11 20:51:36.044298: W tensorflow/core/common_runtime/bfc_allocator.cc:419] Allocator (GPU_0_bfc) ran out of memory trying to allocate 27.4KiB (rounded to 28160).  Current allocation summary follows.
2020-08-11 20:51:36.044578: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (256):   Total Chunks: 243, Chunks in use: 239. 60.8KiB allocated for chunks. 59.8KiB in use in bin. 2.5KiB client-requested in use in bin.
2020-08-11 20:51:36.044605: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (512):   Total Chunks: 13, Chunks in use: 11. 9.2KiB allocated for chunks. 8.2KiB in use in bin. 5.9KiB client-requested in use in bin.
2020-08-11 20:51:36.044629: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (1024):  Total Chunks: 2, Chunks in use: 2. 2.2KiB allocated for chunks. 2.2KiB in use in bin. 1.5KiB client-requested in use in bin.
2020-08-11 20:51:36.044640: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (2048):  Total Chunks: 252, Chunks in use: 251. 632.2KiB allocated for chunks. 630.2KiB in use in bin. 606.3KiB client-requested in use in bin.
2020-08-11 20:51:36.044665: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (4096):  Total Chunks: 58, Chunks in use: 58. 310.2KiB allocated for chunks. 310.2KiB in use in bin. 275.8KiB client-requested in use in bin.
2020-08-11 20:51:36.044674: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (8192):  Total Chunks: 70, Chunks in use: 70. 740.5KiB allocated for chunks. 740.5KiB in use in bin. 719.1KiB client-requested in use in bin.
2020-08-11 20:51:36.044682: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (16384):     Total Chunks: 211, Chunks in use: 211. 3.44MiB allocated for chunks. 3.44MiB in use in bin. 3.30MiB client-requested in use in bin.
2020-08-11 20:51:36.044691: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (32768):     Total Chunks: 1, Chunks in use: 1. 32.2KiB allocated for chunks. 32.2KiB in use in bin. 16.1KiB client-requested in use in bin.
2020-08-11 20:51:36.044700: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (65536):     Total Chunks: 195, Chunks in use: 195. 19.33MiB allocated for chunks. 19.33MiB in use in bin. 19.29MiB client-requested in use in bin.
2020-08-11 20:51:36.044708: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (131072):    Total Chunks: 62, Chunks in use: 62. 11.79MiB allocated for chunks. 11.79MiB in use in bin. 9.23MiB client-requested in use in bin.
2020-08-11 20:51:36.044717: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (262144):    Total Chunks: 2601, Chunks in use: 2601. 1015.97MiB allocated for chunks. 1015.97MiB in use in bin. 1012.12MiB client-requested in use in bin.
2020-08-11 20:51:36.044726: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (524288):    Total Chunks: 422, Chunks in use: 422. 332.47MiB allocated for chunks. 332.47MiB in use in bin. 256.36MiB client-requested in use in bin.
2020-08-11 20:51:36.044750: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (1048576):   Total Chunks: 9, Chunks in use: 9. 13.00MiB allocated for chunks. 13.00MiB in use in bin. 11.03MiB client-requested in use in bin.
2020-08-11 20:51:36.044759: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (2097152):   Total Chunks: 56, Chunks in use: 56. 127.48MiB allocated for chunks. 127.48MiB in use in bin. 126.31MiB client-requested in use in bin.
2020-08-11 20:51:36.044767: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (4194304):   Total Chunks: 122, Chunks in use: 122. 648.19MiB allocated for chunks. 648.19MiB in use in bin. 639.69MiB client-requested in use in bin.
2020-08-11 20:51:36.044775: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (8388608):   Total Chunks: 112, Chunks in use: 112. 1.21GiB allocated for chunks. 1.21GiB in use in bin. 1.18GiB client-requested in use in bin.
2020-08-11 20:51:36.044782: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (16777216):  Total Chunks: 99, Chunks in use: 99. 2.11GiB allocated for chunks. 2.11GiB in use in bin. 2.02GiB client-requested in use in bin.
2020-08-11 20:51:36.044790: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (33554432):  Total Chunks: 2, Chunks in use: 2. 69.81MiB allocated for chunks. 69.81MiB in use in bin. 47.76MiB client-requested in use in bin.
2020-08-11 20:51:36.044798: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (67108864):  Total Chunks: 15, Chunks in use: 15. 1.09GiB allocated for chunks. 1.09GiB in use in bin. 1.04GiB client-requested in use in bin.
2020-08-11 20:51:36.044805: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (134217728):     Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2020-08-11 20:51:36.044813: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (268435456):     Total Chunks: 6, Chunks in use: 6. 3.63GiB allocated for chunks. 3.63GiB in use in bin. 3.59GiB client-requested in use in bin.
2020-08-11 20:51:36.044821: I tensorflow/core/common_runtime/bfc_allocator.cc:885] Bin for 27.5KiB was 16.0KiB, Chunk State: 
2020-08-11 20:51:36.044827: I tensorflow/core/common_runtime/bfc_allocator.cc:898] Next region of size 2399508736
2020-08-11 20:51:36.044843: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x14c8b0000000 next 3716 of size 964619008
2020-08-11 20:51:36.044849: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x14c8e97eeb00 next 4149 of size 414464
2020-08-11 20:51:36.044855: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x14c8e9853e00 next 4150 of size 414464
2020-08-11 20:51:36.044861: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x14c8e98b9100 next 4151 of size 828672
2020-08-11 20:51:36.044867: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x14c8e9983600 next 4152 of size 24857344
2020-08-11 20:51:36.044873: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x14c8eb138100 next 4153 of size 16640
2020-08-11 20:51:36.044879: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x14c8eb13c200 next 4154 of size 414464
2020-08-11 20:51:36.044884: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x14c8eb1a1500 next 4155 of size 417536
2020-08-11 20:51:36.044890: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x14c8eb207400 next 4156 of size 417536
2020-08-11 20:51:36.044896: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x14c8eb26d300 next 4157 of size 417536
2020-08-11 20:51:36.044902: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x14c8eb2d3200 next 4158 of size 208896
2020-08-11 20:51:36.044908: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x14c8eb306200 next 4159 of size 104448
2020-08-11 20:51:36.044914: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x14c8eb31fa00 next 4160 of size 104448
2020-08-11 20:51:36.044919: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x14c8eb339200 next 4161 of size 104448
2020-08-11 20:51:36.044942: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x14c8eb352a00 next 4162 of size 417536
2020-08-11 20:51:36.044947: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x14c8eb3b8900 next 4163 of size 834816
2020-08-11 20:51:36.044953: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x14c8eb484600 next 4164 of size 25038848
2020-08-11 20:51:36.044959: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x14c8ecc65600 next 4165 of size 16640
2020-08-11 20:51:36.044964: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x14c8ecc69700 next 4166 of size 16640
2020-08-11 20:51:36.044970: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x14c8ecc6d800 next 4167 of size 16640
2020-08-11 20:51:36.044975: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x14c8ecc71900 next 4168 of size 834816
2020-08-11 20:51:36.044981: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x14c8ecd3d600 next 4169 of size 417536
2020-08-11 20:51:36.044986: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x14c8ecda3500 next 4170 of size 417536
2020-08-11 20:51:36.044992: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x14c8ece09400 next 4171 of size 417536
2020-08-11 20:51:36.044998: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x14c8ece6f300 next 4172 of size 417536
2020-08-11 20:51:36.045003: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x14c8eced5200 next 4173 of size 417536
2020-08-11 20:51:36.045009: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x14c8ecf3b100 next 4174 of size 16640
2020-08-11 20:51:36.045014: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x14c8ecf3f200 next 4175 of size 12519424
2020-08-11 20:51:36.045020: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x14c8edb2fa00 next 4176 of size 417536
2020-08-11 20:51:36.045026: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x14c8edb95900 next 4177 of size 2286336
2020-08-11 20:51:36.045031: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x14c8eddc3c00 next 4178 of size 417536
2020-08-11 20:51:36.045037: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x14c8ede29b00 next 4180 of size 417536
2020-08-11 20:51:36.045042: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x14c8ede8fa00 next 4181 of size 12519424
2020-08-11 20:51:36.045048: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x14c8eea80200 next 4182 of size 417536
2020-08-11 20:51:36.045054: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x14c8eeae6100 next 4183 of size 417536
2020-08-11 20:51:36.045059: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x14c8eeb4c000 next 4184 of size 16640
2020-08-11 20:51:36.045065: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x14c8eeb50100 next 4185 of size 417536
2020-08-11 20:51:36.045070: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x14c8eebb6000 next 4186 of size 417536
2020-08-11 20:51:36.045076: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x14c8eec1bf00 next 4187 of size 417536
2020-08-11 20:51:36.045082: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x14c8eec81e00 next 4188 of size 828672
2020-08-11 20:51:36.045087: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x14c8eed4c300 next 4189 of size 414464
2020-08-11 20:51:36.045093: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x14c8eedb1600 next 4190 of size 414464
2020-08-11 20:51:36.045098: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x14c8eee16900 next 4191 of size 414464
2020-08-11 20:51:36.045104: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x14c8eee7bc00 next 4192 of size 414464
2020-08-11 20:51:36.045109: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x14c8eeee0f00 next 4193 of size 414464
2020-08-11 20:51:36.045115: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x14c8eef46200 next 4194 of size 414464
2020-08-11 20:51:36.045121: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x14c8eefab500 next 4195 of size 274176
2020-08-11 20:51:36.045126: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x14c8eefee400 next 4196 of size 414464
2020-08-11 20:51:36.045131: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x14c8ef053700 next 4197 of size 414464
2020-08-11 20:51:36.045153: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x14c8ef0b8a00 next 4198 of size 414464

For other datasets, it simply gets killed (the rest of the output is the same as above):


Epoch 1/15
Killed

lvapeab commented 3 years ago

I see that you are using a batch size of 412. This is probably too large. Try reducing it.
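For instance, a minimal change in config.py (a sketch; BATCH_SIZE is the parameter shown in your dump above):

```python
# In config.py: 412 is very large for this ~103M-parameter model on an
# 11 GB GTX 1080 Ti. Start small and increase until memory runs out.
BATCH_SIZE = 64
```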

VP007-py commented 3 years ago

It works with batch sizes as low as 64. Thank you!

VP007-py commented 3 years ago

The same error persists even now (with batch sizes of 16/32/64 too)... Maybe update keras_wrapper to limit GPU memory growth, as suggested here.

lvapeab commented 3 years ago

keras_wrapper should not modify the behavior of GPU memory growth. If you want to change this behavior, you can set an environment variable (`export TF_FORCE_GPU_ALLOW_GROWTH=true`).
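If you prefer to do it programmatically, a minimal TF 1.x sketch (to be run before the model is built) would be:

```python
# Sketch (TF 1.x + standalone Keras): enable on-demand GPU memory growth,
# equivalent to exporting TF_FORCE_GPU_ALLOW_GROWTH=true.
import tensorflow as tf
from keras import backend as K

config = tf.ConfigProto()
config.gpu_options.allow_growth = True  # grow allocations as needed
K.set_session(tf.Session(config=config))
```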

Regarding your Killed error, it seems to be a problem with main (CPU) memory rather than GPU memory.

VP007-py commented 3 years ago

Hey, I filtered out very long sentences from the dataset (sketched below) and it ran without issues (in an ideal scenario, the system should work for longer sentences). I'll set the environment variable if I run into errors!
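For reference, the filtering is roughly this (a sketch; the input file names follow the TEXT_FILES/SRC_LAN/TRG_LAN settings above, the output names are hypothetical, and 150 matches MAX_INPUT_TEXT_LEN):

```python
# Drop parallel sentence pairs where either side exceeds max_len tokens.
max_len = 150
with open("train.hi") as src, open("train.en") as trg, \
     open("train.filtered.hi", "w") as src_out, \
     open("train.filtered.en", "w") as trg_out:
    for s, t in zip(src, trg):
        if len(s.split()) <= max_len and len(t.split()) <= max_len:
            src_out.write(s)
            trg_out.write(t)
```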