Closed guhuawuli closed 4 years ago
You are getting an out-of-memory message. You don't have enough RAM to process this dataset:
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[128,12,256,256] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
What is the minimum amount of RAM needed to run the BERT encoder? I have 13 GB of memory and I can run a BERT model with other code (https://github.com/guanlinchao/bert-dst). Does Ludwig need more memory than ordinary BERT fine-tuning?
@guhuawuli did you run it with the same dataset? Try chunking it.
I found the solution here (https://medium.com/gowombat/first-impressions-about-ubers-ludwig-a-simple-machine-learning-tool-or-not-714962bbbedc): I had to reduce the batch size from 128 to 16.
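For reference, the configuration dump in the original report shows the batch size under the `training` section of the model definition (`'training': {'batch_size': 128, ...}`). The following is a minimal sketch of lowering that value on a plain Python dict. The helper name `with_batch_size` is hypothetical and not part of Ludwig, and the dict is a trimmed-down stand-in for the real model definition.

```python
import copy


def with_batch_size(model_definition, batch_size):
    """Return a copy of a model definition dict with training.batch_size overridden.

    Hypothetical helper for illustration; it only manipulates a dict.
    """
    updated = copy.deepcopy(model_definition)
    # Create the 'training' section if it is missing, then set the batch size.
    updated.setdefault("training", {})["batch_size"] = batch_size
    return updated


# Trimmed-down stand-in for the model definition shown in the report.
model_definition = {
    "input_features": [{"name": "review", "type": "sequence", "encoder": "bert"}],
    "output_features": [{"name": "label", "type": "category"}],
    "training": {"batch_size": 128},
}

smaller = with_batch_size(model_definition, 16)
print(smaller["training"]["batch_size"])
```

In the YAML model definition file, the equivalent change would be a `training` section containing `batch_size: 16`.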
Sorry for the late answer. Yes, the BERT encoder is pretty big and its activations are big too, so depending on your available VRAM / RAM you may need to decrease the batch size to make it run on your system. Thank you for posting the solution to this problem, closing the thread.
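To put rough numbers on this: the tensor in the error message has shape [128, 12, 256, 256], which matches batch size x attention heads x sequence length x sequence length for this configuration (batch size 128, 12 heads, sequence length limit 256). The sketch below computes the size of one such tensor, assuming 4 bytes per element (TensorFlow's `float` is 32-bit). It does not try to estimate total memory use, which also depends on how many of these tensors are alive at once across the 12 layers during training.

```python
def tensor_bytes(shape, bytes_per_element=4):
    """Size in bytes of a dense tensor with the given shape (float32 by default)."""
    count = 1
    for dim in shape:
        count *= dim
    return count * bytes_per_element


MIB = 1024 ** 2

# Shape taken from the error message, with the batch dimension varied.
for batch_size in (128, 16):
    shape = (batch_size, 12, 256, 256)
    print(batch_size, tensor_bytes(shape) / MIB, "MiB")
```

With a batch size of 128, a single attention tensor takes 384 MiB; with 16 it takes 48 MiB. Since every layer produces several tensors of this shape and they are kept for the backward pass, the reduction adds up quickly.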
Describe the bug
I want to do classification with the BERT encoder. My YAML file is:

    input_features:
      - name: review
        type: sequence
        encoder: bert
        config_path:
        checkpoint_path:
        do_lower_case: True
        preprocessing:
          tokenizer: bert
          vocab_file:
          padding_symbol: '[PAD]'
          unknown_symbol: '[UNK]'
    output_features:
      - name: label
        type: category
### The error message is:
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[128,12,256,256] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
Environment: GPU: 1 K80; memory: 13G
The complete error message is:
ludwig_version: '0.2.1'
command: ('/usr/local/bin/ludwig experiment --data_csv ChnSentiCorp_htl_all.csv '
'--model_definition_file model_definition_bert.yaml')
random_seed: 42
input_data: 'ChnSentiCorp_htl_all.csv'
model_definition: { 'combiner': {'type': 'concat'},
'input_features': [ { 'checkpoint_path': 'uncased_L-12_H-768_A-12/bert_model.ckpt',
'config_path': 'uncased_L-12_H-768_A-12/bert_config.json',
'do_lower_case': True,
'encoder': 'bert',
'name': 'review',
'preprocessing': { 'padding_symbol': '[PAD]',
'tokenizer': 'bert',
'unknown_symbol': '[UNK]',
'vocab_file': 'uncased_L-12_H-768_A-12/vocab.txt'},
'tied_weights': None,
'type': 'sequence'}],
'output_features': [ { 'dependencies': [],
'loss': { 'class_similarities_temperature': 0,
'class_weights': 1,
'confidence_penalty': 0,
'distortion': 1,
'labels_smoothing': 0,
'negative_samples': 0,
'robust_lambda': 0,
'sampler': None,
'type': 'softmax_cross_entropy',
'unique': False,
'weight': 1},
'name': 'label',
'reduce_dependencies': 'sum',
'reduce_input': 'sum',
'top_k': 3,
'type': 'category'}],
'preprocessing': { 'audio': { 'audio_feature': {'type': 'raw'},
'audio_file_length_limit_in_s': 7.5,
'in_memory': True,
'missing_value_strategy': 'backfill',
'norm': None,
'padding_value': 0},
'bag': { 'fill_value': '',
'lowercase': False,
'missing_value_strategy': 'fill_with_const',
'most_common': 10000,
'tokenizer': 'space'},
'binary': { 'fill_value': 0,
'missing_value_strategy': 'fill_with_const'},
'category': { 'fill_value': '',
'lowercase': False,
'missing_value_strategy': 'fill_with_const',
'most_common': 10000},
'date': { 'datetime_format': None,
'fill_value': '',
'missing_value_strategy': 'fill_with_const'},
'force_split': False,
'h3': { 'fill_value': 576495936675512319,
'missing_value_strategy': 'fill_with_const'},
'image': { 'in_memory': True,
'missing_value_strategy': 'backfill',
'num_processes': 1,
'resize_method': 'interpolate',
'scaling': 'pixel_normalization'},
'numerical': { 'fill_value': 0,
'missing_value_strategy': 'fill_with_const',
'normalization': None},
'sequence': { 'fill_value': '',
'lowercase': False,
'missing_value_strategy': 'fill_with_const',
'most_common': 20000,
'padding': 'right',
'padding_symbol': '',
'sequence_length_limit': 256,
'tokenizer': 'space',
'unknown_symbol': '',
'vocab_file': None},
'set': { 'fill_value': '',
'lowercase': False,
'missing_value_strategy': 'fill_with_const',
'most_common': 10000,
'tokenizer': 'space'},
'split_probabilities': (0.7, 0.1, 0.2),
'stratify': None,
'text': { 'char_most_common': 70,
'char_sequence_length_limit': 1024,
'char_tokenizer': 'characters',
'char_vocab_file': None,
'fill_value': '',
'lowercase': True,
'missing_value_strategy': 'fill_with_const',
'padding': 'right',
'padding_symbol': '',
'unknown_symbol': '',
'word_most_common': 20000,
'word_sequence_length_limit': 256,
'word_tokenizer': 'space_punct',
'word_vocab_file': None},
'timeseries': { 'fill_value': '',
'missing_value_strategy': 'fill_with_const',
'padding': 'right',
'padding_value': 0,
'timeseries_length_limit': 256,
'tokenizer': 'space'},
'vector': { 'fill_value': '',
'missing_value_strategy': 'fill_with_const'}},
'training': { 'batch_size': 128,
'bucketing_field': None,
'decay': False,
'decay_rate': 0.96,
'decay_steps': 10000,
'dropout_rate': 0.0,
'early_stop': 5,
'epochs': 100,
'eval_batch_size': 0,
'gradient_clipping': None,
'increase_batch_size_on_plateau': 0,
'increase_batch_size_on_plateau_max': 512,
'increase_batch_size_on_plateau_patience': 5,
'increase_batch_size_on_plateau_rate': 2,
'learning_rate': 0.001,
'learning_rate_warmup_epochs': 1,
'optimizer': { 'beta1': 0.9,
'beta2': 0.999,
'epsilon': 1e-08,
'type': 'adam'},
'reduce_learning_rate_on_plateau': 0,
'reduce_learning_rate_on_plateau_patience': 5,
'reduce_learning_rate_on_plateau_rate': 0.5,
'regularization_lambda': 0,
'regularizer': 'l2',
'staircase': False,
'validation_field': 'combined',
'validation_measure': 'loss'}}
Found hdf5 and json with the same filename of the csv, using them instead
Using full hdf5 and json
Loading data from: ChnSentiCorp_htl_all.hdf5
Loading metadata from: ChnSentiCorp_htl_all.json
Training set: 5502
Validation set: 719
Test set: 1545
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/bert/modeling.py:93: The name tf.gfile.GFile is deprecated. Please use tf.io.gfile.GFile instead.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/bert/modeling.py:171: The name tf.variable_scope is deprecated. Please use tf.compat.v1.variable_scope instead.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/bert/modeling.py:409: The name tf.get_variable is deprecated. Please use tf.compat.v1.get_variable instead.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/bert/modeling.py:490: The name tf.assert_less_equal is deprecated. Please use tf.compat.v1.assert_less_equal instead.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/bert/modeling.py:358: calling dropout (from tensorflow.python.ops.nn_ops) with keep_prob is deprecated and will be removed in a future version.
Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/bert/modeling.py:671: dense (from tensorflow.python.layers.core) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.dense instead.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/ludwig/models/modules/sequence_encoders.py:1731: The name tf.trainable_variables is deprecated. Please use tf.compat.v1.trainable_variables instead.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/ludwig/models/modules/sequence_encoders.py:1742: The name tf.train.init_from_checkpoint is deprecated. Please use tf.compat.v1.train.init_from_checkpoint instead.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/ludwig/models/modules/sequence_encoders.py:1749: dropout (from tensorflow.python.layers.core) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.dropout instead.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/init_ops.py:1251: calling VarianceScaling.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/math_grad.py:1205: add_dispatch_support.<locals>.wrapper (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
╒══════════╕
│ TRAINING │
╘══════════╛
2019-12-22 14:37:18.067377: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-12-22 14:37:18.075731: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2199995000 Hz
2019-12-22 14:37:18.076009: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7282d90 executing computations on platform Host. Devices:
2019-12-22 14:37:18.076121: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0):,
2019-12-22 14:37:21.772229: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1412] (One-time warning): Not using XLA:CPU for cluster because envvar TF_XLA_FLAGS=--tf_xla_cpu_global_jit was not set. If you want XLA:CPU, either set that envvar, or use experimental_jit_scope to enable XLA:CPU. To confirm that XLA is active, pass --vmodule=xla_compilation_cache=1 (as a proper command-line flag, not via TF_XLA_FLAGS) or set the envvar XLA_FLAGS=--xla_hlo_profile.
Epoch 1
Training: 0%| | 0/43 [00:00<?, ?it/s]
2019-12-22 14:38:11.186676: W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at cwise_ops_common.cc:70 : Resource exhausted: OOM when allocating tensor with shape[128,12,256,256] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1356, in _do_call
return fn(*args)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1341, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1429, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[128,12,256,256] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
[[{{node review/bert/encoder/layer_2/attention/self/dropout/mul}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/bin/ludwig", line 10, in <module>
sys.exit(main())
File "/usr/local/lib/python3.6/dist-packages/ludwig/cli.py", line 108, in main
CLI()
File "/usr/local/lib/python3.6/dist-packages/ludwig/cli.py", line 64, in __init__
getattr(self, args.command)()
File "/usr/local/lib/python3.6/dist-packages/ludwig/cli.py", line 69, in experiment
experiment.cli(sys.argv[2:])
File "/usr/local/lib/python3.6/dist-packages/ludwig/experiment.py", line 529, in cli
experiment(vars(args))
File "/usr/local/lib/python3.6/dist-packages/ludwig/experiment.py", line 219, in experiment
**kwargs
File "/usr/local/lib/python3.6/dist-packages/ludwig/train.py", line 336, in full_train
debug=debug
File "/usr/local/lib/python3.6/dist-packages/ludwig/train.py", line 502, in train
**model_definition['training']
File "/usr/local/lib/python3.6/dist-packages/ludwig/models/model.py", line 538, in train
is_training=True
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 950, in run
run_metadata_ptr)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1173, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1350, in _do_run
run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1370, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[128,12,256,256] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
[[node review/bert/encoder/layer_2/attention/self/dropout/mul (defined at /lib/python3.6/dist-packages/bert/modeling.py:358) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
Errors may have originated from an input operation. Input Source operations connected to node review/bert/encoder/layer_2/attention/self/dropout/mul: review/bert/encoder/layer_2/attention/self/Softmax (defined at /lib/python3.6/dist-packages/bert/modeling.py:720)
Original stack trace for 'review/bert/encoder/layer_2/attention/self/dropout/mul':
File "/bin/ludwig", line 10, in <module>
sys.exit(main())
File "/lib/python3.6/dist-packages/ludwig/cli.py", line 108, in main
CLI()
File "/lib/python3.6/dist-packages/ludwig/cli.py", line 64, in __init__
getattr(self, args.command)()
File "/lib/python3.6/dist-packages/ludwig/cli.py", line 69, in experiment
experiment.cli(sys.argv[2:])
File "/lib/python3.6/dist-packages/ludwig/experiment.py", line 529, in cli
experiment(vars(args))
File "/lib/python3.6/dist-packages/ludwig/experiment.py", line 219, in experiment
**kwargs
File "/lib/python3.6/dist-packages/ludwig/train.py", line 336, in full_train
debug=debug
File "/lib/python3.6/dist-packages/ludwig/train.py", line 483, in train
debug=debug
File "/lib/python3.6/dist-packages/ludwig/models/model.py", line 113, in __init__
**kwargs
File "/lib/python3.6/dist-packages/ludwig/models/model.py", line 163, in __build
is_training=self.is_training
File "/lib/python3.6/dist-packages/ludwig/models/inputs.py", line 42, in build_inputs
**kwargs)
File "/lib/python3.6/dist-packages/ludwig/models/inputs.py", line 69, in build_single_input
**kwargs)
File "/lib/python3.6/dist-packages/ludwig/features/sequence_feature.py", line 167, in build_input
is_training=is_training
File "/lib/python3.6/dist-packages/ludwig/features/sequence_feature.py", line 182, in build_sequence_input
is_training=is_training
File "/lib/python3.6/dist-packages/ludwig/models/modules/sequence_encoders.py", line 1721, in __call__
token_type_ids=tf.zeros_like(input_sequence),
File "/lib/python3.6/dist-packages/bert/modeling.py", line 216, in __init__
do_return_all_layers=True)
File "/lib/python3.6/dist-packages/bert/modeling.py", line 844, in transformer_model
to_seq_length=seq_length)
File "/lib/python3.6/dist-packages/bert/modeling.py", line 724, in attention_layer
attention_probs = dropout(attention_probs, attention_probs_dropout_prob)
File "/lib/python3.6/dist-packages/bert/modeling.py", line 358, in dropout
output = tf.nn.dropout(input_tensor, 1.0 - dropout_prob)
File "/lib/python3.6/dist-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "/lib/python3.6/dist-packages/tensorflow/python/ops/nn_ops.py", line 4170, in dropout
return dropout_v2(x, rate, noise_shape=noise_shape, seed=seed, name=name)
File "/lib/python3.6/dist-packages/tensorflow/python/ops/nn_ops.py", line 4255, in dropout_v2
ret = x * scale * math_ops.cast(keep_mask, x.dtype)
File "/lib/python3.6/dist-packages/tensorflow/python/ops/math_ops.py", line 884, in binary_op_wrapper
return func(x, y, name=name)
File "/lib/python3.6/dist-packages/tensorflow/python/ops/math_ops.py", line 1180, in _mul_dispatch
return gen_math_ops.mul(x, y, name=name)
File "/lib/python3.6/dist-packages/tensorflow/python/ops/gen_math_ops.py", line 6490, in mul
"Mul", x=x, y=y, name=name)
File "/lib/python3.6/dist-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
op_def=op_def)
File "/lib/python3.6/dist-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 3616, in create_op
op_def=op_def)
File "/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 2005, in __init__
self._traceback = tf_stack.extract_stack()
Training: 0%| | 0/43 [00:45<?, ?it/s]