Closed donfour10 closed 4 years ago
I don't know exactly, but my assumption is that the test command doesn't recognize the batch size, because other experiments (with less computational effort) work with the GPU test.
Is that possible?
Will look deeper into it, but can you please try with batch_size: 1 and eval_batch_size: 1?
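For reference, a minimal sketch of where these two settings would go, assuming they belong under the training section of the model definition. The feature names and the helper with_batch_size_one are hypothetical and only for illustration; they are not taken from the reporter's files.

```python
import copy

# Hypothetical model definition; the feature names are placeholders.
model_definition = {
    "input_features": [{"name": "text", "type": "text", "encoder": "bert"}],
    "output_features": [{"name": "categories", "type": "set"}],
}


def with_batch_size_one(definition):
    """Return a copy with training and evaluation batch size set to 1."""
    updated = copy.deepcopy(definition)
    # Assumption: both keys live in the "training" section.
    training = updated.setdefault("training", {})
    training["batch_size"] = 1
    training["eval_batch_size"] = 1
    return updated


small_batch_definition = with_batch_size_one(model_definition)
print(small_batch_definition["training"])
```

The original dictionary is left untouched, so the same base definition can be reused for runs with other batch sizes.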
I'm currently running a training like that! Just to be safe: You mean to set these in the model_definition, right?
Same error and behavior as before. Training completed but test/predict failed with the same error message. Keep me posted if you find anything! Thanks :)
Can you please provide me with a reproducible example? Something I can run myself that breaks. It may be a Python script or a YAML + dataset + command combination. If your data is private you can use the script in data/dataset_synthesyzer.py to obtain synthetic data.
Also, what GPU are you using? What amount of RAM does it have?
GPU:
Data + Model_definition + Commands/Script -> https://drive.google.com/open?id=1Lkf_PxvVli5SjoMJM4ny23mTcxAobE-1
bert model: https://storage.googleapis.com/bert_models/2019_05_30/wwm_uncased_L-24_H-1024_A-16.zip
For the training you can either run the python script or the command in the .txt file.
Probably you have to change the paths to the bert model in the YAML-File (or in python script) and the paths to the data (in commands or python script).
Thanks for looking into it, and I hope the way I'm providing the data and other files is okay for you!
PS: So for me it always breaks off at the test or predict after the training with the error I posted above.
@w4nderlust Hey, I experimented a bit to find a way to predict with the model. I can now predict with a BERT model I trained, but it only works on CPU. So I train on GPU (on CPU it would take weeks :)) and afterwards run the predict/test command or Python file on CPU. Even the prediction takes a few hours on CPU for the amount of data I tried. But I still don't get why training works on GPU while predict/test does not. It shouldn't require more computational effort than training, or am I wrong? Especially since a test is run during training as well. Do you have any suggestions or assumptions for me, or did you find something wrong in the data I provided?
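The workaround described here (train on GPU, predict on CPU) could be sketched roughly as follows. It assumes that hiding the GPUs through the CUDA_VISIBLE_DEVICES environment variable is an acceptable way to force CPU execution. The helper names and paths are hypothetical, and the flags mirror the test command from the issue description without -g, assuming -m is the short form of the model path flag.

```python
import os


def cpu_only_env(base_env=None):
    """Copy the environment and hide all CUDA devices from the child process."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = ""  # empty string: no GPU is visible
    return env


def build_test_command(data_csv, model_path, output_directory, batch_size=1):
    """Build the argv list for a `ludwig test` run without the GPU flag."""
    return [
        "ludwig", "test",
        "--data_csv", data_csv,
        "-m", model_path,
        "-od", output_directory,
        "-bs", str(batch_size),
    ]


# Placeholder paths, not the reporter's real ones.
command = build_test_command("pred_test.csv", "results/model", "results/test")
env = cpu_only_env()
# To actually run it: subprocess.run(command, env=env, check=True)
print(" ".join(command))
```

Building the command and the environment separately keeps the GPU training invocation unchanged and confines the CPU fallback to the prediction step.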
Thanks for the update. Yes there should be no reason why it should work on CPU and not on GPU. I looked at the data and it looked fine. I haven't had the time yet to dig deeper unfortunately (there's a lot going on with Ludwig at the moment). I'll get back to you as soon as I can.
No problem. Thanks for looking into it! :)
@donfour10 a quick update: we are porting the entire Ludwig codebase to TF2, and by doing that we will also entirely change the BERT implementation. So if you don't mind, I'll postpone fixing this until after the porting, as the new implementation would likely already fix it. If it doesn't, I'll make sure to fix it before releasing.
@w4nderlust Thanks for the update! I'm totally okay with that. I have also found a way to use another BERT model on the GPU now. However, I still don't know why it wasn't working properly with the others.
Another update: the new BERT encoder and all the other transformers are now added to master. You can check them out! This issue no longer applies, so I'm closing it, but feel free to open a new one if you find issues with the new ones.
Describe the bug
I have trained a model with this definition on my GPU.
The training works properly, but when I try to test or predict with this model, I get an error:
2020-05-07 14:06:47.078732: I tensorflow/core/common_runtime/bfc_allocator.cc:818] total_region_allocatedbytes: 7972913152 memorylimit: 7972913152 available bytes: 0 curr_region_allocationbytes: 8589934592
2020-05-07 14:06:47.078742: I tensorflow/core/common_runtime/bfc_allocator.cc:824] Stats: Limit: 7972913152 InUse: 7972913152 MaxInUse: 7972913152 NumAllocs: 2388 MaxAllocSize: 125018112
2020-05-07 14:06:47.078800: W tensorflow/core/common_runtime/bfc_allocator.cc:319] ****
Traceback (most recent call last):
File "/home/dustingpu/.local/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1356, in _do_call
return fn(*args)
File "/home/dustingpu/.local/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1341, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/home/dustingpu/.local/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1429, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InternalError: Dst tensor is not initialized.
[[{{node save/RestoreV2}}]]
I know this is a TensorFlow error and that it means my GPU is out of memory, but I think something is not right here: the train command works, and a test is even executed after each training epoch. So I don't get why the test command then says my GPU is out of memory.
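As a quick check of the numbers in the log: the allocator reports its limit as fully in use, which is consistent with the out-of-memory reading. A minimal sketch that parses the stats line (the helper name is hypothetical; the line itself is copied from the log):

```python
import re

# Allocator stats copied from the log output in this report.
stats_line = (
    "Stats: Limit: 7972913152 InUse: 7972913152 MaxInUse: 7972913152 "
    "NumAllocs: 2388 MaxAllocSize: 125018112"
)


def parse_allocator_stats(line):
    """Extract the 'Name: integer' pairs from an allocator stats line."""
    return {name: int(value) for name, value in re.findall(r"(\w+):\s+(\d+)", line)}


stats = parse_allocator_stats(stats_line)
free_bytes = stats["Limit"] - stats["InUse"]
gib = 1024 ** 3
print(
    f"limit: {stats['Limit'] / gib:.2f} GiB, "
    f"in use: {stats['InUse'] / gib:.2f} GiB, "
    f"free: {free_bytes} bytes"
)
```

The largest single allocation (MaxAllocSize) is small compared with the limit, so the log points to the total amount already held by the process and not to one oversized request.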
I even tried to shrink my test/predict dataset down to a single row of input, but the same error message occurs.
To Reproduce
input features: text
encoder: bert (https://storage.googleapis.com/bert_models/2019_05_30/wwm_uncased_L-24_H-1024_A-16.zip)
output feature: set of categories
train cmd (is working): ludwig train --data_csv /analyst/ludwig_experiments/datasets/norm_and_imp_targets/training.csv -mdf /analyst/model_definition.yaml --output_directory /analyst/ludwig_experiments/datasets/norm_and_imp_targets/results --experiment_name bert --gpus 0
test cmd(also tried predict cmd): ludwig test --data_csv /analyst/ludwig_experiments/datasets/norm_and_imp_targets/pred_test.csv --m /analyst/ludwig_experiments/datasets/norm_and_imp_targets/results/bert_run_20/model -od /analyst/ludwig_experiments/datasets/norm_and_imp_targets/results/bert_run_20/test -g 0 -bs 1
If you need a sample of the data, contact me and I'll see if I can provide something to you.
Expected behavior
The test is executed and I can see my predictions and probabilities.
Environment (please complete the following information):
Additional context
I'm adding the lower part of the error message here. I thought the upper part, which I inserted above, was the important part, but maybe I'm wrong.
Thanks in advance!
During handling of the above exception, another exception occurred:
Traceback (most recent call last): File "/home/dustingpu/.local/bin/ludwig", line 10, in
sys.exit(main())
File "/home/dustingpu/.local/lib/python3.7/site-packages/ludwig/cli.py", line 108, in main
CLI()
File "/home/dustingpu/.local/lib/python3.7/site-packages/ludwig/cli.py", line 64, in __init__
getattr(self, args.command)()
File "/home/dustingpu/.local/lib/python3.7/site-packages/ludwig/cli.py", line 89, in test
test_performance.cli(sys.argv[2:])
File "/home/dustingpu/.local/lib/python3.7/site-packages/ludwig/test_performance.py", line 164, in cli
full_predict(**vars(args))
File "/home/dustingpu/.local/lib/python3.7/site-packages/ludwig/predict.py", line 100, in full_predict
debug
File "/home/dustingpu/.local/lib/python3.7/site-packages/ludwig/predict.py", line 182, in predict
gpu_fraction=gpu_fraction
File "/home/dustingpu/.local/lib/python3.7/site-packages/ludwig/models/model.py", line 1236, in predict
self.restore(session, self.weights_save_path)
File "/home/dustingpu/.local/lib/python3.7/site-packages/ludwig/models/model.py", line 1371, in restore
self.saver.restore(session, weights_path)
File "/home/dustingpu/.local/lib/python3.7/site-packages/tensorflow/python/training/saver.py", line 1286, in restore
{self.saver_def.filename_tensor_name: save_path})
File "/home/dustingpu/.local/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 950, in run
run_metadata_ptr)
File "/home/dustingpu/.local/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1173, in _run
feed_dict_tensor, options, run_metadata)
File "/home/dustingpu/.local/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1350, in _do_run
run_metadata)
File "/home/dustingpu/.local/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1370, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: Dst tensor is not initialized.
[[node save/RestoreV2 (defined at /lib/python3.7/site-packages/ludwig/models/model.py:216) ]]
Original stack trace for 'save/RestoreV2': File "/bin/ludwig", line 10, in
sys.exit(main())
File "/lib/python3.7/site-packages/ludwig/cli.py", line 108, in main
CLI()
File "/lib/python3.7/site-packages/ludwig/cli.py", line 64, in __init__
getattr(self, args.command)()
File "/lib/python3.7/site-packages/ludwig/cli.py", line 89, in test
test_performance.cli(sys.argv[2:])
File "/lib/python3.7/site-packages/ludwig/test_performance.py", line 164, in cli
full_predict(**vars(args))
File "/lib/python3.7/site-packages/ludwig/predict.py", line 89, in full_predict
use_horovod=use_horovod)
File "/lib/python3.7/site-packages/ludwig/models/model.py", line 1677, in load_model_and_definition
model = Model.load(model_dir, use_horovod=use_horovod)
File "/lib/python3.7/site-packages/ludwig/models/model.py", line 1380, in load
model = Model(use_horovod=use_horovod, **hyperparameters)
File "/lib/python3.7/site-packages/ludwig/models/model.py", line 113, in __init__
**kwargs
File "/lib/python3.7/site-packages/ludwig/models/model.py", line 216, in build
self.saver = tf.compat.v1.train.Saver()
File "/lib/python3.7/site-packages/tensorflow/python/training/saver.py", line 825, in __init__
self.build()
File "/lib/python3.7/site-packages/tensorflow/python/training/saver.py", line 837, in build
self._build(self._filename, build_save=True, build_restore=True)
File "/lib/python3.7/site-packages/tensorflow/python/training/saver.py", line 875, in _build
build_restore=build_restore)
File "/lib/python3.7/site-packages/tensorflow/python/training/saver.py", line 508, in _build_internal
restore_sequentially, reshape)
File "/lib/python3.7/site-packages/tensorflow/python/training/saver.py", line 328, in _AddRestoreOps
restore_sequentially)
File "/lib/python3.7/site-packages/tensorflow/python/training/saver.py", line 575, in bulk_restore
return io_ops.restore_v2(filename_tensor, names, slices, dtypes)
File "/lib/python3.7/site-packages/tensorflow/python/ops/gen_io_ops.py", line 1696, in restore_v2
name=name)
File "/lib/python3.7/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
op_def=op_def)
File "/lib/python3.7/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 3616, in create_op
op_def=op_def)
File "/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 2005, in __init__
self._traceback = tf_stack.extract_stack()
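One observation from the first traceback: the last Ludwig frame before control passes into TensorFlow is restore in model.py, so the error is raised while the saved weights are being loaded, before any batch is fed. That would explain why neither the batch size nor a single-row dataset changes the outcome. A small sketch for pulling that frame out of a traceback text (the helper name is hypothetical, and the excerpt is shortened from the traceback in this report):

```python
import re

# Matches lines of the form: File "<path>", line <n>, in <function>
FRAME = re.compile(r'File "([^"]+)", line (\d+), in (\w+)')

traceback_excerpt = '''
File "/home/dustingpu/.local/lib/python3.7/site-packages/ludwig/predict.py", line 182, in predict
File "/home/dustingpu/.local/lib/python3.7/site-packages/ludwig/models/model.py", line 1236, in predict
File "/home/dustingpu/.local/lib/python3.7/site-packages/ludwig/models/model.py", line 1371, in restore
File "/home/dustingpu/.local/lib/python3.7/site-packages/tensorflow/python/training/saver.py", line 1286, in restore
'''


def last_frame_in(text, package):
    """Return (path, line, function) of the last frame whose path contains /package/."""
    frames = [m.groups() for m in FRAME.finditer(text) if f"/{package}/" in m.group(1)]
    if not frames:
        return None
    path, line, function = frames[-1]
    return path, int(line), function


print(last_frame_in(traceback_excerpt, "ludwig"))
```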