ludwig-ai / ludwig

Low-code framework for building custom LLMs, neural networks, and other AI models
http://ludwig.ai
Apache License 2.0

Dst tensor is not initialized. #705

Closed · donfour10 closed this issue 4 years ago

donfour10 commented 4 years ago

Describe the bug
I have trained a model with this definition on my GPU.

[screenshot of the model definition]

The training works properly, but when I try to test or predict with this model, I get an error:


```
2020-05-07 14:06:47.078732: I tensorflow/core/common_runtime/bfc_allocator.cc:818] total_region_allocated_bytes_: 7972913152 memory_limit_: 7972913152 available bytes: 0 curr_region_allocation_bytes_: 8589934592
2020-05-07 14:06:47.078742: I tensorflow/core/common_runtime/bfc_allocator.cc:824] Stats:
Limit:        7972913152
InUse:        7972913152
MaxInUse:     7972913152
NumAllocs:          2388
MaxAllocSize:  125018112

2020-05-07 14:06:47.078800: W tensorflow/core/common_runtime/bfc_allocator.cc:319] ****
Traceback (most recent call last):
  File "/home/dustingpu/.local/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1356, in _do_call
    return fn(*args)
  File "/home/dustingpu/.local/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1341, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/home/dustingpu/.local/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1429, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InternalError: Dst tensor is not initialized.
	 [[{{node save/RestoreV2}}]]
```


I know this is a TensorFlow error and that it means my GPU is out of memory, but I think something is not right here: the train command works, and after each training epoch a test is even executed. So I don't get why the test then says my GPU runs out of memory.

I even tried to shrink my test/predict dataset and run it with only one row as input, but the same error message occurs.

To Reproduce
input feature: text (encoder: bert, https://storage.googleapis.com/bert_models/2019_05_30/wwm_uncased_L-24_H-1024_A-16.zip)
output feature: set of categories

train cmd (works):
ludwig train --data_csv /analyst/ludwig_experiments/datasets/norm_and_imp_targets/training.csv -mdf /analyst/model_definition.yaml --output_directory /analyst/ludwig_experiments/datasets/norm_and_imp_targets/results --experiment_name bert --gpus 0

test cmd (also tried the predict cmd):
ludwig test --data_csv /analyst/ludwig_experiments/datasets/norm_and_imp_targets/pred_test.csv --m /analyst/ludwig_experiments/datasets/norm_and_imp_targets/results/bert_run_20/model -od /analyst/ludwig_experiments/datasets/norm_and_imp_targets/results/bert_run_20/test -g 0 -bs 1
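For reference, roughly the same flow through the Python API (a minimal sketch against the 0.2.x LudwigModel API from memory; the feature names and the omitted BERT encoder options are placeholders, my real settings are in the YAML/screenshot above):

```python
# Rough sketch of the same train/test flow via the Python API (Ludwig 0.2.x).
# Feature names ("text", "tags") are placeholders; the BERT checkpoint options
# are omitted here and live in my model_definition.yaml.
from ludwig.api import LudwigModel

model_definition = {
    "input_features": [
        {"name": "text", "type": "text", "encoder": "bert"}
        # BERT checkpoint options omitted; see the YAML above
    ],
    "output_features": [{"name": "tags", "type": "set"}],
}

# Training (this part works on the GPU).
model = LudwigModel(model_definition)
model.train(
    data_csv="/analyst/ludwig_experiments/datasets/norm_and_imp_targets/training.csv",
    output_directory="/analyst/ludwig_experiments/datasets/norm_and_imp_targets/results",
)

# Prediction on the saved model (this is where the OOM happens for me).
loaded = LudwigModel.load(
    "/analyst/ludwig_experiments/datasets/norm_and_imp_targets/results/bert_run_20/model"
)
predictions = loaded.predict(
    data_csv="/analyst/ludwig_experiments/datasets/norm_and_imp_targets/pred_test.csv",
    batch_size=1,
)
```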

If you need a sample of the data, contact me and I'll see if I can provide something.

Expected behavior
The test runs and I can see my predictions and probabilities.

Environment (please complete the following information):

Additional context
I'm adding the lower part of the error message here. I thought the upper part, which I inserted above, is the important part, but maybe I'm wrong.

Thanks in advance!


```
During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/dustingpu/.local/bin/ludwig", line 10, in <module>
    sys.exit(main())
  File "/home/dustingpu/.local/lib/python3.7/site-packages/ludwig/cli.py", line 108, in main
    CLI()
  File "/home/dustingpu/.local/lib/python3.7/site-packages/ludwig/cli.py", line 64, in __init__
    getattr(self, args.command)()
  File "/home/dustingpu/.local/lib/python3.7/site-packages/ludwig/cli.py", line 89, in test
    test_performance.cli(sys.argv[2:])
  File "/home/dustingpu/.local/lib/python3.7/site-packages/ludwig/test_performance.py", line 164, in cli
    full_predict(**vars(args))
  File "/home/dustingpu/.local/lib/python3.7/site-packages/ludwig/predict.py", line 100, in full_predict
    debug
  File "/home/dustingpu/.local/lib/python3.7/site-packages/ludwig/predict.py", line 182, in predict
    gpu_fraction=gpu_fraction
  File "/home/dustingpu/.local/lib/python3.7/site-packages/ludwig/models/model.py", line 1236, in predict
    self.restore(session, self.weights_save_path)
  File "/home/dustingpu/.local/lib/python3.7/site-packages/ludwig/models/model.py", line 1371, in restore
    self.saver.restore(session, weights_path)
  File "/home/dustingpu/.local/lib/python3.7/site-packages/tensorflow/python/training/saver.py", line 1286, in restore
    {self.saver_def.filename_tensor_name: save_path})
  File "/home/dustingpu/.local/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 950, in run
    run_metadata_ptr)
  File "/home/dustingpu/.local/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1173, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/dustingpu/.local/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1350, in _do_run
    run_metadata)
  File "/home/dustingpu/.local/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1370, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: Dst tensor is not initialized.
	 [[node save/RestoreV2 (defined at /lib/python3.7/site-packages/ludwig/models/model.py:216) ]]

Original stack trace for 'save/RestoreV2':
  File "/bin/ludwig", line 10, in <module>
    sys.exit(main())
  File "/lib/python3.7/site-packages/ludwig/cli.py", line 108, in main
    CLI()
  File "/lib/python3.7/site-packages/ludwig/cli.py", line 64, in __init__
    getattr(self, args.command)()
  File "/lib/python3.7/site-packages/ludwig/cli.py", line 89, in test
    test_performance.cli(sys.argv[2:])
  File "/lib/python3.7/site-packages/ludwig/test_performance.py", line 164, in cli
    full_predict(**vars(args))
  File "/lib/python3.7/site-packages/ludwig/predict.py", line 89, in full_predict
    use_horovod=use_horovod)
  File "/lib/python3.7/site-packages/ludwig/models/model.py", line 1677, in load_model_and_definition
    model = Model.load(model_dir, use_horovod=use_horovod)
  File "/lib/python3.7/site-packages/ludwig/models/model.py", line 1380, in load
    model = Model(use_horovod=use_horovod, **hyperparameters)
  File "/lib/python3.7/site-packages/ludwig/models/model.py", line 113, in __init__
    **kwargs
  File "/lib/python3.7/site-packages/ludwig/models/model.py", line 216, in build
    self.saver = tf.compat.v1.train.Saver()
  File "/lib/python3.7/site-packages/tensorflow/python/training/saver.py", line 825, in __init__
    self.build()
  File "/lib/python3.7/site-packages/tensorflow/python/training/saver.py", line 837, in build
    self._build(self._filename, build_save=True, build_restore=True)
  File "/lib/python3.7/site-packages/tensorflow/python/training/saver.py", line 875, in _build
    build_restore=build_restore)
  File "/lib/python3.7/site-packages/tensorflow/python/training/saver.py", line 508, in _build_internal
    restore_sequentially, reshape)
  File "/lib/python3.7/site-packages/tensorflow/python/training/saver.py", line 328, in _AddRestoreOps
    restore_sequentially)
  File "/lib/python3.7/site-packages/tensorflow/python/training/saver.py", line 575, in bulk_restore
    return io_ops.restore_v2(filename_tensor, names, slices, dtypes)
  File "/lib/python3.7/site-packages/tensorflow/python/ops/gen_io_ops.py", line 1696, in restore_v2
    name=name)
  File "/lib/python3.7/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "/lib/python3.7/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 3616, in create_op
    op_def=op_def)
  File "/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 2005, in __init__
    self._traceback = tf_stack.extract_stack()
```


donfour10 commented 4 years ago

I don't know exactly, but my assumption is that the test command doesn't recognize the batch size, because other experiments (with less computational effort) do work with the GPU test.

Is that possible?

w4nderlust commented 4 years ago

Will look deeper into it, but can you please try with batch_size: 1 and eval_batch_size: 1?

donfour10 commented 4 years ago

I'm currently running a training like that! Just to be safe: You mean to set these in the model_definition, right?
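Concretely, this is roughly where I put them (shown as the Python-dict form of the YAML, only the relevant fragment; the rest of my definition is unchanged):

```python
# Only the fragment that changed, as the dict form of my model_definition.yaml;
# input/output features are elided here and stay exactly as before.
model_definition = {
    "input_features": [...],   # unchanged
    "output_features": [...],  # unchanged
    "training": {
        "batch_size": 1,
        "eval_batch_size": 1,
    },
}
```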

donfour10 commented 4 years ago

Same error and behavior as before. Training completed but test/predict failed with the same error message. Keep me posted if you find anything! Thanks :)

w4nderlust commented 4 years ago

Can you please provide me a reproducible example? Something I can run myself that breaks. It can be a Python script or a YAML + dataset + command combination. If your data is private, you can use the script in data/dataset_synthesizer.py to obtain synthetic data. Also, what GPU are you using? How much memory does it have?
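If it's easier than the synthesizer script, even something hand-rolled like this would be enough (just a sketch with made-up column names; swap in columns that match your features):

```python
# Quick hand-rolled synthetic dataset (an alternative to the synthesizer script).
# Column names "text" and "tags" are made up; use names matching your features.
import random

import pandas as pd

words = ["alpha", "beta", "gamma", "delta", "epsilon"]
labels = ["cat_a", "cat_b", "cat_c"]

rows = []
for _ in range(100):
    rows.append({
        "text": " ".join(random.choices(words, k=20)),
        # set feature values are space-separated in the CSV by default
        "tags": " ".join(random.sample(labels, k=random.randint(1, 2))),
    })

pd.DataFrame(rows).to_csv("synthetic.csv", index=False)
```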

donfour10 commented 4 years ago

GPU:

[screenshot with GPU details]

Data + Model_definition + Commands/Script -> https://drive.google.com/open?id=1Lkf_PxvVli5SjoMJM4ny23mTcxAobE-1

bert model: https://storage.googleapis.com/bert_models/2019_05_30/wwm_uncased_L-24_H-1024_A-16.zip

For training, you can either run the Python script or the command in the .txt file.

You will probably have to change the paths to the bert model in the YAML file (or in the Python script) and the paths to the data (in the commands or in the Python script).

Thanks for looking into it, and I hope the way I'm providing the data and the other files works for you!

PS: For me it always breaks at the test or predict step after training, with the error I posted above.

donfour10 commented 4 years ago

@w4nderlust Hey, I experimented a bit to see if I could find a way to predict with the model. I can now predict with a BERT model I trained, but it only works on CPU. So I train on GPU, because on CPU it would take weeks :), and afterwards I use a predict/test command or Python script on CPU to predict with that model. Even the prediction takes a few hours on CPU for the amount of data I tried it with. But I still don't get why the training works on GPU while the predict/test does not. Normally it can't take more computational effort than the training, or am I thinking wrong? Especially because the training already includes a test. Do you have any suggestions/assumptions for me, or did you find something wrong in the data I provided?
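For completeness, this is roughly how I force the prediction onto the CPU (just hiding the GPU from TensorFlow via CUDA_VISIBLE_DEVICES, a generic TF workaround rather than a Ludwig-specific option; paths as in my commands above):

```python
# Hide the GPU from TensorFlow so the prediction runs on CPU.
# Generic TF workaround, not a Ludwig-specific option.
import os

os.environ["CUDA_VISIBLE_DEVICES"] = ""  # must be set before TensorFlow is imported

from ludwig.api import LudwigModel

model = LudwigModel.load(
    "/analyst/ludwig_experiments/datasets/norm_and_imp_targets/results/bert_run_20/model"
)
predictions = model.predict(
    data_csv="/analyst/ludwig_experiments/datasets/norm_and_imp_targets/pred_test.csv",
    batch_size=1,
)
print(predictions)
```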

w4nderlust commented 4 years ago

Thanks for the update. Yes, there is no reason why it should work on CPU and not on GPU. I looked at the data and it looks fine. Unfortunately I haven't had the time to dig deeper yet (there's a lot going on with Ludwig at the moment). I'll get back to you as soon as I can.

donfour10 commented 4 years ago

No problem. Thanks for looking into it! :)

w4nderlust commented 4 years ago

@donfour10 a quick update: we are porting the entire Ludwig codebase to TF2, and in doing so we will also entirely change the BERT implementation. So if you don't mind, I'll postpone fixing this until after the porting, as the new implementation will likely already fix it. If it doesn't, I'll make sure to fix it before releasing.

donfour10 commented 4 years ago

@w4nderlust Thanks for the update! I'm totally okay with that. I have also found a way to use another BERT model on the GPU now. However, I still don't know why it wasn't working properly with the others.

w4nderlust commented 4 years ago

Another update: the new BERT encoder and all the other transformer encoders have now been added to master. You can check them out! This issue no longer applies, so I'm closing it, but feel free to open a new one if you run into issues with the new encoders.
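For anyone landing here, a model definition using the new encoders would look roughly along these lines (dict form of the YAML; option names such as pretrained_model_name_or_path may differ slightly, so check the documentation on master for the exact keys):

```python
# Rough sketch of a definition using the new transformer-based text encoders
# on master. The key "pretrained_model_name_or_path" is an assumption here;
# check the docs on master for the actual option names.
model_definition = {
    "input_features": [
        {
            "name": "text",
            "type": "text",
            "encoder": "bert",
            "pretrained_model_name_or_path": "bert-base-uncased",
        }
    ],
    "output_features": [{"name": "tags", "type": "set"}],
}
```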