salesforce / ctrl

Conditional Transformer Language Model for Controllable Generation
https://arxiv.org/abs/1909.05858
BSD 3-Clause "New" or "Revised" License

I can do inference with pretrained model but face error when finetuning. #40

Closed LiuYixian closed 5 years ago

LiuYixian commented 5 years ago

Hi, thank you for your paper and model. I can do inference with your pretrained model, but I run into an error when fine-tuning. My environment is as follows:

- TensorFlow 1.14.0 (GPU)
- Python 3.6.8
- one Tesla M40 GPU (24 GB)
- CUDA 10.0

The error is as follows:

2019-10-11 10:23:29.882912: I tensorflow/core/common_runtime/placer.cc:54] report_uninitialized_resources_1/Const: (Const)/job:localhost/replica:0/task:0/device:CPU:0
concat_1/axis: (Const): /job:localhost/replica:0/task:0/device:CPU:0
2019-10-11 10:23:29.882949: I tensorflow/core/common_runtime/placer.cc:54] concat_1/axis: (Const)/job:localhost/replica:0/task:0/device:CPU:0
save/filename/input: (Const): /job:localhost/replica:0/task:0/device:CPU:0
2019-10-11 10:23:29.882980: I tensorflow/core/common_runtime/placer.cc:54] save/filename/input: (Const)/job:localhost/replica:0/task:0/device:CPU:0
save/StringJoin/inputs_1: (Const): /job:localhost/replica:0/task:0/device:CPU:0
2019-10-11 10:23:29.883008: I tensorflow/core/common_runtime/placer.cc:54] save/StringJoin/inputs_1: (Const)/job:localhost/replica:0/task:0/device:CPU:0
save/num_shards: (Const): /job:localhost/replica:0/task:0/device:CPU:0
2019-10-11 10:23:29.883041: I tensorflow/core/common_runtime/placer.cc:54] save/num_shards: (Const)/job:localhost/replica:0/task:0/device:CPU:0
save/ShardedFilename/shard: (Const): /job:localhost/replica:0/task:0/device:CPU:0
2019-10-11 10:23:29.883073: I tensorflow/core/common_runtime/placer.cc:54] save/ShardedFilename/shard: (Const)/job:localhost/replica:0/task:0/device:CPU:0
save/SaveV2/tensor_names: (Const): /job:localhost/replica:0/task:0/device:CPU:0
2019-10-11 10:23:29.883101: I tensorflow/core/common_runtime/placer.cc:54] save/SaveV2/tensor_names: (Const)/job:localhost/replica:0/task:0/device:CPU:0
save/SaveV2/shape_and_slices: (Const): /job:localhost/replica:0/task:0/device:CPU:0
2019-10-11 10:23:29.883130: I tensorflow/core/common_runtime/placer.cc:54] save/SaveV2/shape_and_slices: (Const)/job:localhost/replica:0/task:0/device:CPU:0
save/RestoreV2/tensor_names: (Const): /job:localhost/replica:0/task:0/device:CPU:0
2019-10-11 10:23:29.883163: I tensorflow/core/common_runtime/placer.cc:54] save/RestoreV2/tensor_names: (Const)/job:localhost/replica:0/task:0/device:CPU:0
save/RestoreV2/shape_and_slices: (Const): /job:localhost/replica:0/task:0/device:CPU:0
2019-10-11 10:23:29.883195: I tensorflow/core/common_runtime/placer.cc:54] save/RestoreV2/shape_and_slices: (Const)/job:localhost/replica:0/task:0/device:CPU:0
2019-10-11 10:23:46.755980: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1412] (One-time warning): Not using XLA:CPU for cluster because envvar TF_XLA_FLAGS=--tf_xla_cpu_global_jit was not set. If you want XLA:CPU, either set that envvar, or use experimental_jit_scope to enable XLA:CPU. To confirm that XLA is active, pass --vmodule=xla_compilation_cache=1 (as a proper command-line flag, not via TF_XLA_FLAGS) or set the envvar XLA_FLAGS=--xla_hlo_profile.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
ERROR:tensorflow:Error recorded from training_loop: moby_dick.txt.tfrecords/graph.pbtxt.tmp1a94ee91758c42f88fddd51c196b3dbe; Not a directory
INFO:tensorflow:training_loop marked as finished
WARNING:tensorflow:Reraising captured error
Traceback (most recent call last):
  File "/root/liuyx/ctrl/training_utils/training.py", line 164, in <module>
    estimator_model.train(input_fn=input_fn, steps=args.iterations)
  File "/root/anaconda3/envs/ctrl/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2876, in train
    rendezvous.raise_errors()
  File "/root/anaconda3/envs/ctrl/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/error_handling.py", line 131, in raise_errors
    six.reraise(typ, value, traceback)
  File "/root/anaconda3/envs/ctrl/lib/python3.6/site-packages/six.py", line 693, in reraise
    raise value
  File "/root/anaconda3/envs/ctrl/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2871, in train
    saving_listeners=saving_listeners)
  File "/root/anaconda3/envs/ctrl/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 367, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/root/anaconda3/envs/ctrl/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1158, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/root/anaconda3/envs/ctrl/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1192, in _train_model_default
    saving_listeners)
  File "/root/anaconda3/envs/ctrl/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1480, in _train_with_estimator_spec
    log_step_count_steps=log_step_count_steps) as mon_sess:
  File "/root/anaconda3/envs/ctrl/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 584, in MonitoredTrainingSession
    stop_grace_period_secs=stop_grace_period_secs)
  File "/root/anaconda3/envs/ctrl/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1007, in __init__
    stop_grace_period_secs=stop_grace_period_secs)
  File "/root/anaconda3/envs/ctrl/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 725, in __init__
    self._sess = _RecoverableSession(self._coordinated_creator)
  File "/root/anaconda3/envs/ctrl/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1200, in __init__
    _WrappedSession.__init__(self, self._create_session())
  File "/root/anaconda3/envs/ctrl/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1205, in _create_session
    return self._sess_creator.create_session()
  File "/root/anaconda3/envs/ctrl/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 878, in create_session
    hook.after_create_session(self.tf_sess, self.coord)
  File "/root/anaconda3/envs/ctrl/lib/python3.6/site-packages/tensorflow/python/training/basic_session_run_hooks.py", line 572, in after_create_session
    self._checkpoint_dir, "graph.pbtxt")
  File "/root/anaconda3/envs/ctrl/lib/python3.6/site-packages/tensorflow/python/framework/graph_io.py", line 72, in write_graph
    graph_def, float_format=''))
  File "/root/anaconda3/envs/ctrl/lib/python3.6/site-packages/tensorflow/python/lib/io/file_io.py", line 538, in atomic_write_string_to_file
    write_string_to_file(temp_pathname, contents)
  File "/root/anaconda3/envs/ctrl/lib/python3.6/site-packages/tensorflow/python/lib/io/file_io.py", line 347, in write_string_to_file
    f.write(file_content)
  File "/root/anaconda3/envs/ctrl/lib/python3.6/site-packages/tensorflow/python/lib/io/file_io.py", line 106, in write
    self._prewrite_check()
  File "/root/anaconda3/envs/ctrl/lib/python3.6/site-packages/tensorflow/python/lib/io/file_io.py", line 92, in _prewrite_check
    compat.as_bytes(self.__name), compat.as_bytes(self.__mode))
tensorflow.python.framework.errors_impl.FailedPreconditionError: moby_dick.txt.tfrecords/graph.pbtxt.tmp1a94ee91758c42f88fddd51c196b3dbe; Not a directory

Process finished with exit code 1

Could you help me with it? Thank you.

Yixian

keskarnitish commented 5 years ago

Can you post more details about what file you're fine-tuning on and the command you ran?

This part of your log is a bit odd:

ERROR:tensorflow:Error recorded from training_loop: moby_dick.txt.tfrecords/graph.pbtxt.tmp1a94ee91758c42f88fddd51c196b3dbe; Not a directory
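For context, TensorFlow's Estimator writes graph.pbtxt (and checkpoints) under the model directory, so a "Not a directory" FailedPreconditionError on that path usually means the model directory argument ended up pointing at a regular file, here apparently the tfrecords file itself. A minimal sketch of a pre-flight check (hypothetical helper and path names, not the repo's actual code):

```python
import os

def check_model_dir(model_dir):
    """Fail fast if model_dir cannot hold graph.pbtxt and checkpoints.

    A 'Not a directory' FailedPreconditionError when writing graph.pbtxt
    usually means model_dir points at a regular file (e.g. the tfrecords
    file was passed where a checkpoint directory was expected).
    """
    if os.path.isfile(model_dir):
        raise ValueError("model_dir must be a directory, got a file: %s" % model_dir)
    # Create the directory if it does not exist yet.
    os.makedirs(model_dir, exist_ok=True)
    return os.path.isdir(model_dir)

# Hypothetical directory name for illustration only:
print(check_model_dir("ctrl_finetune_ckpt"))  # → True
```

Passing a file such as moby_dick.txt.tfrecords to this check would raise immediately instead of failing deep inside the training loop.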

LiuYixian commented 5 years ago

The error came from a mistake I made when running the fine-tuning. It is fixed now. But I'm now facing an OOM error, so I'll close this issue and open another one about the OOM.

Thank you.