rishizek / tensorflow-deeplab-v3-plus

DeepLabv3+ built in TensorFlow
MIT License

NanLossDuringTrainingError: NaN loss during training #105

Open Wuxinxiaoshifu opened 2 years ago

Wuxinxiaoshifu commented 2 years ago

INFO:tensorflow:Using config: {'_model_dir': '/home/yzh/v3plus/tensorflow-deeplab-v3-plus/dataset/test2/model/new', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 1000000000.0, '_session_config': allow_soft_placement: true graph_options { rewrite_options { meta_optimizer_iterations: ONE } } , '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7ff57ace0c88>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
INFO:tensorflow:Start training.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
2021-11-02 23:17:44.611893: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2021-11-02 23:17:44.828950: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: NVIDIA GeForce RTX 3090 major: 8 minor: 6 memoryClockRate(GHz): 1.755
pciBusID: 0000:02:00.0
totalMemory: 23.70GiB freeMemory: 23.44GiB
2021-11-02 23:17:44.967547: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 1 with properties:
name: NVIDIA GeForce RTX 3090 major: 8 minor: 6 memoryClockRate(GHz): 1.755
pciBusID: 0000:81:00.0
totalMemory: 23.69GiB freeMemory: 23.27GiB
2021-11-02 23:17:44.967602: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0, 1
2021-11-02 23:21:49.062821: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-11-02 23:21:49.062859: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0 1
2021-11-02 23:21:49.062867: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N N
2021-11-02 23:21:49.062871: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 1: N N
2021-11-02 23:21:49.063044: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 22724 MB memory) -> physical GPU (device: 0, name: NVIDIA GeForce RTX 3090, pci bus id: 0000:02:00.0, compute capability: 8.6)
2021-11-02 23:21:49.063425: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 22555 MB memory) -> physical GPU (device: 1, name: NVIDIA GeForce RTX 3090, pci bus id: 0000:81:00.0, compute capability: 8.6)
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into /home/yzh/v3plus/tensorflow-deeplab-v3-plus/dataset/test2/model/new/model.ckpt.
INFO:tensorflow:cross_entropy = 1.9338539, learning_rate = 0.007, train_mean_iou = 0.014417753, train_px_accuracy = 0.086506516
INFO:tensorflow:loss = 24.278753, step = 0
ERROR:tensorflow:Model diverged with loss = NaN.
Traceback (most recent call last):
  File "train.py", line 285, in <module>
    tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
  File "/home/anaconda3/envs/tf-dpv3plus/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "train.py", line 267, in main
    hooks=train_hooks,
  File "/home/anaconda3/envs/tf-dpv3plus/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 354, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/home/anaconda3/envs/tf-dpv3plus/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 1207, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/home/anaconda3/envs/tf-dpv3plus/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 1241, in _train_model_default
    saving_listeners)
  File "/home/anaconda3/envs/tf-dpv3plus/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 1471, in _train_with_estimatorspec
    _, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
  File "/home/anaconda3/envs/tf-dpv3plus/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 671, in run
    run_metadata=run_metadata)
  File "/home/anaconda3/envs/tf-dpv3plus/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1156, in run
    run_metadata=run_metadata)
  File "/home/anaconda3/envs/tf-dpv3plus/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1255, in run
    raise six.reraise(*original_exc_info)
  File "/home/anaconda3/envs/tf-dpv3plus/lib/python3.6/site-packages/six.py", line 719, in reraise
    raise value
  File "/home/anaconda3/envs/tf-dpv3plus/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1240, in run
    return self._sess.run(*args, **kwargs)
  File "/home/anaconda3/envs/tf-dpv3plus/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1320, in run
    run_metadata=run_metadata))
  File "/home/anaconda3/envs/tf-dpv3plus/lib/python3.6/site-packages/tensorflow/python/training/basic_session_run_hooks.py", line 753, in after_run
    raise NanLossDuringTrainingError
tensorflow.python.training.basic_session_run_hooks.NanLossDuringTrainingError: NaN loss during training.
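
For context on where the exception is raised: the last traceback frame is the `after_run` method in `basic_session_run_hooks.py`, which inspects the loss after each training step and raises when the value is NaN. Below is a rough pure-Python sketch of that kind of check. The `check_loss` helper and its `fail_on_nan_loss` parameter are illustrative stand-ins, not TensorFlow's actual API, and the local exception class only mimics the real one by name. The sketch shows how the error surfaces, not why the loss became NaN in this run.

```python
import math


class NanLossDuringTrainingError(RuntimeError):
    """Local stand-in for the TensorFlow exception of the same name."""

    def __str__(self):
        return "NaN loss during training."


def check_loss(loss, fail_on_nan_loss=True):
    """Hypothetical after-step check: return True if training should stop.

    A finite loss lets training continue. A NaN loss either raises
    (the behaviour seen in the log above) or just requests a stop.
    """
    if math.isnan(loss):
        if fail_on_nan_loss:
            raise NanLossDuringTrainingError
        return True
    return False


# The loss logged at step 0 was finite, so the check passes.
print(check_loss(24.278753))

# A NaN loss on the following step triggers the error.
try:
    check_loss(float("nan"))
except NanLossDuringTrainingError as err:
    print(err)
```

Because the check only looks at the final scalar loss, it cannot say which op first produced the NaN; that needs separate debugging of the inputs, labels, and numerical setup.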