tensorflow / models

Models and examples built with TensorFlow

DeepLabv3+ train loss NaN #10774

Closed sxj731533730 closed 2 years ago

sxj731533730 commented 2 years ago

When I train DeepLabv3, I encounter an error. I use a dataset from COCO that includes only the person class.

Below are the preparation steps and the start of the training run.

ubuntu@ubuntu:~$ conda create -n tf1.15 python=3.6
ubuntu@ubuntu:~$ conda activate tf1.15
(tf1.15) ubuntu@ubuntu:~$ git clone https://github.com/tensorflow/models.git

(tf1.15) ubuntu@ubuntu:~$ pip install -i https://pypi.tuna.tsinghua.edu.cn/simple tensorflow-gpu==1.15.0 tensorflow==1.15.0
(tf1.15) ubuntu@ubuntu:~$ python3
Python 3.6.13 |Anaconda, Inc.| (default, Jun 4 2021, 14:25:59)
[GCC 7.5.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
>>> tf.test.is_gpu_available()
True

(tf1.15) ubuntu@ubuntu:~$ git clone https://github.com/wkentaro/labelme.git

ubuntu@ubuntu:~/Downloads/dataset$ tree -L 1
.
├── train
├── trainval
└── val

3 directories, 0 files

The images are 640 pixels wide and 480 pixels high.

(tf1.15) ubuntu@ubuntu:~/labelme/examples/semantic_segmentation$ python3 labelme2voc.py /home/ubuntu/Downloads/dataset/train /home/ubuntu/Downloads/dataset/train_voc --labels labels.txt

(tf1.15) ubuntu@ubuntu:~/labelme/examples/semantic_segmentation$ python3 labelme2voc.py /home/ubuntu/Downloads/dataset/trainval /home/ubuntu/Downloads/dataset/trainval_voc --labels labels.txt
(tf1.15) ubuntu@ubuntu:~/labelme/examples/semantic_segmentation$ python3 labelme2voc.py /home/ubuntu/Downloads/dataset/val /home/ubuntu/Downloads/dataset/val_voc --labels labels.txt

labels.txt:
__ignore__
_background_
person

(tf1.15) ubuntu@ubuntu:~/models/research/deeplab/datasets$ python3 remove_gt_colormap.py --original_gt_folder=/home/ubuntu/Downloads/dataset/train_voc/SegmentationClassPNG --output_dir=/home/ubuntu/Downloads/dataset/train_voc/SegmentationClassRaw
(tf1.15) ubuntu@ubuntu:~/models/research/deeplab/datasets$ python3 remove_gt_colormap.py --original_gt_folder=/home/ubuntu/Downloads/dataset/val_voc/SegmentationClassPNG --output_dir=/home/ubuntu/Downloads/dataset/val_voc/SegmentationClassRaw
(tf1.15) ubuntu@ubuntu:~/models/research/deeplab/datasets$ python3 remove_gt_colormap.py --original_gt_folder=/home/ubuntu/Downloads/dataset/trainval_voc/SegmentationClassPNG --output_dir=/home/ubuntu/Downloads/dataset/trainval_voc/SegmentationClassRaw
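After the colormap is removed, each pixel in SegmentationClassRaw should hold a class index or the ignore label. A minimal sketch of how that could be verified is below. It assumes the pixel values of one label image have already been loaded into a flat sequence of integers (the loading step, for example with an image library, is not shown), and the function name is hypothetical. It is only a sanity check on the prepared data, not a confirmed explanation of the error reported in this thread.

```python
def unexpected_label_values(pixels, num_classes, ignore_label=255):
    """Return the sorted label values that are neither a class index nor ignore_label.

    pixels: any iterable of integer label values from one raw label image.
    num_classes: the num_classes value registered for the dataset.
    """
    allowed = set(range(num_classes)) | {ignore_label}
    return sorted(set(pixels) - allowed)
```

With the values registered in this thread (num_classes=3, ignore_label=255), an empty result means every pixel is 0, 1, 2, or 255. Any other value in the result points at an image worth inspecting.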

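find . -name "*.jpg" > ../trainlist/train.txt
find . -name "*.jpg" > ../vallist/val.txt
find . -name "*.jpg" > ../trainvallist/trainval.txt

Then use text replacement to edit each txt file so that it contains only a list of file names, with no extensions and no folder paths.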
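The list files can also be written directly in the final form (bare file names, one per line), which avoids the manual text replacement. This is a minimal sketch using only the Python standard library. The function name and its parameters are hypothetical, and the paths in the usage note are the ones used in this thread.

```python
from pathlib import Path


def write_split_list(image_dir, list_path, pattern="*.jpg"):
    """Write one image name per line, without folder path or extension.

    Returns the list of names that were written, in sorted order.
    """
    stems = sorted(p.stem for p in Path(image_dir).glob(pattern))
    list_path = Path(list_path)
    # Create the list folder if it does not exist yet.
    list_path.parent.mkdir(parents=True, exist_ok=True)
    list_path.write_text("".join(name + "\n" for name in stems))
    return stems
```

For the train split this would be called as write_split_list("/home/ubuntu/Downloads/dataset/train_voc/JPEGImages", "/home/ubuntu/Downloads/dataset/train_voc/trainlist/train.txt"), and likewise for val and trainval.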

(tf1.15) ubuntu@ubuntu:~/models/research/deeplab/datasets$ python3 build_voc2012_data.py --image_folder=/home/ubuntu/Downloads/dataset/train_voc/JPEGImages --semantic_segmentation_folder=/home/ubuntu/Downloads/dataset/train_voc/SegmentationClassRaw --list_folder=/home/ubuntu/Downloads/dataset/train_voc/trainlist --image_format="jpg" --output_dir=/home/ubuntu/models/research/deeplab/datasets/datasetData

(tf1.15) ubuntu@ubuntu:~/models/research/deeplab/datasets$ python3 build_voc2012_data.py --image_folder=/home/ubuntu/Downloads/dataset/trainval_voc/JPEGImages --semantic_segmentation_folder=/home/ubuntu/Downloads/dataset/trainval_voc/SegmentationClassRaw --list_folder=/home/ubuntu/Downloads/dataset/trainval_voc/trainvallist --image_format="jpg" --output_dir=/home/ubuntu/models/research/deeplab/datasets/datasetData

(tf1.15) ubuntu@ubuntu:~/models/research/deeplab/datasets$ python3 build_voc2012_data.py --image_folder=/home/ubuntu/Downloads/dataset/val_voc/JPEGImages --semantic_segmentation_folder=/home/ubuntu/Downloads/dataset/val_voc/SegmentationClassRaw --list_folder=/home/ubuntu/Downloads/dataset/val_voc/vallist --image_format="jpg" --output_dir=/home/ubuntu/models/research/deeplab/datasets/datasetData

/home/ubuntu/models/research/deeplab/datasets/data_generator.py

_MYDATA_INFORMATION = DatasetDescriptor(
    splits_to_sizes={
        'train': 869,  # training set size
        'trainval': 532,  # training set size
        'val': 140,  # test set size
    },
    num_classes=3,  # ignore + background + Arrow = 3
    ignore_label=255,
)

Line 112:

_DATASETS_INFORMATION = {
    'cityscapes': _CITYSCAPES_INFORMATION,
    'pascal_voc_seg': _PASCAL_VOC_SEG_INFORMATION,
    'ade20k': _ADE20K_INFORMATION,
    'mydata': _MYDATA_INFORMATION,  # add your own dataset
}
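To show how the registered name connects to the --dataset and --train_split flags, here is a self-contained sketch of the lookup. It assumes DatasetDescriptor is a namedtuple with the three fields used above. The helper get_split_size and its error messages are hypothetical and are not taken from data_generator.py.

```python
import collections

# Assumed shape of the descriptor: the three fields used in the snippet above.
DatasetDescriptor = collections.namedtuple(
    "DatasetDescriptor", ["splits_to_sizes", "num_classes", "ignore_label"])

_MYDATA_INFORMATION = DatasetDescriptor(
    splits_to_sizes={"train": 869, "trainval": 532, "val": 140},
    num_classes=3,
    ignore_label=255,
)

# Only the custom dataset is registered in this sketch.
_DATASETS_INFORMATION = {"mydata": _MYDATA_INFORMATION}


def get_split_size(dataset_name, split_name):
    """Return the number of examples declared for a registered split."""
    if dataset_name not in _DATASETS_INFORMATION:
        raise ValueError("Dataset %r is not registered." % dataset_name)
    sizes = _DATASETS_INFORMATION[dataset_name].splits_to_sizes
    if split_name not in sizes:
        raise ValueError(
            "Split %r is not defined for dataset %r." % (split_name, dataset_name))
    return sizes[split_name]
```

Here get_split_size("mydata", "train") returns 869, while a name that was never added to the dictionary raises ValueError. The same applies to a split name that is missing from splits_to_sizes.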

/home/ubuntu/models/research/deeplab/utils/train_utils.py

# Variables that will not be restored.
# exclude_list = ['global_step']
exclude_list = ['global_step', 'logits']
if not initialize_last_layer:
    exclude_list.extend(last_layers)
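The effect of the extended exclude list can be illustrated with a simplified sketch. It assumes that exclusion works by prefix matching on variable names, which is a simplification of what TensorFlow does when restoring a checkpoint. The function and the variable names in the usage note are hypothetical.

```python
def variables_to_restore(variable_names, exclude_list):
    """Keep the names that do not start with any excluded name or scope."""
    return [
        name for name in variable_names
        if not any(name.startswith(prefix) for prefix in exclude_list)
    ]
```

For example, with exclude_list = ['global_step', 'logits'], the names 'global_step' and 'logits/semantic/weights' are dropped and 'MobilenetV2/Conv/weights' is kept. The logits layer is then trained from scratch for the new number of classes instead of being loaded from the PASCAL checkpoint.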

/home/ubuntu/models/research/deeplab/train.py

flags.DEFINE_boolean('initialize_last_layer', False, 'Initialize the last layer.')

flags.DEFINE_boolean('last_layers_contain_logits_only', True, 'Only consider logits as last layers or not.')

/home/ubuntu/models/research/deeplab/utils/get_dataset_colormap.py

_DATASET_NAME = 'mydata'  # added here; same as the registered name

_DATASET_NAME: 3,  # add the number of colormap colors here

def create_dataset_name_label_colormap():
    return np.asarray([
        [165, 42, 42],
        [0, 192, 0],
        [196, 196, 196],
    ])

elif dataset == _DATASET_NAME:  # added here
    return create_dataset_name_label_colormap()
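The colormap has one RGB row per class index, so a label value selects its color by position. The sketch below shows that lookup in plain Python, using the three colors defined above. The function name is hypothetical, and the real utility works on NumPy arrays rather than nested lists.

```python
# One RGB entry per class index, copied from the colormap above.
COLORMAP = [
    (165, 42, 42),
    (0, 192, 0),
    (196, 196, 196),
]


def label_to_color(label_rows, colormap=COLORMAP):
    """Map a 2-D grid of class indices to a 2-D grid of RGB tuples."""
    for row in label_rows:
        for value in row:
            if not 0 <= value < len(colormap):
                raise ValueError("Label value %d has no colormap entry." % value)
    return [[colormap[value] for value in row] for row in label_rows]
```

A label value outside the range 0 to 2 has no row in this colormap, which is why the number of colors registered for the dataset has to match the number of rows returned.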

(tf1.15) ubuntu@ubuntu:~/models/research/deeplab/datasets$ wget -nd -c http://download.tensorflow.org/models/deeplabv3_mnv2_pascal_train_aug_2018_01_29.tar.gz
(tf1.15) ubuntu@ubuntu:~/models/research/deeplab/datasets$ tar -zxvf deeplabv3_mnv2_pascal_train_aug_2018_01_29.tar.gz

(tf1.15) ubuntu@ubuntu:~/models/research$ CUDA_VISIBLE_DEVICES=0 python3 deeplab/train.py --logtostderr --num_clones=2 --training_number_of_steps=3000 --train_split="train" --model_variant="mobilenet_v2" --output_stride=8 --fine_tune_batch_norm=true --label_weights={0,0.1,10} --train_batch_size=2 --train_crop_size="481,641" --dataset="mydata" --tf_initial_checkpoint='/home/ubuntu/models/research/deeplab/datasets/deeplabv3_mnv2_pascal_train_aug/model.ckpt-30000' --train_logdir='/home/ubuntu/models/research/deeplab/datasets/result' --dataset_dir='/home/ubuntu/models/research/deeplab/datasets/datasetData'

But I encounter the following error:

INFO:tensorflow:Recording summary at step 0.
I0910 21:44:20.737598 140448232363776 supervisor.py:1050] Recording summary at step 0.
INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.InvalidArgumentError'>, 2 root error(s) found.
  (0) Invalid argument: Loss is inf or nan. : Tensor had NaN values
     [[node CheckNumerics (defined at /home/ubuntu/miniconda3/envs/tf1.15/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
     [[concat_projection/BatchNorm/gamma/sum_grads/_1241]]
  (1) Invalid argument: Loss is inf or nan. : Tensor had NaN values
     [[node CheckNumerics (defined at /home/ubuntu/miniconda3/envs/tf1.15/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
0 successful operations.
0 derived errors ignored.

Original stack trace for 'CheckNumerics':
  File "deeplab/train.py", line 464, in <module>
    tf.app.run()
  File "/home/ubuntu/miniconda3/envs/tf1.15/lib/python3.6/site-packages/tensorflow_core/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/home/ubuntu/miniconda3/envs/tf1.15/lib/python3.6/site-packages/absl/app.py", line 303, in run
    _run_main(main, args)
  File "/home/ubuntu/miniconda3/envs/tf1.15/lib/python3.6/site-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "deeplab/train.py", line 398, in main
    total_loss = tf.check_numerics(total_loss, 'Loss is inf or nan.')
  File "/home/ubuntu/miniconda3/envs/tf1.15/lib/python3.6/site-packages/tensorflow_core/python/ops/gen_array_ops.py", line 1011, in check_numerics
    "CheckNumerics", tensor=tensor, message=message, name=name)
  File "/home/ubuntu/miniconda3/envs/tf1.15/lib/python3.6/site-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
    op_def=op_def)
  File "/home/ubuntu/miniconda3/envs/tf1.15/lib/python3.6/site-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/tf1.15/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
    attrs, op_def, compute_device)
  File "/home/ubuntu/miniconda3/envs/tf1.15/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
    op_def=op_def)
  File "/home/ubuntu/miniconda3/envs/tf1.15/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 1748, in __init__
    self._traceback = tf_stack.extract_stack()

sushreebarsa commented 2 years ago

@sxj731533730 To expedite troubleshooting, could you please provide the full URL of the repository you are using, along with more details on the issue reported here? Please also make sure to use the latest TF version, as older versions are not actively supported. Thank you!

google-ml-butler[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you.

google-ml-butler[bot] commented 2 years ago

Closing as stale. Please reopen if you'd like to work on this further.


qtyandhasee commented 1 year ago

I have the same problem. How can I solve it?