silence0628 closed this issue 4 years ago.
Just wondering, have you been able to train for epochs or get the training process started?
Yes, training has started, and the following information appears:
I0324 11:48:15.718143 140160666445632 tpu_estimator.py:2159] global_step/sec: 1.59419
INFO:tensorflow:examples/sec: 6.37677
I0324 11:48:15.718615 140160666445632 tpu_estimator.py:2160] examples/sec: 6.37677
INFO:tensorflow:global_step/sec: 1.60237
I0324 11:48:16.342203 140160666445632 tpu_estimator.py:2159] global_step/sec: 1.60237
INFO:tensorflow:examples/sec: 6.40949
I0324 11:48:16.342702 140160666445632 tpu_estimator.py:2160] examples/sec: 6.40949
INFO:tensorflow:global_step/sec: 1.58549
I0324 11:48:16.972882 140160666445632 tpu_estimator.py:2159] global_step/sec: 1.58549
INFO:tensorflow:examples/sec: 6.34196
But a few minutes later, a NaN loss appeared.
@silence0628 My training is really slow on a 1080 Ti. What GPU are you using?
INFO:tensorflow:global_step/sec: 0.00882378
I0324 15:34:16.805580 140436817930048 tpu_estimator.py:2307] global_step/sec: 0.00882378
INFO:tensorflow:examples/sec: 0.564722
I0324 15:34:16.805948 140436817930048 tpu_estimator.py:2308] examples/sec: 0.564722
INFO:tensorflow:global_step/sec: 0.0100131
I0324 15:35:56.674830 140436817930048 tpu_estimator.py:2307] global_step/sec: 0.0100131
INFO:tensorflow:examples/sec: 0.640839
I0324 15:35:56.675580 140436817930048 tpu_estimator.py:2308] examples/sec: 0.640839
INFO:tensorflow:global_step/sec: 0.00949146
I ran into the same problem as @silence0628 after 1200 steps using my own training dataset. The loss in the summary file looks alright and is decreasing before the crash. I'm running the code on my local GPU. I noticed that the learning rate is scheduled using the following parameters. Is it possible that the learning rate is too large?
h.momentum = 0.9
h.learning_rate = 0.08
h.lr_warmup_init = 0.008
h.lr_warmup_epoch = 1.0
h.first_lr_drop_epoch = 200.0
h.second_lr_drop_epoch = 250.0
h.clip_gradients_norm = 10.0
h.num_epochs = 300
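A common rule of thumb (general practice, not something verified for this repo) is to scale the learning rate with the effective batch size: the 0.08 default presumably targets the TPU batch size of 64, so a single-GPU run with train_batch_size=8 might start closer to 0.01, with lr_warmup_init lowered in proportion. A minimal sketch of such an override, assuming these keys can be passed through the --hparams flag in the same way as the use_bfloat16 and num_classes overrides used elsewhere in this thread (paths elided):
python main.py --use_tpu=False --train_batch_size=8 \
  --hparams="use_bfloat16=false,learning_rate=0.01,lr_warmup_init=0.001" \
  --training_file_pattern=... --model_dir=...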
Are you really training with your GPU rather than the CPU? I also ran python main.py --training_filepattern=/home/hhh/Data/YOLO/VOCdevkit/VOC2007/tfrecords/voc_train* --model_dir=/tmp/efficientnet/ --hparams="use_bfloat16=false" --use_tpu=False. But it seems that it will use the CPU instead of the GPU if we set use_tpu to False.
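As a quick sanity check, it may help to confirm that TensorFlow can see the GPU at all, independent of the use_tpu flag; this is a generic TensorFlow 1.x check rather than anything specific to this repo:
python -c "from tensorflow.python.client import device_lib; print(device_lib.list_local_devices())"
If only a CPU device is listed, the issue is the TensorFlow build or the CUDA setup rather than the estimator flags.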
@Byronnar I think so. But as the TF docs say: TPUEstimator also supports training on CPU and GPU. You don't need to define a separate tf.estimator.Estimator.
And the training takes up only 147 MB of memory on each GPU. It's really strange.
Thanks for the reply. This is indeed strange; it turns out only 147 MB was used, and I thought the GPU wasn't being used at all. I'll keep looking into it.
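Another way to tell whether the GPU is actually doing the work, rather than just holding a small 147 MB context, is to watch its utilization while training runs, for example with:
nvidia-smi -l 1
If GPU utilization stays near zero while the step counter advances, the compute is almost certainly running on the CPU.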
With the new GitHub updates I've been able to train on a GPU (the load on the GPU increases), but I've been running into this message, so I can't see my loss, epochs, or progress.
W0325 03:20:53.415277 140300720293760 meta_graph.py:436] Issue encountered when serializing edsummaries.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
Anyone else have the same issue?
@CraigWang1 Oh sorry, I disabled the log info in main.py. You can either add "--logtostderr" or remove the disable line.
@tabsun My GPU is an RTX 2080 Ti; it should be faster. Have you finished your training?
@silence0628 No, it's still training. After I switched to single-GPU training as described in the new README, the speed was normal.
@silence0628 Hi, have you solved this problem? same error
@bitwangdan It's OK now; it can be trained normally. But there's a problem like @CraigWang1's: using the new command from the author @mingxingtan, the loss is not shown, as follows:
W0326 08:32:29.445561 139727162947392 meta_graph.py:449] Issue encountered when serializing edsummaries.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
I don't know the reason.
@ancorasir Hi, have you solved this problem? NaN loss during training
@ancorasir Hi, have you solved this problem? I have the same question about the NaN loss during training.
@silence0628 After changing logging.set_verbosity(logging.WARNING) to logging.set_verbosity(logging.INFO) in main.py, I was able to see the steps. However, I can't see the loss.
Why not use TensorBoard to check the loss?
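For example (assuming the model_dir used earlier in this thread), the summaries written during training can be viewed with:
tensorboard --logdir=/tmp/efficientnet/
and then opening the URL it prints in a browser.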
My CUDA version is 10.1.243. I can run inference with tensorflow-gpu 1.15.0 + EfficientDet-D0, but inference with tensorflow-gpu 2.1.0 has problems. May I ask which CUDA and tensorflow-gpu versions this code supports for GPU?
@dx9527 @bitwangdan Not yet. I tried lowering the learning rate, and I was able to train efficientdet-d1 with my own dataset for 66000 iteration steps before the NaN loss appeared. I'm still working on finding a good learning rate. Any suggestions? @mingxingtan
python main.py --training_file_pattern=./tfrecords/*.tfrecord \
  --model_dir=./output \
  --hparams="use_bfloat16=false,num_classes=104,skip_crowd_during_training = False" \
  --use_tpu=False \
  --backbone_ckpt=./efficientnet-b1 \
  --train_batch_size=8 \
  --num_examples_per_epoch=82990 \
  --num_epochs=15
h.momentum = 0.9
h.learning_rate = 0.001
h.lr_warmup_init = 0.0001
h.lr_warmup_epoch = 1.0
h.first_lr_drop_epoch = 200.0
h.second_lr_drop_epoch = 250.0
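For what it's worth, the defaults quoted earlier in this thread also include clip_gradients_norm = 10.0; tightening it is another commonly tried guard against exploding losses. A hedged variant in the same style, with 5.0 being only a guess rather than a value verified on this dataset:
h.clip_gradients_norm = 5.0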
I have got the same warning about serializing edsummaries.
Me too.
This warning doesn't matter. Due to my laziness, I was using the old way of adding summaries to tf.collections, which causes this warning. I have just submitted a simple change to avoid this warning.
I want to know whether the training code for GPU is the same as before. I want to reproduce the results in the paper. The COCO 2017 val evaluation results obtained from the provided model are consistent with the paper, but the results of the model I trained myself are very poor, and I don't know the reason.
I worked out on my own how to train on your own data; please see https://github.com/shenxiaofei715/efficientdet.git
Hi @silence0628 can you tell me how you solved the NaN loss issue? Thank you!
@tabsun Can I see your config file? My training speed is only examples/sec: 4.65129 on a single Titan XP (with 11651 MB of GPU memory in use). How did you solve this problem?
It was almost 5 months ago that I trained my model using this repo. As I remember, after I switched to a SINGLE GPU for training the speed was normal, but the mAP was not satisfying.
When running this command:
python main.py --training_filepattern=/home/hhh/Data/YOLO/VOCdevkit/VOC2007/tfrecords/voc_train* --model_dir=/tmp/efficientnet/ --hparams="use_bfloat16=false" --use_tpu=False
the following error occurs:
ERROR:tensorflow.python.training.basic_session_run_hooks.NanLossDuringTrainingError: NaN loss during training.
What is the reason?
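Based on the suggestions earlier in this thread (a sketch, not a verified fix), the usual things to try for this NaN are a much lower learning rate and warmup value, a num_classes override matching your dataset (20 below is only the assumed VOC class count), and single-GPU training with a small batch size:
python main.py --training_file_pattern=/home/hhh/Data/YOLO/VOCdevkit/VOC2007/tfrecords/voc_train* \
  --model_dir=/tmp/efficientnet/ \
  --use_tpu=False \
  --train_batch_size=8 \
  --hparams="use_bfloat16=false,num_classes=20,learning_rate=0.001,lr_warmup_init=0.0001"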