google / automl

Google Brain AutoML
Apache License 2.0

GPU train #29

Closed silence0628 closed 4 years ago

silence0628 commented 4 years ago

When running this command:

python main.py --training_filepattern=/home/hhh/Data/YOLO/VOCdevkit/VOC2007/tfrecords/voc_train* --model_dir=/tmp/efficientnet/ --hparams="use_bfloat16=false" --use_tpu=False

the following error occurs:

ERROR:tensorflow.python.training.basic_session_run_hooks.NanLossDuringTrainingError: NaN loss during training.

What is the reason?

CraigWang1 commented 4 years ago

Just wondering, have you been able to train for epochs or get the training process started?

silence0628 commented 4 years ago

Yes, training has started, and the following information appears

I0324 11:48:15.718143 140160666445632 tpu_estimator.py:2159] global_step/sec: 1.59419
INFO:tensorflow:examples/sec: 6.37677
I0324 11:48:15.718615 140160666445632 tpu_estimator.py:2160] examples/sec: 6.37677
INFO:tensorflow:global_step/sec: 1.60237
I0324 11:48:16.342203 140160666445632 tpu_estimator.py:2159] global_step/sec: 1.60237
INFO:tensorflow:examples/sec: 6.40949
I0324 11:48:16.342702 140160666445632 tpu_estimator.py:2160] examples/sec: 6.40949
INFO:tensorflow:global_step/sec: 1.58549
I0324 11:48:16.972882 140160666445632 tpu_estimator.py:2159] global_step/sec: 1.58549
INFO:tensorflow:examples/sec: 6.34196

But a few minutes later, the NaN loss appeared.

tabsun commented 4 years ago

@silence0628 My training is really slow on a 1080 Ti. What GPU are you using?

INFO:tensorflow:global_step/sec: 0.00882378
I0324 15:34:16.805580 140436817930048 tpu_estimator.py:2307] global_step/sec: 0.00882378
INFO:tensorflow:examples/sec: 0.564722
I0324 15:34:16.805948 140436817930048 tpu_estimator.py:2308] examples/sec: 0.564722
INFO:tensorflow:global_step/sec: 0.0100131
I0324 15:35:56.674830 140436817930048 tpu_estimator.py:2307] global_step/sec: 0.0100131
INFO:tensorflow:examples/sec: 0.640839
I0324 15:35:56.675580 140436817930048 tpu_estimator.py:2308] examples/sec: 0.640839
INFO:tensorflow:global_step/sec: 0.00949146

ancorasir commented 4 years ago

I met the same problem as @silence0628 after 1200 steps when using my own training dataset. The loss in the summary file looks fine and is decreasing before the crash. I'm running the code on my local GPU. I noticed that the learning rate is scheduled with the following parameters. Is it possible that the learning rate is too large?

optimization

h.momentum = 0.9
h.learning_rate = 0.08
h.lr_warmup_init = 0.008
h.lr_warmup_epoch = 1.0
h.first_lr_drop_epoch = 200.0
h.second_lr_drop_epoch = 250.0
h.clip_gradients_norm = 10.0
h.num_epochs = 300
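If the learning rate is the culprit, one quick experiment would be to override it through the same --hparams flag used in the training commands, since learning_rate and lr_warmup_init are ordinary hparams. Just a sketch, e.g. passing this instead of the plain use_bfloat16 override:

--hparams="use_bfloat16=false,learning_rate=0.008,lr_warmup_init=0.0008"

The 0.08 default was presumably tuned for the large default batch size, so a much smaller value may be needed for small GPU batches.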

Byronnar commented 4 years ago

Are you really training on your GPU rather than the CPU? I also ran python main.py --training_filepattern=/home/hhh/Data/YOLO/VOCdevkit/VOC2007/tfrecords/voc_train* --model_dir=/tmp/efficientnet/ --hparams="use_bfloat16=false" --use_tpu=False, but it seems that it will use the CPU instead of the GPU if we set use_tpu to False.
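A quick way to check whether TensorFlow can see the GPU at all (a minimal sketch, assuming a standard tensorflow-gpu install; not specific to this repo):

import tensorflow as tf
# an empty list here means this TF build only sees the CPU
print(tf.config.experimental.list_physical_devices('GPU'))

If the list is non-empty but nvidia-smi shows almost no utilization while main.py runs, the heavy ops are probably still being placed on the CPU.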

tabsun commented 4 years ago

@Byronnar I think so. But as the TF docs say: "TPUEstimator also supports training on CPU and GPU. You don't need to define a separate tf.estimator.Estimator."

And the training only takes up 147M of memory on each GPU. It's really strange.

Byronnar commented 4 years ago

@Byronnar I think so. But as the TF docs say: "TPUEstimator also supports training on CPU and GPU. You don't need to define a separate tf.estimator.Estimator."

And the training only takes up 147M of memory on each GPU. It's really strange.

Thanks for the reply. This is indeed very strange. So it only used 147M; I thought the GPU wasn't being used at all. I'll keep looking into it.

CraigWang1 commented 4 years ago

With the new GitHub updates I've been able to train on a GPU (the load on the GPU increases), but I've been running into this message, so I can't see my loss, epochs, or progress.

W0325 03:20:53.415277 140300720293760 meta_graph.py:436] Issue encountered when serializing edsummaries.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'

Anyone else have the same issue?

mingxingtan commented 4 years ago

@CraigWang1 Oh sorry, I disabled the log info in main.py. You can either add "--logtostderr" or remove the disable line.
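For example, either of these should bring the step/loss INFO lines back (a sketch; the set_verbosity call is the "disable line" in main.py):

python main.py --logtostderr ...   (with the same training flags as before)

or, in main.py:

logging.set_verbosity(logging.INFO)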

silence0628 commented 4 years ago

@tabsun My GPU is an RTX 2080 Ti; it should be faster. Have you finished your training?

tabsun commented 4 years ago

@silence0628 No, it's still training. After I used a single GPU to train as in the new README, the speed is normal.

bitwangdan commented 4 years ago

@silence0628 Hi, have you solved this problem? same error

silence0628 commented 4 years ago

@bitwangdan It's OK now; it can be trained normally. But there's a problem like @CraigWang1's: with the new command the author @mingxingtan suggested, the loss is not shown, as follows:

W0326 08:32:29.445561 139727162947392 meta_graph.py:449] Issue encountered when serializing edsummaries.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'

I don't know the reason.

bitwangdan commented 4 years ago

@ancorasir Hi, have you solved this problem? NaN loss during training

dx111 commented 4 years ago

@ancorasir Hi, have you solved this problem? NaN loss during training.

CraigWang1 commented 4 years ago

@silence0628 After changing logging.set_verbosity(logging.WARNING) to logging.set_verbosity(logging.INFO) in main.py, I was able to see the steps. However, I can't see the loss.

tabsun commented 4 years ago

Why not use TensorBoard to check the loss?
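For example, pointing it at the model_dir from the commands above (assuming a standard TensorBoard install):

tensorboard --logdir=/tmp/efficientnet/

Then open http://localhost:6006 in a browser to watch the loss curves written to the event files.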

watertianyi commented 4 years ago

My CUDA version is 10.1.243. I can run inference with tensorflow-gpu 1.15.0 + efficientdet-d0, but inference with tensorflow-gpu 2.1.0 has problems. May I ask which CUDA version and tensorflow-gpu version this code supports for GPU?

ancorasir commented 4 years ago

@dx9527 @bitwangdan Not yet. I tried lowering the learning rate, and I was able to train efficientdet-1 with my own dataset for 66000 iteration steps before the NaN loss appeared. I'm still working on finding a good learning rate. Any suggestions? @mingxingtan

python main.py --training_file_pattern=./tfrecords/*.tfrecord \
  --model_dir=./output \
  --hparams="use_bfloat16=false,num_classes=104,skip_crowd_during_training = False" \
  --use_tpu=False \
  --backbone_ckpt=./efficientnet-b1 \
  --train_batch_size=8 \
  --num_examples_per_epoch=82990 \
  --num_epochs=15

optimization

h.momentum = 0.9
h.learning_rate = 0.001
h.lr_warmup_init = 0.0001
h.lr_warmup_epoch = 1.0
h.first_lr_drop_epoch = 200.0
h.second_lr_drop_epoch = 250.0

Screenshot from 2020-03-27 09-47-34

mad-fogs commented 4 years ago

I have got the same warning about serializing edsummaries.

sunzhe09 commented 4 years ago

me too

mingxingtan commented 4 years ago

This warning doesn't matter. Due to my laziness, I was using the old way of adding summaries to tf.collections, which causes this warning. I have just submitted a simple change to avoid this warning.

Li505358678 commented 4 years ago

I want to know whether the GPU training code is the same as before. I want to reproduce the results in the paper. The eval results on COCO 2017 val obtained from the provided model are consistent with the paper, but the evaluation results of the model I trained myself are very poor. I don't know the reason.

shenxiaofei715 commented 4 years ago

I worked out how to train on my own data; please see https://github.com/shenxiaofei715/efficientdet.git

elv-xuwen commented 4 years ago

Hi @silence0628, can you tell me how you solved the NaN loss issue you mentioned above? Thank you!

C-SJK commented 4 years ago

@silence0628 No, it's still training. After I used a single GPU to train as in the new README, the speed is normal.

Can I see your config file? My training speed is only examples/sec: 4.65129 on a single Titan XP (with 11651M of GPU memory used). How do you solve this problem? @tabsun

tabsun commented 4 years ago


It's been almost 5 months since I trained my model using this repo. After I used a SINGLE GPU to train, the speed was normal as I remember, but the mAP is not satisfying.