Out of system memory when unfreeze all of the layers.

FMsunyh commented 6 years ago

Hi, I have a issue when I unfreeze all of the layers. the memory keeps growing. but the train doesn't start to run. it seen that the train is stopped. And I try to change a smaller batch size. but it still likes this.

dasfaha commented 6 years ago

Same issue.

xugaoxiang commented 6 years ago

` (yolo) longjing@FR:~/Work/yolo3/keras-yolo3$ python train.py Using TensorFlow backend. 2018-06-15 16:07:02.816198: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA Create YOLOv3 model with 9 anchors and 20 classes. /home/longjing/anaconda3/envs/yolo/lib/python3.6/site-packages/keras/engine/topology.py:3473: UserWarning: Skipping loading of weights for layer conv2d_59 due to mismatch in shape ((1, 1, 1024, 75) vs (255, 1024, 1, 1)). weight_values[i].shape)) /home/longjing/anaconda3/envs/yolo/lib/python3.6/site-packages/keras/engine/topology.py:3473: UserWarning: Skipping loading of weights for layer conv2d_59 due to mismatch in shape ((75,) vs (255,)). weight_values[i].shape)) /home/longjing/anaconda3/envs/yolo/lib/python3.6/site-packages/keras/engine/topology.py:3473: UserWarning: Skipping loading of weights for layer conv2d_67 due to mismatch in shape ((1, 1, 512, 75) vs (255, 512, 1, 1)). weight_values[i].shape)) /home/longjing/anaconda3/envs/yolo/lib/python3.6/site-packages/keras/engine/topology.py:3473: UserWarning: Skipping loading of weights for layer conv2d_67 due to mismatch in shape ((75,) vs (255,)). weight_values[i].shape)) /home/longjing/anaconda3/envs/yolo/lib/python3.6/site-packages/keras/engine/topology.py:3473: UserWarning: Skipping loading of weights for layer conv2d_75 due to mismatch in shape ((1, 1, 256, 75) vs (255, 256, 1, 1)). weight_values[i].shape)) /home/longjing/anaconda3/envs/yolo/lib/python3.6/site-packages/keras/engine/topology.py:3473: UserWarning: Skipping loading of weights for layer conv2d_75 due to mismatch in shape ((75,) vs (255,)). weight_values[i].shape)) Load weights model_data/yolo_weights.h5. Freeze the first 249 layers of total 252 layers. Train on 2251 samples, val on 250 samples, with batch size 32. Epoch 1/50

14/70 [=====>........................] - ETA: 1:00:09 - loss: 4176.9674 70/70 [==============================] - 3097s 44s/step - loss: 1155.2374 - val_loss: 152.2559 Epoch 2/50 70/70 [==============================] - 2301s 33s/step - loss: 112.3896 - val_loss: 82.8359 Epoch 3/50 70/70 [==============================] - 2301s 33s/step - loss: 69.7328 - val_loss: 58.9210 Epoch 4/50 70/70 [==============================] - 2295s 33s/step - loss: 51.3632 - val_loss: 44.8716 Epoch 5/50 70/70 [==============================] - 2298s 33s/step - loss: 42.1329 - val_loss: 39.3557 Epoch 6/50 70/70 [==============================] - 2300s 33s/step - loss: 36.1224 - val_loss: 33.6627 Epoch 7/50 70/70 [==============================] - 2296s 33s/step - loss: 32.3504 - val_loss: 30.6207 Epoch 8/50 70/70 [==============================] - 2297s 33s/step - loss: 29.2803 - val_loss: 28.9223 Epoch 9/50 70/70 [==============================] - 2298s 33s/step - loss: 27.4078 - val_loss: 25.2059 Epoch 10/50 70/70 [==============================] - 2295s 33s/step - loss: 26.0083 - val_loss: 24.3438 Epoch 11/50 70/70 [==============================] - 2295s 33s/step - loss: 24.5346 - val_loss: 23.5042 Epoch 12/50 70/70 [==============================] - 2296s 33s/step - loss: 23.6518 - val_loss: 22.3092 Epoch 13/50 70/70 [==============================] - 2298s 33s/step - loss: 22.6562 - val_loss: 21.7520 Epoch 14/50 70/70 [==============================] - 2297s 33s/step - loss: 21.8993 - val_loss: 22.0111 Epoch 15/50 70/70 [==============================] - 2296s 33s/step - loss: 21.3333 - val_loss: 20.7622 Epoch 16/50 70/70 [==============================] - 2295s 33s/step - loss: 20.9301 - val_loss: 21.6414 Epoch 17/50 70/70 [==============================] - 2292s 33s/step - loss: 20.3787 - val_loss: 20.2932 Epoch 18/50 70/70 [==============================] - 2295s 33s/step - loss: 20.0510 - val_loss: 19.9879 Epoch 19/50 70/70 [==============================] - 2298s 33s/step - loss: 19.4801 - val_loss: 18.7927 Epoch 20/50 70/70 [==============================] - 2293s 33s/step - loss: 19.4649 - val_loss: 18.6275 Epoch 21/50 70/70 [==============================] - 2294s 33s/step - loss: 19.1240 - val_loss: 18.8865 Epoch 22/50 70/70 [==============================] - 2295s 33s/step - loss: 18.8103 - val_loss: 18.5175 Epoch 23/50 70/70 [==============================] - 2297s 33s/step - loss: 18.4249 - val_loss: 18.3890 Epoch 24/50 70/70 [==============================] - 2297s 33s/step - loss: 18.0232 - val_loss: 17.8910 Epoch 25/50 70/70 [==============================] - 2295s 33s/step - loss: 18.1161 - val_loss: 17.8068 Epoch 26/50 70/70 [==============================] - 2295s 33s/step - loss: 18.0863 - val_loss: 17.5407 Epoch 27/50 70/70 [==============================] - 2294s 33s/step - loss: 17.5000 - val_loss: 16.9333 Epoch 28/50 70/70 [==============================] - 2294s 33s/step - loss: 17.4861 - val_loss: 17.3210 Epoch 29/50 70/70 [==============================] - 2294s 33s/step - loss: 17.3445 - val_loss: 17.2443 Epoch 30/50 70/70 [==============================] - 2291s 33s/step - loss: 17.1904 - val_loss: 17.0043 Epoch 31/50 70/70 [==============================] - 2290s 33s/step - loss: 16.9701 - val_loss: 16.6228 Epoch 32/50 70/70 [==============================] - 2293s 33s/step - loss: 16.9149 - val_loss: 17.3430 Epoch 33/50 70/70 [==============================] - 2292s 33s/step - loss: 16.4950 - val_loss: 16.4003 Epoch 34/50 70/70 [==============================] - 2290s 33s/step - loss: 16.9319 - val_loss: 17.0047 Epoch 35/50 70/70 [==============================] - 2292s 33s/step - loss: 16.8107 - val_loss: 16.5966 Epoch 36/50 70/70 [==============================] - 2290s 33s/step - loss: 16.5467 - val_loss: 15.9689 Epoch 37/50 70/70 [==============================] - 2291s 33s/step - loss: 16.5207 - val_loss: 15.9476 Epoch 38/50 70/70 [==============================] - 2291s 33s/step - loss: 16.3984 - val_loss: 17.2077 Epoch 39/50 70/70 [==============================] - 2294s 33s/step - loss: 16.2483 - val_loss: 16.6735 Epoch 40/50 70/70 [==============================] - 2291s 33s/step - loss: 16.2678 - val_loss: 15.8414 Epoch 41/50 70/70 [==============================] - 2292s 33s/step - loss: 16.3700 - val_loss: 16.4238 Epoch 42/50 70/70 [==============================] - 2292s 33s/step - loss: 16.1733 - val_loss: 16.3775 Epoch 43/50 70/70 [==============================] - 2293s 33s/step - loss: 15.9314 - val_loss: 15.8632 Epoch 44/50 70/70 [==============================] - 2289s 33s/step - loss: 16.2085 - val_loss: 15.7369 Epoch 45/50 70/70 [==============================] - 2291s 33s/step - loss: 15.8789 - val_loss: 15.2760 Epoch 46/50 70/70 [==============================] - 2289s 33s/step - loss: 16.1046 - val_loss: 16.3972 Epoch 47/50 70/70 [==============================] - 2289s 33s/step - loss: 15.9615 - val_loss: 15.7253 Epoch 48/50 70/70 [==============================] - 2291s 33s/step - loss: 15.8841 - val_loss: 15.5983 Epoch 49/50 70/70 [==============================] - 2293s 33s/step - loss: 15.8978 - val_loss: 15.9049 Epoch 50/50 70/70 [==============================] - 2295s 33s/step - loss: 15.5977 - val_loss: 15.8063 Unfreeze all of the layers. Train on 2251 samples, val on 250 samples, with batch size 32. Epoch 51/100 Killed ` it's the same question? But it seems gpu is not used when i train the VOC dataset.

FlyEgle commented 6 years ago

You could use the multi-gpu to train the unfreeze darknet53 model and use the small_batch avoid of OOM

FMsunyh commented 6 years ago

Set load_pretrained=False;
Use darknet53.weights to fine tuning;
Change a smaller batch size (batch_size=2);
Use 2 gpus to train.

I've tried these options. But it still doesn't work when unfreeze all layers. OOM is the system memory, not gpu's memory. but my system memory has 32G, I think that is enough. Could you give me some ideas? thanks! @FlyEgle @qqwweee

xudezhi123 commented 6 years ago

@FMsunyh Could you tell me how to unfreeze all of the layers? I will be appreciate a lot

jinxie0731 commented 6 years ago

My GPU is NVIDIA GTX 1080Ti(Single), when the batch_size is the origin value(namely 32), I met the same issue, but when I modified it to 10, train.py could accomplished its work without any error.

jinxie0731 commented 6 years ago

I think if you want to unfreeze all of the layers, you can try it like this: when you create the model, you can add this parameter(load_pretrained=False) for the function of create_model(): model = create_model(input_shape, anchors, num_classes, load_pretrained=False, freeze_body=2, weights_path='model_data/yolo_weights.h5')

Brizel commented 6 years ago

I set batch_size = 8 & epoch = 20 to solve this problem. Worked on my computer(1080Ti , 32G internal memory).

FMsunyh commented 6 years ago

@xudezhi123 you can check in the train.py, author trains model with frozen layers first, and then continues training with unfreeze all layers. I hope that can help you.

FMsunyh commented 6 years ago

@jinxie0731 Thanks. I've try this method( set load_pretrained=False ), but same issue with me.

sherlockchou86 commented 6 years ago

setting smaller batch_size will help

xugaoxiang commented 6 years ago

Ubuntu 18.04 64bit, GTX 1070 Ti, 8G, and 32G system memory, training VOC dataset, `

only train 2 classes, car and person, modify model_data/voc_classes.txt
batch_size = 2 & epoch = 20
set load_pretrained=False
` After a few minutes, memory hits 100%. Need help.

GeHongpeng commented 6 years ago

Update your tensorflow version(1.8.0). It works for me!

gittigxuy commented 6 years ago

@xugaoxiang @GeHongpeng ,请问由于内存溢出而导致中断之后，此时会产生很多临时文件，如何利用这些临时文件继续训练？

xugaoxiang commented 6 years ago

@GeHongpeng , Thanks.

Ubuntu 18.04 64bit, GTX 1070 Ti, 8G, and 32G system memory, training VOC dataset

`1 only train 2 classes, car and person, modify model_data/voc_classes.txt

2 batch_size = 2 & epoch = 20

3 set load_pretrained=False

4 update tensorflow to 1.8.0`

The train.py output

` Epoch 00063: ReduceLROnPlateau reducing learning rate to 9.999999747378752e-06. Epoch 64/100 1125/1125 [==============================] - 6684s 6s/step - loss: 12.6273 - val_loss: 12.4967 Epoch 65/100 1125/1125 [==============================] - 6686s 6s/step - loss: 12.4838 - val_loss: 12.1972 Epoch 66/100 1125/1125 [==============================] - 6690s 6s/step - loss: 12.1969 - val_loss: 13.0469

Epoch 00066: ReduceLROnPlateau reducing learning rate to 9.999999747378752e-07. Epoch 67/100 1125/1125 [==============================] - 6687s 6s/step - loss: 12.1996 - val_loss: 12.0781 Epoch 68/100 1125/1125 [==============================] - 6684s 6s/step - loss: 12.3424 - val_loss: 11.9006 Epoch 69/100 1125/1125 [==============================] - 6687s 6s/step - loss: 12.2405 - val_loss: 13.5473 Epoch 70/100 1125/1125 [==============================] - 6690s 6s/step - loss: 12.2212 - val_loss: 10.8682 Epoch 71/100 1125/1125 [==============================] - 6690s 6s/step - loss: 12.3795 - val_loss: 12.4388 Epoch 72/100 1125/1125 [==============================] - 6686s 6s/step - loss: 12.5838 - val_loss: 12.3046 Epoch 73/100 1125/1125 [==============================] - 6688s 6s/step - loss: 12.3020 - val_loss: 11.7841 Epoch 74/100 1125/1125 [==============================] - 6692s 6s/step - loss: 12.2491 - val_loss: 11.7993

Epoch 00074: ReduceLROnPlateau reducing learning rate to 9.999999974752428e-08. Epoch 75/100 1125/1125 [==============================] - 6694s 6s/step - loss: 12.1978 - val_loss: 12.5842 Epoch 76/100 1125/1125 [==============================] - 6694s 6s/step - loss: 12.3493 - val_loss: 12.2501 Epoch 77/100 1125/1125 [==============================] - 6693s 6s/step - loss: 12.4197 - val_loss: 11.5807

Epoch 00077: ReduceLROnPlateau reducing learning rate to 1.0000000116860975e-08. Epoch 78/100 1125/1125 [==============================] - 6686s 6s/step - loss: 12.1625 - val_loss: 11.9507 Epoch 79/100 1125/1125 [==============================] - 6694s 6s/step - loss: 12.0322 - val_loss: 11.9005 Epoch 80/100 1125/1125 [==============================] - 6694s 6s/step - loss: 12.4152 - val_loss: 13.1308

Epoch 00080: ReduceLROnPlateau reducing learning rate to 9.999999939225292e-10. Epoch 00080: early stopping ` It seems that something is wrong, I use the darknet53 to tune, it failed with following error.

(yolo) longjing@FR:~/Work/yolo3/keras-yolo3$ python convert.py darknet53.cfg logs/115/trained_weights_final.h5 model_data/yolo_voc_2.h5 Using TensorFlow backend. Traceback (most recent call last): File "convert.py", line 262, in <module> _main(parser.parse_args()) File "convert.py", line 64, in _main '.weights'), '{} is not a .weights file'.format(weights_path) AssertionError: logs/115/trained_weights_final.h5 is not a .weights file And I rename the .h5 file to .weights, convert success. But failed when run the script yolo.py.

` (yolo) longjing@FR:~/Work/yolo3/keras-yolo3$ python yolo.py Using TensorFlow backend. 2018-07-09 10:49:11.038987: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA Traceback (most recent call last): File "yolo.py", line 218, in detect_img(YOLO()) File "yolo.py", line 33, in init self.boxes, self.scores, self.classes = self.generate() File "yolo.py", line 65, in generate num_anchors/len(self.yolo_model.output) * (num_classes + 5), \ TypeError: object of type 'Tensor' has no len()

`

GeHongpeng commented 6 years ago

@xugaoxiang No problem! If I got it right, maybe you used the convert script in a wrong way. darknet53.cfg just has only darknet53 convolutional layers, so the converted weight did not contain any yolo layers.

I trained the VOC dataset under Ubuntu 16.04 64bit, V100 16G, and 32G system memory. The difference between darknet and keras is still there, but it can detect most of the objects.

COCO dataset may be better.

GeHongpeng commented 6 years ago

@gittigxuy 你所说的临时文件都指哪些文件？

xugaoxiang commented 6 years ago

@GeHongpeng , and how to use the convert.py with the voc trained h5 file? Or should I use the trained_weights_final.h5 directly in yolo.py?

GeHongpeng commented 6 years ago

@xugaoxiang You can use the weight directly in the keras framework, you do not need to convert it. Because it was trained under keras framework.

You can convert the darknet53 weight which you can download from yolov3 website to do warming up in the first 30 or 50 epoches freezing the darknet53 backbone, then unfreeze it for further training.

xugaoxiang commented 6 years ago

@GeHongpeng , thank you.

I keep two classes in voc_classes.txt (car and person, same as classes in voc_annotation.py), but NO box was found using the trained_weights_final.h5. Why?

(yolo) longjing@FR:~/Work/yolo3/keras-yolo3$ python yolo.py Using TensorFlow backend. 2018-07-10 09:58:21.344637: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA logs/115/trained_weights_final.h5 model, anchors, and classes loaded. Input image filename:test_image/3.jpeg (416, 416, 3) Found 0 boxes for img 1.944707546965219 Input image filename:../darknet/data/person.jpg (416, 416, 3) Found 0 boxes for img 0.9323207149282098

GeHongpeng commented 6 years ago

@xugaoxiang How about the "score" parameter in the YOLOPredictor? You can set it to a low value such as 0.1 or lower, check the result.

If still no box was found, maybe your training was not enough or something wrong in the training.

I will do the same training later, and share the result.

xugaoxiang commented 6 years ago

@GeHongpeng , default score is 0.3. Same result when score is set to 0.1 and 0.01. Looking forward to your result.

GeHongpeng commented 6 years ago

@xugaoxiang I have been training voc dataset for about 45 epochs(use darknet53 weight, freeze only the backbone), the loss now is dropping below 10.(your loss seems not good)

After training 50 epochs, I will unfreeze all the layers for further training. I will report the result later.

GeHongpeng commented 6 years ago

@xugaoxiang Hi, My training loss is still dropping. I tested the current weight, here is the result. Not so good, but it can detect the car and the peroson.

voc_car person1 voc_car person2

xugaoxiang commented 6 years ago

@GeHongpeng , Great job, thanks, I'll do the training again.

GeHongpeng commented 6 years ago

@xugaoxiang If you have any questions, please let me know!

xugaoxiang commented 6 years ago

@GeHongpeng , Now , I use the darknet command line to train weigts on VOC dataset, and then convert the weights file to .h5 file. It's OK.

GeHongpeng commented 6 years ago

@xugaoxiang That’s great! You can use this keras version to train it next time!

QuntuamLoop commented 6 years ago

@GeHongpeng could you share your parameters in training, my loss was in 25 when i use voc2007 in training test by yolo-tiny.

GeHongpeng commented 6 years ago

@QuntuamLoop

Use darknet53 weight, freeze only the backbone, batch_size=32, Adam(lr=1e-3), Epochs=50
Unfreeze all layers, batch_size=16(it depends your GPU momery), Adam(lr=1e-4), Epochs=30
Unfreeze all layers, batch_size=16(it depends your GPU momery), SGD(lr=0.003, decay=1e-6, momentum=0.9, nesterov=True, clipnorm=5), Epochs=60

If needed, you can do further training. Such as changing the batch to 8 or 4, changing the SGD learning rate to 0.0003 or 0.00003.

Brizel commented 6 years ago

@GeHongpeng

Could you tell me your computer configuration ? My GPU memery is 12G, and I can only set batch_size = 8 to train VOC dataset. After 43 epochs, the final loss is 15.3.

GeHongpeng commented 6 years ago

@Brizel I used Tesla V100 to train this model. GPU memory is 16G. CPU cores is 8, CPU memory is 32G.

QuntuamLoop commented 6 years ago

thank your shareing, maybe I should ask my boss to update the 1070 card

发件人: GeHongpeng notifications@github.com 发送时间: 2018年7月18日 11:25 收件人: qqwweee/keras-yolo3 抄送: QuntuamLoop; Mention 主题: Re: [qqwweee/keras-yolo3] Out of system memory when unfreeze all of the layers. (#122)

@QuntuamLoophttps://github.com/QuntuamLoop

Use darknet53 weight, freeze only the backbone, batch_size=32, Adam(lr=1e-3), Epochs=50
Unfreeze all layers, batch_size=16(it depends your GPU momery), Adam(lr=1e-4), Epochs=30
Unfreeze all layers, batch_size=16(it depends your GPU momery), SGD(lr=0.003, decay=1e-6, momentum=0.9, nesterov=True, clipnorm=5), Epochs=60

If needed, you can do further training. Such as changing the batch to 8 or 4, changing the SGD learning rate to 0.0003 or 0.00003.

― You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/qqwweee/keras-yolo3/issues/122#issuecomment-405797043, or mute the threadhttps://github.com/notifications/unsubscribe-auth/AeyggPw4iDaVKZODHtJXcIfGVgJzEyCQks5uHqqQgaJpZM4Uuq3h.

Samonsix commented 6 years ago

my tensorflow-gpu==1.5.0 GPU: Tesla P100, Mem：16G System memory:20G has the same question: after i unfreeze all layers, my process killed by system because of out of memory. when i update tensorflow-gpu==1.9.0, it's ok!

ss199302 commented 5 years ago

@GeHongpeng Hello！Did you modify the training method? Did it work better? Use darknet53 weight, freeze only the backbone, batch_size=32, Adam(lr=1e-3), Epochs=50 Unfreeze all layers, batch_size=16(it depends your GPU momery), Adam(lr=1e-4), Epochs=30 Unfreeze all layers, batch_size=16(it depends your GPU momery), SGD(lr=0.003, decay=1e-6, momentum=0.9, nesterov=True, clipnorm=5), Epochs=60

fourth-archive commented 5 years ago

@FMsunyh @dasfaha @xugaoxiang @FlyEgle @xudezhi123 this YOLOv3 tutorial may help you: https://github.com/ultralytics/yolov3/wiki/Train-Custom-Data

The accompanying repository works on MacOS, Windows and Linux, includes multigpu and multithreading, performs inference on images, videos, webcams, and an iOS app. It also tests to slightly higher mAPs than darknet, including on the latest YOLOv3-SPP.weights (60.7 COCO mAP), and offers the ability to train custom datasets from scratch to darknet performance, all using PyTorch :) https://github.com/ultralytics/yolov3

Emma0928 commented 5 years ago

1559140429(1) I want to know why my progress directly skips the unfreeze stage.

remcova commented 5 years ago

I want to know why my progress directly skips the unfreeze stage.

How did you get the metric val_loss working? Could you show a code example? I've tried several times to get the validation loss but unfortunately without any results.

qqwweee / keras-yolo3

Out of system memory when unfreeze all of the layers. #122