qqwweee / keras-yolo3

A Keras implementation of YOLOv3 (Tensorflow backend)
MIT License

problem with multi GPU in keras #129

Open pzxdd opened 6 years ago

pzxdd commented 6 years ago

When I try to use model = multi_gpu_model(model, gpus=3) on my data, an error occurs:

tensorflow.python.framework.errors_impl.InvalidArgumentError: Can't concatenate scalars (use tf.stack instead) for 'yolo_loss_1/concat' (op: 'ConcatV2') with input shapes: [], [], [], [].

My environment is tensorflow-gpu 1.8, Keras 2.2.0, Titan Xp. Please help me fix it! Thanks!

tanakataiki commented 6 years ago

Did you use it in the training script?

pzxdd commented 6 years ago

@tanakataiki Yes, I added the one line mentioned above to train.py.

tanakataiki commented 6 years ago

Well, I have the same problem in training, and I am going to work on it when I have time.

pzxdd commented 6 years ago

do you have any clue?

pzxdd commented 6 years ago

@tanakataiki @qqwweee

tanakataiki commented 6 years ago

Hmm... I am not quite sure, but I think we need to decide the input image size and batch size before constructing the network in this case?

FlyEgle commented 6 years ago

You should not use concatenate but add, because the last layer's output is a shape=(1,) tensor and cannot be concatenated. You need to add the outputs from the 3 single GPUs and take their sum.
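
A minimal sketch of the distinction being made here, assuming the Keras TF backend and three hypothetical per-GPU scalar losses (illustrative code, not code from this repo):

    import keras.backend as K

    # three hypothetical per-GPU scalar losses (rank-0 tensors)
    losses = [K.constant(1.0), K.constant(2.0), K.constant(3.0)]

    # K.concatenate(losses, axis=0)  # fails: "Can't concatenate scalars (use tf.stack instead)"
    total = losses[0] + losses[1] + losses[2]  # element-wise addition works on scalars
    # alternatively, expand each loss to shape (1,) first, then concatenation works too
    stacked = K.concatenate([K.expand_dims(l, axis=0) for l in losses], axis=0)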

nakasu commented 6 years ago

Although I don't think it is the best method, it worked this way. Please rewrite it as follows.

https://github.com/qqwweee/keras-yolo3/blob/da7d756b0e47b979e701f0131ba7074ea138add8/yolo3/model.py#L412

return K.expand_dims(loss, axis=0)

https://github.com/qqwweee/keras-yolo3/blob/da7d756b0e47b979e701f0131ba7074ea138add8/train.py#L53-L55

https://github.com/qqwweee/keras-yolo3/blob/da7d756b0e47b979e701f0131ba7074ea138add8/train.py#L73

'yolo_loss': lambda y_true, y_pred: y_pred[0]
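
Putting the two replacements together, a minimal sketch (the helper name and the compile arguments are illustrative, not taken from train.py; it assumes the model.py change above has already been applied):

    from keras.optimizers import Adam
    from keras.utils import multi_gpu_model

    def wrap_for_multi_gpu(model, gpus):
        """Hypothetical helper: wrap the assembled training model for multiple GPUs
        and compile it so the loss picks the first element of the concatenated
        per-GPU losses. Assumes yolo_loss() in yolo3/model.py now ends with
        `return K.expand_dims(loss, axis=0)`, so each replica emits a (1,)-shaped
        loss that multi_gpu_model can concatenate along axis 0."""
        parallel = multi_gpu_model(model, gpus=gpus)
        parallel.compile(optimizer=Adam(lr=1e-3),
                         loss={'yolo_loss': lambda y_true, y_pred: y_pred[0]})
        return parallel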

boyliwensheng commented 6 years ago

I also have this problem, sad...

mazatov commented 6 years ago

I followed your example @nakasu, and it got rid of the error messages. However, I don't really see an improvement in batch speed. Were you able to get significant speed improvements with these? I wonder if I need to change the generator in some way to feed the data appropriately.

This is what I get when I call parallel_model.summary() with gpus=4. It doesn't seem correct, as each input layer is split oddly into 4 slices.

parallel_model.summary()

Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            (None, None, None, 3 0                                            
input_2 (InputLayer)            (None, 25, 25, 3, 6) 0                                            
input_3 (InputLayer)            (None, 50, 50, 3, 6) 0                                            
input_4 (InputLayer)            (None, 100, 100, 3,  0                                            
lambda_1 (Lambda)               (None, None, None, 3 0           input_1[0][0]                    
lambda_2 (Lambda)               (None, 25, 25, 3, 6) 0           input_2[0][0]                    
lambda_3 (Lambda)               (None, 50, 50, 3, 6) 0           input_3[0][0]                    
lambda_4 (Lambda)               (None, 100, 100, 3,  0           input_4[0][0]                    
lambda_5 (Lambda)               (None, None, None, 3 0           input_1[0][0]                    
lambda_6 (Lambda)               (None, 25, 25, 3, 6) 0           input_2[0][0]                    
lambda_7 (Lambda)               (None, 50, 50, 3, 6) 0           input_3[0][0]                    
lambda_8 (Lambda)               (None, 100, 100, 3,  0           input_4[0][0]                    
lambda_9 (Lambda)               (None, None, None, 3 0           input_1[0][0]                    
lambda_10 (Lambda)              (None, 25, 25, 3, 6) 0           input_2[0][0]                    
lambda_11 (Lambda)              (None, 50, 50, 3, 6) 0           input_3[0][0]                    
lambda_12 (Lambda)              (None, 100, 100, 3,  0           input_4[0][0]                    
lambda_13 (Lambda)              (None, None, None, 3 0           input_1[0][0]                    
lambda_14 (Lambda)              (None, 25, 25, 3, 6) 0           input_2[0][0]                    
lambda_15 (Lambda)              (None, 50, 50, 3, 6) 0           input_3[0][0]                    
lambda_16 (Lambda)              (None, 100, 100, 3,  0           input_4[0][0]                    
model_3 (Model)                 (None, 1)            61576342    lambda_1[0][0]                   
                                                                 lambda_2[0][0]                   
                                                                 lambda_3[0][0]                   
                                                                 lambda_4[0][0]                   
                                                                 lambda_5[0][0]                   
                                                                 lambda_6[0][0]                   
                                                                 lambda_7[0][0]                   
                                                                 lambda_8[0][0]                   
                                                                 lambda_9[0][0]                   
                                                                 lambda_10[0][0]                  
                                                                 lambda_11[0][0]                  
                                                                 lambda_12[0][0]                  
                                                                 lambda_13[0][0]                  
                                                                 lambda_14[0][0]                  
                                                                 lambda_15[0][0]                  
                                                                 lambda_16[0][0]                  

yolo_loss (Concatenate)         (None, 1)            0           model_3[1][0]                    
                                                                 model_3[2][0]                    
                                                                 model_3[3][0]                    
                                                                 model_3[4][0]                    
Total params: 61,576,342
Trainable params: 32,310
Non-trainable params: 61,544,032
leirobertshi commented 6 years ago

I followed your example @nakasu, and it got rid of the error messages. However, I don't really see an improvement in batch speed.

Same here. I got 4 GPUs running but no gain in training time.

power630 commented 6 years ago

Adding multi_gpu_model on yolo_body (before yolo_loss) may work:

    model_body = tiny_yolo_body(image_input, num_anchors//2, num_classes)
    if gpus > 1:
        model_body = multi_gpu_model(model_body, gpus=gpus)
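
A rough sketch of how this slots into a create_tiny_model-style builder (the function name, y_true shapes and Lambda arguments here are paraphrased approximations, not the repo's exact code):

    from keras.layers import Input, Lambda
    from keras.models import Model
    from keras.utils import multi_gpu_model
    from yolo3.model import tiny_yolo_body, yolo_loss

    def create_tiny_model_multi_gpu(input_shape, anchors, num_classes, gpus=1):
        h, w = input_shape
        num_anchors = len(anchors)
        image_input = Input(shape=(None, None, 3))
        y_true = [Input(shape=(h // {0: 32, 1: 16}[l], w // {0: 32, 1: 16}[l],
                               num_anchors // 2, num_classes + 5)) for l in range(2)]

        model_body = tiny_yolo_body(image_input, num_anchors // 2, num_classes)
        if gpus > 1:
            # replicate only the body; the single yolo_loss layer below sits on top of it
            model_body = multi_gpu_model(model_body, gpus=gpus)

        model_loss = Lambda(yolo_loss, output_shape=(1,), name='yolo_loss',
                            arguments={'anchors': anchors, 'num_classes': num_classes,
                                       'ignore_thresh': 0.7})([*model_body.output, *y_true])
        return Model([model_body.input, *y_true], model_loss)
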
matt-deboer commented 5 years ago

I can confirm that @nakasu 's solution works (assuming you've applied this)

@leirobertshi , @mazatov did you increase the batch size to account for your number of gpus?

if gpus > 1:
   batch_size *= gpus
mazatov commented 5 years ago

I can confirm that @nakasu 's solution works (assuming you've applied this)

@leirobertshi , @mazatov did you increase the batch size to account for your number of gpus?

if gpus > 1:
   batch_size *= gpus

Thanks, I'll try this out. I don't think I was changing the batch_size!

wangzilu commented 5 years ago

@nakasu Although this method worked, I don't think this solution works correctly. multi_gpu_model concatenates the losses calculated on each sub-batch, and in the model compile step the loss function only selects the first sub-batch's loss and ignores the others.
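
If that concern is valid, one possible workaround (a sketch, not something from this repo) is to average all of the concatenated per-GPU losses in the compile step instead of indexing only the first one:

    import keras.backend as K
    from keras.optimizers import Adam

    def compile_with_mean_loss(parallel_model, lr=1e-3):
        """Hypothetical variant of the compile step: average every per-GPU loss in the
        concatenated 'yolo_loss' output rather than taking only y_pred[0]."""
        parallel_model.compile(optimizer=Adam(lr=lr),
                               loss={'yolo_loss': lambda y_true, y_pred: K.mean(y_pred)})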

wangzilu commented 5 years ago

@FlyEgle I tried your method in Docker, but I don't know what caused the program to get stuck at epoch 1; it gave no response until the pipe timed out.

LiLong1105 commented 5 years ago

When I try to use model = multi_gpu_model(model, gpus=3) on my data, an error occurs:

tensorflow.python.framework.errors_impl.InvalidArgumentError: Can't concatenate scalars (use tf.stack instead) for 'yolo_loss_1/concat' (op: 'ConcatV2') with input shapes: [], [], [], [].

My environment is tensorflow-gpu 1.8, Keras 2.2.0, Titan Xp. Please help me fix it! Thanks!

My environment is tensorflow-gpu 1.8, Keras 2.2.0, NVIDIA V100, but I cannot run on the GPU even though the GPU memory has been taken up. By the way, it can run on the CPU.

So I want to know whether you can run on the GPU (tensorflow-gpu 1.8, Keras 2.2.0).

Thanks!

zhuolyang commented 5 years ago

@power630 Hi, I tried the method you suggested, and it works for me. Thank you!

Adding multi_gpu_model on yolo_body (before yolo_loss) may work:

    model_body = tiny_yolo_body(image_input, num_anchors//2, num_classes)
    if gpus > 1:
        model_body = multi_gpu_model(model_body, gpus=gpus)
power630 commented 5 years ago

@power630 Hi, I tried the method you suggested, and it works for me. Thank you!

Adding multi_gpu_model on yolo_body (before yolo_loss) may work:

    model_body = tiny_yolo_body(image_input, num_anchors//2, num_classes)
    if gpus > 1:
        model_body = multi_gpu_model(model_body, gpus=gpus)

Unfortunately, it only works in the training stage. The final weights cannot be loaded directly by model.load_weights... @zhuolyang

huangbiubiu commented 5 years ago

Adding multi_gpu_model on yolo_body (before yolo_loss) may work:

    model_body = tiny_yolo_body(image_input, num_anchors//2, num_classes)
    if gpus > 1:
        model_body = multi_gpu_model(model_body, gpus=gpus)

Does it mean we should not use multi_gpu_model on yolo_loss?

power630 commented 5 years ago

Adding multi_gpu_model on yolo_body (before yolo_loss) may work:

    model_body = tiny_yolo_body(image_input, num_anchors//2, num_classes)
    if gpus > 1:
        model_body = multi_gpu_model(model_body, gpus=gpus)

Does it mean we should not use multi_gpu_model on yolo_loss?

Yes, the complete network contains multiple bodies and a single loss. It works during training. However, the saved model CANNOT be loaded by model.load_weights directly.

huangbiubiu commented 5 years ago

@power630 Not being able to load the model is a big problem. Is there any way to save and load the model?

power630 commented 5 years ago

@power630 Not being able to load the model is a big problem. Is there any way to save and load the model?

I haven't solved it yet.
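
One commonly used Keras pattern for this situation (a sketch, assuming only yolo_body was wrapped as in power630's snippet above; the helper name and file path are placeholders) is to keep a reference to the single-GPU template and save its weights, since multi_gpu_model shares weights with the model it wraps:

    from keras.layers import Input
    from keras.utils import multi_gpu_model
    from yolo3.model import tiny_yolo_body

    def build_bodies(anchors, num_classes, gpus):
        """Hypothetical helper: return both the single-GPU template body and the
        multi-GPU replica that shares its weights."""
        image_input = Input(shape=(None, None, 3))
        template_body = tiny_yolo_body(image_input, len(anchors) // 2, num_classes)
        parallel_body = multi_gpu_model(template_body, gpus=gpus) if gpus > 1 else template_body
        return template_body, parallel_body

    template_body, parallel_body = build_bodies(anchors, num_classes, gpus=4)  # anchors/num_classes from the training script
    # ... attach the yolo_loss Lambda to parallel_body, compile, and train ...
    # the weights are shared, so save them from the template; that file can later be
    # loaded into a freshly built single-GPU yolo body with model.load_weights()
    template_body.save_weights('trained_weights.h5')  # placeholder path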

ladybirdhui commented 5 years ago

I found this, maybe it helps: https://www.bountysource.com/issues/60494331-the-multi_gpu_model-problem

ladybirdhui commented 5 years ago

https://www.bountysource.com/issues/60494331-the-multi_gpu_model-problem (open it in a browser; clicking it directly doesn't seem to work)

fourth-archive commented 5 years ago

@pzxdd @tanakataiki @nakasu @boyliwensheng @ladybirdhui this YOLOv3 tutorial may help you: https://github.com/ultralytics/yolov3/wiki/Train-Custom-Data

The accompanying repository works on macOS, Windows and Linux, includes multi-GPU and multithreading support, performs inference on images, videos and webcams, and has an iOS app. It also tests to slightly higher mAPs than darknet, including on the latest YOLOv3-SPP.weights (60.7 COCO mAP), and offers the ability to train custom datasets from scratch to darknet performance, all using PyTorch :) https://github.com/ultralytics/yolov3



zhanganguo commented 5 years ago

I have pushed usable code for training YOLOv3 with multi_gpu_model and multiple backbones; please visit https://github.com/anvien/Multi-YOLOv3

InfiniteLife commented 5 years ago

@nakasu Although this method worked, I don't think this solution works correctly. multi_gpu_model concatenates the losses calculated on each sub-batch, and in the model compile step the loss function only selects the first sub-batch's loss and ignores the others.

Why do you think model.compile will select only the first sub-batch's loss? I assumed that from multi_gpu_model we would receive an aggregated and averaged output, or does the aggregation happen in model.compile?