rossumai / keras-multi-gpu

Multi-GPU data-parallel training in Keras

loss stuck when using multi_gpu #4

Open · burgalon opened this issue 7 years ago

burgalon commented 7 years ago

I'm trying to use make_parallel() with Keras Xception and a generator that yields samples from two classes, with batch_size=2.
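Roughly, my setup looks like this (a simplified sketch; the classification head, optimizer, and generator are placeholders, and I'm assuming the kuza55-style make_parallel(model, gpu_count) signature):

```python
# Sketch of the setup described above; head, optimizer, and generator
# names are placeholders, not the actual training script.
from keras.applications.xception import Xception
from keras.layers import Dense, GlobalAveragePooling2D
from keras.models import Model

from multi_gpu import make_parallel  # make_parallel() from this repo

base = Xception(weights=None, include_top=False, input_shape=(299, 299, 3))
features = GlobalAveragePooling2D()(base.output)
predictions = Dense(2, activation='softmax')(features)  # two classes
model = Model(inputs=base.input, outputs=predictions)

model = make_parallel(model, gpu_count=2)  # replicate across 2 GPUs
model.compile(optimizer='sgd', loss='categorical_crossentropy',
              metrics=['accuracy'])

# train_generator yields (images, one-hot labels) in batches of size 2
model.fit_generator(train_generator, steps_per_epoch=100, epochs=2)
```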

When using one GPU without make_parallel, the model reaches loss=0 and acc=1 within 2 epochs. However, when using make_parallel with gpus=2, the model gets stuck at acc=0.5 with loss=8.0591.

I'm guessing this is somehow related to the loss being aggregated from only one GPU instead of both, but I'm not sure why.
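For context, my understanding of the kuza55-style pattern this repo follows is that each input batch is sliced across the GPUs and the per-tower outputs are concatenated back on the CPU, so a single loss is computed on the merged predictions; with batch_size=2 and gpus=2 each tower would only ever see one sample per step. A simplified sketch of that pattern (illustrative only, not the exact code in this repo):

```python
# Simplified sketch of kuza55-style data parallelism (illustrative,
# not the exact implementation in this repo).
import tensorflow as tf
from keras.layers import Input, Lambda, concatenate
from keras.models import Model

def make_parallel_sketch(model, gpu_count):
    with tf.device('/cpu:0'):
        x = Input(shape=model.input_shape[1:])
    towers = []
    for g in range(gpu_count):
        with tf.device('/gpu:%d' % g):
            def slice_batch(t, parts=gpu_count, idx=g):
                # Each tower gets batch_size // gpu_count samples, so
                # batch_size=2 on 2 GPUs means one sample per tower.
                size = tf.shape(t)[0] // parts
                return t[idx * size:(idx + 1) * size]
            towers.append(model(Lambda(slice_batch)(x)))
    with tf.device('/cpu:0'):
        # Per-tower predictions are concatenated, so the loss is
        # computed once on the merged batch, not averaged per tower.
        merged = concatenate(towers, axis=0)
    return Model(inputs=x, outputs=merged)
```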

When training on 4 classes with batch_size=4, training reaches acc=0.97 after 11 epochs, while a single GPU reaches acc=1 within 2 epochs.

Any ideas?

burgalon commented 7 years ago

also posted here https://github.com/fchollet/keras/issues/8200