I am using 7 GPUs for training and keeping the batch size at 256, but in the logs I am seeing the following lines:
```
L2 regularizer value from basic_model: 0
num_replicas_in_sync: 7, batch_size: 6272
Init type by loss function name...
Train arcface...
Init softmax dataset...
```
How is batch_size calculated here?
Also, from my understanding, the lr should be multiplied by the number of GPUs. Do you have any suggestions for the lr?
It's calculated and printed at train.py#L116-L120: it's just `batch_size * strategy.num_replicas_in_sync`, with no other modification, so it shouldn't be wrong.
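The computation amounts to the sketch below, assuming a tf.distribute.MirroredStrategy with one replica per visible GPU; the variable names are illustrative, not copied from train.py:

```python
import tensorflow as tf

# Sketch of how the logged value is derived: the batch_size argument
# is per replica, and the printed batch_size is the global one.
strategy = tf.distribute.MirroredStrategy()  # one replica per visible GPU

per_replica_batch_size = 256  # the value passed as the batch_size argument
global_batch_size = per_replica_batch_size * strategy.num_replicas_in_sync
print("num_replicas_in_sync: %d, batch_size: %d"
      % (strategy.num_replicas_in_sync, global_batch_size))
```

So the logged batch_size is the global batch size across all replicas, not the per-GPU one.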
Yeah, basically the lr should be adjusted according to the batch_size, so in distributed training it should be multiplied by the number of GPUs. My experience with distributed training is very limited though, just 2 GPUs...
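For reference, the usual heuristic is the linear scaling rule from Goyal et al., 2017 ("Accurate, Large Minibatch SGD"): scale the base lr by the same factor the global batch size grew, typically together with a warmup period. A rough sketch with made-up numbers; none of these names come from train.py:

```python
# Linear scaling rule sketch. All values are illustrative assumptions,
# not defaults from train.py.
base_lr = 0.1          # lr tuned for the single-GPU batch size
base_batch_size = 256  # batch size that base_lr was tuned for
num_gpus = 7           # replicas in sync

global_batch_size = base_batch_size * num_gpus
scaled_lr = base_lr * global_batch_size / base_batch_size
print(scaled_lr)  # 0.7, i.e. base_lr * num_gpus
```

In practice the scaled lr is usually ramped up from a small value over the first few epochs (warmup), since starting directly at the full scaled lr can be unstable early in training.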