xxiMiaxx closed this issue 1 year ago
I have come across this issue often. I think the problem is that the dataloader tries to load the whole training dataset into memory, and it doesn't fit. I am using 4x3090 GPUs trying to train ResNet50 on WebFace260M. When I split the data and trained only on the first 500k classes, it worked without problems. There has to be a way to configure the dataloader to lazy-load the data from disk, but I don't have the skills to implement it.
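For reference, a minimal sketch of such a lazy-loading dataset, modeled on insightface's MXFaceDataset style: it keeps only the sample index in memory and decodes one record per `__getitem__`, so RAM use stays flat regardless of dataset size. The paths and transform here are placeholders, not the repo's actual config.

```python
import numbers
import mxnet as mx
import torch
from torch.utils.data import Dataset

class LazyMXFaceDataset(Dataset):
    """Lazily read samples from an MXNet RecordIO pack (train.rec/train.idx)
    instead of loading the whole training set into memory."""
    def __init__(self, rec_path, idx_path, transform=None):
        self.record = mx.recordio.MXIndexedRecordIO(idx_path, rec_path, 'r')
        # record 0 is a header whose label[0] marks the end of the key range
        header, _ = mx.recordio.unpack(self.record.read_idx(0))
        self.keys = list(range(1, int(header.label[0])))
        self.transform = transform

    def __len__(self):
        return len(self.keys)

    def __getitem__(self, i):
        # only this one record is read from disk and decoded
        header, img_bytes = mx.recordio.unpack(self.record.read_idx(self.keys[i]))
        label = header.label
        if not isinstance(label, numbers.Number):
            label = label[0]
        img = mx.image.imdecode(img_bytes).asnumpy()          # HWC, RGB
        img = torch.from_numpy(img).permute(2, 0, 1).float()  # CHW
        if self.transform is not None:
            img = self.transform(img)
        return img, int(label)
```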
Have you found a way to eliminate this? I think this is not about the dataloader but about the classification head: 2M identities -> ~1.1B parameters (according to your log).
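A back-of-envelope check of that head size (assuming a 512-d embedding, the usual AdaFace/ArcFace setting, which is an assumption here):

```python
emb_dim, num_ids = 512, 2_000_000
params = emb_dim * num_ids                 # ~1.02B weights in the FC head alone
weights_gib = params * 4 / 2**30           # fp32 weights: ~3.8 GiB
# with gradients plus an SGD momentum buffer this roughly triples,
# and plain DDP replicates the whole head on every GPU
print(f"{params/1e9:.2f}B params, ~{3 * weights_gib:.1f} GiB per GPU")
```

With activations and the ResNet backbone on top of that, a 24 GB 3090 is easily exhausted, which matches the OOM behavior above.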
Eventually I managed to prevent OOM errors by creating a 60 GB swap file and training on 8xA100 with batch size 128.
> Have you found a way to eliminate this? I think this is not about the dataloader but about the classification head: 2M identities -> ~1.1B parameters (according to your log).
Correct, the large number of identities creates a massive final fully connected layer. I managed to train AdaFace on WebFace42M using this implementation of PartialFC by insightface.
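For readers unfamiliar with it, the core idea of PartialFC is that each rank stores only a shard of the class centers and, per step, computes logits against the batch's positive classes plus a random sample of negatives. A minimal single-rank sketch (not insightface's actual API; the cross-rank softmax and gradient synchronization are omitted):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PartialFCSketch(nn.Module):
    """Each rank owns num_classes / world_size centers; per step it scores
    embeddings against positives plus ~sample_rate of its local centers."""
    def __init__(self, emb_dim, num_classes, world_size, rank, sample_rate=0.1):
        super().__init__()
        self.num_local = num_classes // world_size
        self.class_start = rank * self.num_local
        self.sample_rate = sample_rate
        self.weight = nn.Parameter(torch.randn(self.num_local, emb_dim) * 0.01)

    def sample(self, labels):
        # positives that fall in this rank's shard, in local coordinates
        mask = (labels >= self.class_start) & (labels < self.class_start + self.num_local)
        positives = labels[mask] - self.class_start
        num_sample = max(int(self.sample_rate * self.num_local), positives.numel())
        perm = torch.randperm(self.num_local, device=labels.device)[:num_sample]
        idx = torch.unique(torch.cat([positives, perm]))  # positives + sampled negatives
        return self.weight[idx], idx

    def forward(self, embeddings, labels):
        w, idx = self.sample(labels)
        # cosine logits against only the sampled subset, so head memory scales
        # with sample_rate * num_local instead of the full identity count
        logits = F.normalize(embeddings, dim=1) @ F.normalize(w, dim=1).t()
        return logits, idx   # caller remaps labels to positions within idx
```

With sample_rate = 0.1 on 8 GPUs, a 2M-identity head costs each rank logits against roughly 25k of its 250k local centers per step instead of all 2M.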
Can you open-source your method of using PartialFC to train AdaFace? Thank you so much!
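Until that happens, here is a hedged sketch of how the AdaFace margin (following the paper's formulation) can be applied on top of sampled logits like those returned above. The fixed batch statistics stand in for the EMA the real implementation keeps, just to keep this self-contained:

```python
import math
import torch
import torch.nn.functional as F

def adaface_margin_logits(cosine, labels, norms, m=0.4, h=0.333, s=64.0,
                          batch_mean=20.0, batch_std=100.0, eps=1e-3):
    """cosine: [B, C'] logits against sampled centers; labels: [B] column of
    each sample's positive class within those C' columns; norms: [B] feature
    norms before L2-normalization. batch_mean/batch_std stand in for the
    running statistics AdaFace tracks with an EMA."""
    safe_norms = norms.clamp(eps, 100.0)
    # hard samples (low norm) get a negative scaler, easy ones a positive one
    margin_scaler = ((safe_norms - batch_mean) / (batch_std + eps) * h).clamp(-1, 1)
    one_hot = F.one_hot(labels, num_classes=cosine.size(1)).float()

    # angular margin g_angle = -m * margin_scaler, applied to the target angle
    theta = cosine.clamp(-1 + eps, 1 - eps).acos()
    theta_m = (theta - one_hot * (m * margin_scaler).unsqueeze(1)).clamp(eps, math.pi - eps)
    out = theta_m.cos()
    # additive margin g_add = m * margin_scaler + m, subtracted from the target logit
    out = out - one_hot * (m + m * margin_scaler).unsqueeze(1)
    return out * s  # feed to cross-entropy
```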
Could you send me the WebFace42M data or a pretrained model? My email is kakashijin15@gmail.com. Thank you so much!
Dear @mk-minchul, thank you for this amazing work.
I have been trying to reproduce results by training AdaFace on WebFace42M with ResNet100. I'm using 8 x A100 (40 GB) GPUs, but I keep getting OOM (out of memory) errors even though I'm using ddp as the training strategy.

Training parameters:
Batch size = 32
num workers = 8
strategy = ddp
use_mxrecord = True
Here is the training log:
Thank you, Lamia