Longer training time than expected

microsoft / satclip

PyTorch implementation of SatCLIP

MIT License

Hi there,

I'm trying to reproduce the pre-training of SatCLIP on the S100 dataset. In default.yaml, I changed the following:

in_channels to 13 and vision_layer to moco_resnet50, as @konstantinklemmer recommended.
batch_size to 1000. It is 8k in the paper, but when I set it to 8k, I run into RuntimeError: DataLoader worker (pid 2968098) is killed by signal: Killed.
num_workers to 11 (to match the number of iterations in one epoch).
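For context on the Killed signal, here is a rough back-of-the-envelope estimate of how much host RAM the prefetched batches alone might occupy. It is only a sketch under assumptions that are not confirmed anywhere in this issue: 256x256 images stored as float32 after loading, "8k" meaning 8192, and PyTorch's default prefetch_factor of 2 per worker. The helper names are hypothetical.

```python
GIB = 1024 ** 3

def batch_bytes(batch_size, channels, height, width, bytes_per_value=4):
    """Bytes needed to hold one batch of images (float32 by default)."""
    return batch_size * channels * height * width * bytes_per_value

def prefetched_bytes(batch_size, num_workers, prefetch_factor=2,
                     channels=13, height=256, width=256):
    """Bytes held by all batches the DataLoader workers may prefetch at once.

    Assumes each worker keeps `prefetch_factor` full batches in flight
    (the PyTorch default is 2 when num_workers > 0). Image size is assumed.
    """
    in_flight = num_workers * prefetch_factor
    return in_flight * batch_bytes(batch_size, channels, height, width)

for bs in (1000, 8192):
    per_batch = batch_bytes(bs, 13, 256, 256) / GIB
    total = prefetched_bytes(bs, num_workers=11) / GIB
    print(f"batch_size={bs}: {per_batch:.1f} GiB per batch, "
          f"{total:.1f} GiB prefetched with 11 workers")
```

If these assumptions are close to reality, a batch of 8192 with 11 workers would exceed 256GB of RAM before any transforms run, which would be consistent with the worker being killed by the kernel's out-of-memory handler. A batch of 1000 would stay well below that limit.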

I'm also using a single A100 GPU, 11 CPU cores, and up to 256GB RAM.

The problem I'm facing is that one epoch takes a really long time (probably because of loading all the images). My data is stored on an SSD with a decent connection to the A100 tower. One epoch takes approximately 36 min, which is about 6 times longer than what the paper indicates (2 days for 500 epochs on a single A100 GPU, i.e. roughly 5.8 min per epoch). Do you know what might be the problem? May I ask which parameters and machines you used for training with moco_resnet50?
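To make the comparison concrete, here is a small sketch of the arithmetic, plus a helper for timing the data loader on its own, with no model in the loop. The helper name seconds_per_batch is hypothetical. It accepts any iterable, so a PyTorch DataLoader could be passed in place of the stand-in used in the usage note.

```python
import time
from typing import Iterable

def minutes_per_epoch(total_days: float, epochs: int) -> float:
    """Convert a total training time in days into minutes per epoch."""
    return total_days * 24 * 60 / epochs

def seconds_per_batch(loader: Iterable, max_batches: int = 20) -> float:
    """Iterate over `loader` without a model to time data loading alone."""
    start = time.perf_counter()
    seen = 0
    for _ in loader:
        seen += 1
        if seen >= max_batches:
            break
    elapsed = time.perf_counter() - start
    return elapsed / max(seen, 1)

paper = minutes_per_epoch(total_days=2, epochs=500)  # figure reported in the paper
observed = 36.0                                      # what I measure per epoch
slowdown = observed / paper
print(f"paper: {paper:.2f} min/epoch, observed: {observed:.0f} min/epoch, "
      f"slowdown: {slowdown:.2f}x")
```

Calling seconds_per_batch(range(5)) exercises the helper with a stand-in iterable. If the per-batch loading time multiplied by the number of batches per epoch already accounts for most of the 36 min, the bottleneck is likely in data loading rather than on the GPU.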

Kind regards, Elena

Longer training time than expected #15