Hi there,
I'm trying to reproduce the pre-training of SatCLIP on the S100 dataset. In default.yaml, I changed the following:
- in_channels to 13 and vision_layer to moco_resnet50, as @konstantinklemmer recommended
- batch_size to 1000. It is 8k in the paper, but when I set it to 8k, I run into RuntimeError: DataLoader worker (pid 2968098) is killed by signal: Killed.
- num_workers to 11 (to match the number of iterations in one epoch).
I'm also using a single A100 GPU, 11 cores, and up to 256 GB of RAM.
The problem I'm facing is that one epoch takes a really long time (probably because of loading all the images). My data is stored on an SSD with a decent connection to the A100 tower. One epoch takes approximately 36 minutes, which is about 6 times more than what is indicated in the paper (2 days for 500 epochs on a single A100 GPU, i.e. roughly 5.8 minutes per epoch).
Do you know what might be the problem?
May I ask which parameters and machines you used for training with moco_resnet50?
Kind regards, Elena
Set num_workers as high as possible without running into memory issues (the error you mention is a memory error).
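To get a feel for why 8k batches can exhaust memory while 1000 does not, here is a rough back-of-the-envelope sketch. It is only an estimate under assumptions that are not stated in the thread: 256x256 tiles, float32 values, "8k" meaning 8192, and the PyTorch DataLoader default of 2 prefetched batches per worker. The helper `in_flight_gib` is hypothetical, and it ignores per-process overhead and any cropping done by the transforms.

```python
def in_flight_gib(batch_size, num_workers, prefetch_factor=2,
                  channels=13, height=256, width=256, bytes_per_value=4):
    """Rough size in GiB of all batches the workers may hold in RAM at once.

    Assumptions (not from the thread): 256x256 tiles, float32 values,
    and prefetch_factor=2, which is the DataLoader default when num_workers > 0.
    """
    bytes_per_sample = channels * height * width * bytes_per_value
    batches_in_flight = num_workers * prefetch_factor
    return batch_size * bytes_per_sample * batches_in_flight / 2**30


# Compare the batch size used in the question with the one from the paper.
for bs in (1000, 8192):
    print(bs, round(in_flight_gib(bs, num_workers=11), 1), "GiB")
```

Under these assumptions, 11 workers hold roughly 70 GiB of prefetched data at batch size 1000, but roughly 572 GiB at batch size 8192, which is more than the 256 GB available. If that is what is happening, fewer workers or a smaller prefetch_factor at large batch sizes may help.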
Package versions can affect data loading and training times, so make sure that those are up to date.
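A quick way to see which versions are installed is sketched below. It uses only the standard library; the package names in the list are guesses at what this setup depends on, so adjust them to your environment.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return {package name: installed version or 'not installed'}."""
    report = {}
    for name in packages:
        try:
            report[name] = version(name)
        except PackageNotFoundError:
            report[name] = "not installed"
    return report


# Hypothetical package list; replace with the packages you actually use.
for name, ver in installed_versions(
        ["torch", "torchvision", "lightning", "torchgeo", "numpy"]).items():
    print(f"{name}: {ver}")
```

Posting this output along with the question also makes it easier for others to compare environments.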
For the exact specs (and cores) of the machine used for training, maybe @calebrob6 can help?
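To check whether data loading really is the bottleneck, as suspected in the question, one option is to time the loader on its own, without the model. The helper below is a minimal sketch (the name `seconds_per_batch` is made up here); it works on any iterable, so a PyTorch DataLoader can be passed in directly.

```python
import time
from itertools import islice


def seconds_per_batch(loader, max_batches=20):
    """Average wall-clock seconds needed to fetch one batch from `loader`."""
    start = time.perf_counter()
    fetched = 0
    for _ in islice(loader, max_batches):
        fetched += 1  # only fetch the batch; no forward or backward pass
    elapsed = time.perf_counter() - start
    return elapsed / fetched if fetched else float("nan")
```

Multiplying the result by the number of batches per epoch gives a rough data-loading time per epoch. If that is already close to 36 minutes, the GPU is mostly waiting for data. Note that the first few batches include worker start-up time, so a larger max_batches gives a fairer number.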