psteinb / sota_on_uncertainties

trying to obtain uncertainties from training accuracies using timm
BSD 3-Clause "New" or "Revised" License
9 stars 0 forks

Jobs run into time limit on cluster #8

Open zyzzyxdonta opened 2 years ago

zyzzyxdonta commented 2 years ago

With the hemera config, (some?) jobs for the rules imagenette2_resnet50_default and imagenette2_resnext50_default run into the time limit of 75 minutes. I think this only happens when they are scheduled on the nodes with P100 cards. The jobs on V100 cards seem to run fine with just over 60 minutes runtime.

So either the config should set a longer runtime for these rules, or the partition should be set to gpu_v100. Since few V100 cards are freely available on hemera, I think the first option is best. I don't know how much the runtime should be increased, though.

psteinb commented 2 years ago

Interesting. Thank you for reporting this. Then let's go for 120 min.
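
For illustration, a per-rule override could look roughly like the sketch below. This is not the repository's actual config: the names `DEFAULT_MINUTES`, `RULE_MINUTES` and `time_limit` are hypothetical, it assumes 75 minutes is the default for all other rules, and it assumes the scheduler accepts limits as `HH:MM:SS` strings.

```python
# Hypothetical sketch: per-rule wall-clock limits in minutes.
# Assumption: 75 minutes is the default limit for every other rule.
DEFAULT_MINUTES = 75

# Assumption: only the two rules named in this issue get a longer limit.
RULE_MINUTES = {
    "imagenette2_resnet50_default": 120,
    "imagenette2_resnext50_default": 120,
}


def time_limit(rule: str) -> str:
    """Return the time limit for a rule as an HH:MM:SS string."""
    minutes = RULE_MINUTES.get(rule, DEFAULT_MINUTES)
    hours, mins = divmod(minutes, 60)
    return f"{hours:02d}:{mins:02d}:00"
```

Keeping the override per rule leaves the shorter default in place for jobs that already finish comfortably in time.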

psteinb commented 2 years ago

If you have other time limits in mind, please let me know.

zyzzyxdonta commented 2 years ago

I tried 90 and that wasn't enough. I'll try 120 tomorrow.

zyzzyxdonta commented 2 years ago
  1. 120 minutes worked. It was quite close, though. The last job finished after about 1h 55min. So maybe go with 135 to be safe?
  2. I confirmed that all cancelled jobs were on P100 GPUs.
  3. snakemake --report report.html is quite interesting (even if this picture isn't too precise):

[Image: visualization]

For resnet and resnext, the two GPU types can clearly be distinguished 😄
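
As a rough way to arrive at a number like 135, one could pad the slowest observed runtime by a safety margin and round up to a convenient step. The sketch below is only an illustration: the 15% margin and the 15-minute step are assumptions, and `padded_limit` is a hypothetical helper, not part of the workflow.

```python
import math


def padded_limit(observed_minutes: float, margin: float = 0.15, step: int = 15) -> int:
    """Pad an observed runtime by `margin` and round up to a multiple of `step` minutes.

    Both defaults are assumptions chosen for illustration.
    """
    padded = observed_minutes * (1.0 + margin)
    return math.ceil(padded / step) * step


# Slowest job on the P100 nodes finished after about 1h 55min = 115 minutes.
print(padded_limit(115))
```

With these assumed defaults, the 115-minute run gives a limit of 135 minutes.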