Open goord opened 3 years ago
Can you also test the `rcemip` test case and increase the workload by setting `n_col` to `64**2`? The `allsky` and `rfmip` cases are relatively cheap and are mainly used for correctness checking.
Yes for some reason I couldn't get the input file for the rcemip case... will try once more.
Send me a message on Slack if I need to clarify something, maybe `make_links.sh` misses a step.
I managed to run all three tests on the DAS.
Ah, I first needed to run `test_rcemip_input.py`. OK, here is the updated table:
case | config | longwave time [ms] | shortwave time [ms] |
---|---|---|---|
allsky | gcc | 251 | 255 |
allsky | gcc+cuda | 105 | 94 |
rfmip | gcc | 316 | 255 |
rfmip | gcc+cuda | 98 | 78 |
rcemip | gcc | 38418 | 37622 |
rcemip | gcc+cuda | 3708 | 3337 |
This is running without the 'cloud optics' flag btw.
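To make the CPU versus GPU comparison easier to read, here is a small sketch that turns the table above into speedup factors. The timings are copied from the table; the dictionary layout and the `speedups` helper are only illustrative names, not part of the code base.

```python
# Timings in ms copied from the table above: (longwave, shortwave).
timings = {
    "allsky": {"gcc": (251, 255), "gcc+cuda": (105, 94)},
    "rfmip": {"gcc": (316, 255), "gcc+cuda": (98, 78)},
    "rcemip": {"gcc": (38418, 37622), "gcc+cuda": (3708, 3337)},
}


def speedups(table):
    """Return {case: (lw_speedup, sw_speedup)} of gcc+cuda relative to gcc."""
    result = {}
    for case, configs in table.items():
        cpu_lw, cpu_sw = configs["gcc"]
        gpu_lw, gpu_sw = configs["gcc+cuda"]
        result[case] = (cpu_lw / gpu_lw, cpu_sw / gpu_sw)
    return result


for case, (lw, sw) in speedups(timings).items():
    print(f"{case:7s} longwave {lw:5.1f}x  shortwave {sw:5.1f}x")
```

For these numbers the GPU gain is about 2.4x to 3.3x for `allsky` and `rfmip` and about 10x to 11x for `rcemip`, which fits the remark that the two smaller cases are too cheap to say much about performance.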
Would it make sense if I also provide the same table for DAS-5? We have a node with an A100 (but older and slower Xeon CPUs).
You can tell me if those numbers are in line with mine Alessio.
Fixed my latest table which was transposed
I did a quick benchmark on an AWS P3 V100 instance:
case | config | longwave (ms) | shortwave (ms) |
---|---|---|---|
rcemip | gcc+cuda | 3117 | 2331 |
rcemip | gcc | 14365 | 11362 |
After using Fortran compiler flags (!) I get the following figures:
case | config | longwave (ms) | shortwave (ms) |
---|---|---|---|
rcemip | gcc+cuda | 3688 | 3319 |
rcemip | gcc | 8084 | 6982 |
So the A100s (at least the ones in juwels-booster) appear somewhat slower than the V100 card. This may indicate the code needs some re-tuning...
We have not done much tuning yet. We have rushed a little in getting a reference implementation ready that gives identical results with the CPU version, but there is probably still a lot to gain in kernel tuning.
Some ad-hoc tuning of the kernel block sizes reduces the timings to 3051 ms (lw) and 2669 ms (sw) so yes, there is definitely headroom for speedup by tuning on A100s.
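For anyone who wants to repeat this kind of block-size tuning more systematically, a sweep could look roughly like the sketch below. Everything in it is an assumption for illustration: `sweep_block_sizes`, `time_ms` and `fake_kernel_ms` are hypothetical names, and the cost model is a toy stand-in, not measured data and not how the kernels in the cuda branch are launched.

```python
import time


def time_ms(fn, repeats=3):
    """Best-of-N wall-clock time of fn() in milliseconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, (time.perf_counter() - start) * 1e3)
    return best


def sweep_block_sizes(measure, candidates):
    """Call measure(block) for each candidate; return (best_block, results)."""
    results = {block: measure(block) for block in candidates}
    best = min(results, key=results.get)
    return best, results


def fake_kernel_ms(block):
    """Hypothetical stand-in for timing a kernel with an (x, y) block shape."""
    bx, by = block
    # Toy cost model (not real data): cheapest at 128 threads per block.
    return abs(bx * by - 128) + 10.0


candidates = [(bx, by) for bx in (8, 16, 32, 64) for by in (1, 2, 4, 8)]
best, results = sweep_block_sizes(fake_kernel_ms, candidates)
print("best block shape:", best, "time [ms]:", results[best])
```

In a real run, `measure` would wrap an actual kernel launch, for example via `time_ms`, and would need to include a device synchronisation so that the asynchronous launch is fully timed.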
Further manual tuning gave:
case | config | longwave (ms) | shortwave (ms) |
---|---|---|---|
rcemip | gcc+cuda | 1439 | 1174 |
For the CUDA runs I used the cuda branch.