Closed OscarAntepara closed 5 months ago
Yes, this is on pmgpu. I haven't tried frontier.
Thanks @OscarAntepara! So the issue was that writing to MDFields (e.g., `Residual(cell,node,1)`) is expensive on GPUs?
Is it possible to refactor the code to avoid the duplication of this code? https://github.com/sandialabs/Albany/blob/master/src/landIce/evaluators/LandIce_StokesFOResid_Def.hpp#L193-L209
Right, it's better to accumulate locally in the inner loop and write to device memory once, outside the loop. That might not be the case on CPU... Oscar, could you check CPU performance?
We can probably do that with an inline device function but it might get ugly. We thought by keeping the old implementation it might improve readability (similar to what we do for the optimized gradient).
Yeah, doing the accumulation locally improves data locality on the GPU, which gives better performance. I will check the CPU performance.
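For readers following along, here is a minimal sketch of the local-accumulation idea discussed above. It is plain C++ rather than Kokkos/Phalanx code, and `Field3`, `accumulate_in_field`, and `accumulate_locally` are hypothetical names, not Albany's. It only illustrates the access pattern: one variant updates the field entry on every iteration, the other sums into a local variable and stores once.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical flat stand-in for a field indexed as (cell, node, component).
struct Field3 {
  std::size_t numNodes, numComps;
  std::vector<double> data;
  Field3(std::size_t cells, std::size_t nodes, std::size_t comps)
      : numNodes(nodes), numComps(comps), data(cells * nodes * comps, 0.0) {}
  double& operator()(std::size_t c, std::size_t n, std::size_t k) {
    return data[(c * numNodes + n) * numComps + k];
  }
};

// Variant A: every contribution is added directly to the field entry,
// so the field's memory is touched on each loop iteration.
void accumulate_in_field(Field3& res, std::size_t cell, std::size_t node,
                         const std::vector<double>& contrib) {
  for (std::size_t qp = 0; qp < contrib.size(); ++qp) {
    res(cell, node, 0) += contrib[qp];
  }
}

// Variant B: sum into a local variable (which can live in a register),
// then store to the field once after the loop.
void accumulate_locally(Field3& res, std::size_t cell, std::size_t node,
                        const std::vector<double>& contrib) {
  double r0 = 0.0;
  for (std::size_t qp = 0; qp < contrib.size(); ++qp) {
    r0 += contrib[qp];
  }
  res(cell, node, 0) = r0;
}
```

Both variants produce the same value when the field entry starts at zero; only the number of stores to the field differs.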
OK, it's just that if we need to modify that computation kernel we have to do it in multiple places.
Right, probably okay to turn into an inline function then. I don't think it will be too complicated in terms of readability.
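As a rough illustration of the inline-function suggestion, the sketch below keeps the shared computation in one place and calls it from two kernel variants. All names (`QpData`, `residual_term`, `kernel_reference`, `kernel_optimized`) are hypothetical and the arithmetic is a placeholder, not the actual StokesFOResid formula; in a Kokkos code the helper would presumably be marked with `KOKKOS_INLINE_FUNCTION`.

```cpp
#include <cstddef>

// Hypothetical per-quadrature-point inputs; the field names are illustrative.
struct QpData {
  double mu;      // viscosity-like coefficient
  double ux, uy;  // solution gradient components
  double gx, gy;  // weighted basis-function gradient components
};

// Single definition of the shared computation (placeholder arithmetic).
// A change here is picked up by every kernel that calls it.
inline double residual_term(const QpData& q) {
  return q.mu * (q.ux * q.gx + q.uy * q.gy);
}

// Caller 1: adds each term straight into the output slot.
inline void kernel_reference(const QpData* qps, std::size_t numQPs, double& out) {
  out = 0.0;
  for (std::size_t qp = 0; qp < numQPs; ++qp) {
    out += residual_term(qps[qp]);
  }
}

// Caller 2: accumulates locally and stores once.
inline void kernel_optimized(const QpData* qps, std::size_t numQPs, double& out) {
  double r = 0.0;
  for (std::size_t qp = 0; qp < numQPs; ++qp) {
    r += residual_term(qps[qp]);
  }
  out = r;
}
```

With this layout, a change to the computation is made once in `residual_term` instead of in each kernel.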
For the 16km test with CPUs there is not much difference between the original and the new code.
Original:
Phalanx: Evaluator 15: [Residual] StokesFOResid
New:
Phalanx: Evaluator 15: [Residual] StokesFOResid
I have modified the code to avoid the code duplication mentioned before.
Thanks! I think it's a reasonable approach. Have you tested it already? I'm not sure if we still have some tests using Tets, in which case you would have to add a specialization for `numNodes==4`.
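To make the specialization point concrete, here is a hedged sketch of dispatching a runtime node count to kernels with compile-time loop bounds. The function names are hypothetical and the body is a trivial sum; it only shows why each supported element type (8 nodes for hexahedra, 6 for wedges, 4 for tetrahedra) needs its own case, and how a generic fallback covers anything else.

```cpp
#include <cstddef>
#include <vector>

// Loop bound known at compile time, so the compiler can unroll it.
template <std::size_t NumNodes>
double sum_nodes_fixed(const double* v) {
  double s = 0.0;
  for (std::size_t n = 0; n < NumNodes; ++n) {
    s += v[n];
  }
  return s;
}

// Fallback with a runtime loop bound.
double sum_nodes_generic(const double* v, std::size_t numNodes) {
  double s = 0.0;
  for (std::size_t n = 0; n < numNodes; ++n) {
    s += v[n];
  }
  return s;
}

// Map the runtime node count onto the compile-time variants.
double sum_nodes(const std::vector<double>& v) {
  switch (v.size()) {
    case 8: return sum_nodes_fixed<8>(v.data());  // hexahedra
    case 6: return sum_nodes_fixed<6>(v.data());  // wedges
    case 4: return sum_nodes_fixed<4>(v.data());  // tetrahedra
    default: return sum_nodes_generic(v.data(), v.size());
  }
}
```

Without the `case 4` branch and a fallback, a tetrahedral mesh would have no matching kernel, which is the concern raised above.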
I have only tested it for the ant-16km case, which uses numNodes=8. Is there a test for LandIce3D with numNodes=4? Does that use only tetrahedra?
We used to have a lot, but we converted most of them to Wedges. I couldn't find one with a quick search. Can you simply run Albany landice ctests?
Yeah, I got this:
Test project /pscratch/sd/o/oantepar/fanssie/builds/albany_pm_gpu_nouvm_gnu_sfad16
Start 1: unit_NullSpaceUtils_UnitTest_Serial
1/20 Test #1: unit_NullSpaceUtils_UnitTest_Serial ............................ Passed 9.43 sec
Start 2: unit_NullSpaceUtils_UnitTest_Parallel
2/20 Test #2: unit_NullSpaceUtils_UnitTest_Parallel .......................... Passed 14.38 sec
Start 3: unit_StringUtils_UnitTest
3/20 Test #3: unit_StringUtils_UnitTest ...................................... Passed 88.81 sec
Start 4: unit_HessianVecFad_UnitTest
4/20 Test #4: unit_HessianVecFad_UnitTest .................................... Passed 24.46 sec
Start 5: disc_stk_STKDisc_UnitTest_Serial
5/20 Test #5: disc_stk_STKDisc_UnitTest_Serial ............................... Passed 60.67 sec
Start 6: disc_stk_STKDisc_UnitTest_Parallel
6/20 Test #6: disc_stk_STKDisc_UnitTest_Parallel ............................. Passed 80.63 sec
Start 7: unit_evaluators_DOFInterpolation_UnitTest_Serial
7/20 Test #7: unit_evaluators_DOFInterpolation_UnitTest_Serial ............... Passed 35.56 sec
Start 8: unit_evaluators_DOFInterpolation_UnitTest_Parallel
8/20 Test #8: unit_evaluators_DOFInterpolation_UnitTest_Parallel ............. Passed 29.89 sec
Start 9: unit_evaluators_GatherSolution_UnitTest_Serial
9/20 Test #9: unit_evaluators_GatherSolution_UnitTest_Serial ................. Passed 29.02 sec
Start 10: unit_evaluators_GatherSolution_UnitTest_Parallel
10/20 Test #10: unit_evaluators_GatherSolution_UnitTest_Parallel ............... Passed 31.00 sec
Start 11: unit_evaluators_GatherDistributedParameter_UnitTest_Serial
11/20 Test #11: unit_evaluators_GatherDistributedParameter_UnitTest_Serial ..... Passed 31.92 sec
Start 12: unit_evaluators_GatherDistributedParameter_UnitTest_Parallel
12/20 Test #12: unit_evaluators_GatherDistributedParameter_UnitTest_Parallel ... Passed 10.94 sec
Start 13: unit_evaluators_ScatterResidual_UnitTest_Serial
13/20 Test #13: unit_evaluators_ScatterResidual_UnitTest_Serial ................ Failed 82.82 sec
Start 14: unit_evaluators_ScatterResidual_UnitTest_Parallel
14/20 Test #14: unit_evaluators_ScatterResidual_UnitTest_Parallel .............. Failed 47.41 sec
Start 15: unit_evaluators_ScatterScalarResponse_UnitTest_Serial
15/20 Test #15: unit_evaluators_ScatterScalarResponse_UnitTest_Serial .......... Failed 13.53 sec
Start 16: unit_evaluators_ScatterScalarResponse_UnitTest_Parallel
16/20 Test #16: unit_evaluators_ScatterScalarResponse_UnitTest_Parallel ........ Failed 152.74 sec
Start 17: LandIce_FO_Dome_Ascii
17/20 Test #17: LandIce_FO_Dome_Ascii .......................................... Passed 45.23 sec
Start 18: LandIce_FO_Dome_Restart
18/20 Test #18: LandIce_FO_Dome_Restart ........................................ Passed 58.91 sec
Start 19: landIce_FO_AIS_16km_decompMesh
19/20 Test #19: landIce_FO_AIS_16km_decompMesh ................................. Failed 43.10 sec
Start 20: landIce_FO_AIS_16km_MueLuKokkos
Failed test dependencies: landIce_FO_AIS_16km_decompMesh
20/20 Test #20: landIce_FO_AIS_16km_MueLuKokkos ................................ Not Run 0.00 sec

70% tests passed, 6 tests failed out of 20

Label Time Summary:
Forward = 104.14 sec*proc (3 tests)
LandIce = 104.14 sec*proc (3 tests)
Serial = 104.14 sec*proc (2 tests)
unit = 743.20 sec*proc (16 tests)

Total Test time (real) = 890.49 sec

The following tests FAILED:
13 - unit_evaluators_ScatterResidual_UnitTest_Serial (Failed)
14 - unit_evaluators_ScatterResidual_UnitTest_Parallel (Failed)
15 - unit_evaluators_ScatterScalarResponse_UnitTest_Serial (Failed)
16 - unit_evaluators_ScatterScalarResponse_UnitTest_Parallel (Failed)
19 - landIce_FO_AIS_16km_decompMesh (Failed)
20 - landIce_FO_AIS_16km_MueLuKokkos (Not Run)
Errors while running CTest
For 19 and 20, you probably need to put the Trilinos libs in your `LD_LIBRARY_PATH`.
Trueee, now I got this:
Test project /pscratch/sd/o/oantepar/fanssie/builds/albany_pm_gpu_nouvm_gnu_sfad16
Start 1: unit_NullSpaceUtils_UnitTest_Serial
1/20 Test #1: unit_NullSpaceUtils_UnitTest_Serial ............................ Passed 13.21 sec
Start 2: unit_NullSpaceUtils_UnitTest_Parallel
2/20 Test #2: unit_NullSpaceUtils_UnitTest_Parallel .......................... Passed 32.46 sec
Start 3: unit_StringUtils_UnitTest
3/20 Test #3: unit_StringUtils_UnitTest ...................................... Passed 55.65 sec
Start 4: unit_HessianVecFad_UnitTest
4/20 Test #4: unit_HessianVecFad_UnitTest .................................... Passed 39.80 sec
Start 5: disc_stk_STKDisc_UnitTest_Serial
5/20 Test #5: disc_stk_STKDisc_UnitTest_Serial ............................... Passed 11.90 sec
Start 6: disc_stk_STKDisc_UnitTest_Parallel
6/20 Test #6: disc_stk_STKDisc_UnitTest_Parallel ............................. Passed 5.44 sec
Start 7: unit_evaluators_DOFInterpolation_UnitTest_Serial
7/20 Test #7: unit_evaluators_DOFInterpolation_UnitTest_Serial ............... Passed 15.41 sec
Start 8: unit_evaluators_DOFInterpolation_UnitTest_Parallel
8/20 Test #8: unit_evaluators_DOFInterpolation_UnitTest_Parallel ............. Passed 16.10 sec
Start 9: unit_evaluators_GatherSolution_UnitTest_Serial
9/20 Test #9: unit_evaluators_GatherSolution_UnitTest_Serial ................. Passed 7.26 sec
Start 10: unit_evaluators_GatherSolution_UnitTest_Parallel
10/20 Test #10: unit_evaluators_GatherSolution_UnitTest_Parallel ............... Passed 62.92 sec
Start 11: unit_evaluators_GatherDistributedParameter_UnitTest_Serial
11/20 Test #11: unit_evaluators_GatherDistributedParameter_UnitTest_Serial ..... Passed 60.16 sec
Start 12: unit_evaluators_GatherDistributedParameter_UnitTest_Parallel
12/20 Test #12: unit_evaluators_GatherDistributedParameter_UnitTest_Parallel ... Passed 41.62 sec
Start 13: unit_evaluators_ScatterResidual_UnitTest_Serial
13/20 Test #13: unit_evaluators_ScatterResidual_UnitTest_Serial ................ Failed 12.05 sec
Start 14: unit_evaluators_ScatterResidual_UnitTest_Parallel
14/20 Test #14: unit_evaluators_ScatterResidual_UnitTest_Parallel .............. Failed 29.00 sec
Start 15: unit_evaluators_ScatterScalarResponse_UnitTest_Serial
15/20 Test #15: unit_evaluators_ScatterScalarResponse_UnitTest_Serial .......... Failed 51.83 sec
Start 16: unit_evaluators_ScatterScalarResponse_UnitTest_Parallel
16/20 Test #16: unit_evaluators_ScatterScalarResponse_UnitTest_Parallel ........ Failed 8.81 sec
Start 17: LandIce_FO_Dome_Ascii
17/20 Test #17: LandIce_FO_Dome_Ascii .......................................... Passed 11.40 sec
Start 18: LandIce_FO_Dome_Restart
18/20 Test #18: LandIce_FO_Dome_Restart ........................................ Passed 9.23 sec
Start 19: landIce_FO_AIS_16km_decompMesh
19/20 Test #19: landIce_FO_AIS_16km_decompMesh ................................. Passed 22.59 sec
Start 20: landIce_FO_AIS_16km_MueLuKokkos
20/20 Test #20: landIce_FO_AIS_16km_MueLuKokkos ................................ Passed 14.41 sec

80% tests passed, 4 tests failed out of 20

Label Time Summary:
Forward = 35.04 sec*proc (3 tests)
LandIce = 35.04 sec*proc (3 tests)
Serial = 20.63 sec*proc (2 tests)
unit = 463.64 sec*proc (16 tests)

Total Test time (real) = 521.31 sec

The following tests FAILED:
13 - unit_evaluators_ScatterResidual_UnitTest_Serial (Failed)
14 - unit_evaluators_ScatterResidual_UnitTest_Parallel (Failed)
15 - unit_evaluators_ScatterScalarResponse_UnitTest_Serial (Failed)
16 - unit_evaluators_ScatterScalarResponse_UnitTest_Parallel (Failed)
Errors while running CTest
OK, I don't think that your changes are affecting the unit tests, so I think you are good to go.
In case you want to understand what's going on, you can run single tests with verbose output doing:
ctest -VV -R unit_evaluators_ScatterResidual_UnitTest_Serial
If people are curious, I'm seeing the same errors that you have here: https://my.cdash.org/viewTest.php?buildid=2560428
Thanks, we should open an issue about those tests failing.
Those tests are currently not expected to pass since these are uvm-free builds. If you want I can start an issue that tracks the status of uvm-free tests and which are currently known to fail.
Oh, OK. Should we disable them in UVM-free builds? Anyway, I'm fine either way, and it's OK to do nothing if you plan to make these tests work in UVM-free builds.
Frontier numbers from Oscar:
Original:
Residual: 4ms
Jacobian: 71ms
New:
Residual: 1ms
Jacobian: 44ms
This PR optimizes StokesFOResid for LandIce 3D by cleaning up the code, removing if statements inside the kernel, doing local accumulation, and using compile-time constants for the loop bounds. For the ant-16km test, the original code timings are:
New code timings are: