MueLu: ParameterListInterpreter test on GPU

MueLu: ParameterListInterpreter test on GPU #6866

Closed lucbv closed 4 years ago

lucbv commented 4 years ago

Question

@trilinos/muelu @jhux2 @csiefer2 @cgcgcg

Christian did some work a month or two ago on the ParameterListInterpreter tests to get output from all ranks and to clean up some of the logic. However, we kicked the can down the road regarding that test's behavior on GPU. The main issue is that the output generated on GPU is not the same as the output generated on CPU, which is natural since the algorithms are quite different on that hardware.

Here comes the question: what do we want to do on GPU?

[ ] add a new gold file folder for GPU and hope that it works for all GPUs (K40, P100, V100 and down the road AMD and Intel hardware)
[ ] disable that test when it comes to GPU testing

Would anyone like to offer their wisdom on this issue?

cgcgcg commented 4 years ago

Another option is to only test deterministic algorithms. That's not great, since it's not what we will use in practice. However, it still would do what the test advertises, i.e. test the parameter list interpreter.

jhux2 commented 4 years ago

For a given GPU architecture, should we expect deterministic behavior, e.g., from aggregation?

lucbv commented 4 years ago

@jhux2 to be honest, I'm not sure what the exact state of affairs is right now, since @brian-kelley is working on the coloring algorithms in KokkosKernels (KK). If we use a parallel coloring algorithm, it is unlikely that we will get the same coloring twice, and hence the same aggregates... Also, we do not have a fully deterministic aggregation stack: some phases are not yet implemented deterministically. Finally, I do not think that we have actually tested the behavior of the deterministic algorithms all that much.
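To make the order dependence concrete, here is a minimal sketch in plain Python. It is not MueLu or KokkosKernels code; `greedy_coloring` and the small path graph are made up for illustration. It only shows that a greedy coloring depends on the order in which vertices are visited, which is exactly what a parallel run does not control.

```python
def greedy_coloring(adj, order):
    """Visit vertices in `order`; give each the smallest color unused by its colored neighbors."""
    colors = {}
    for v in order:
        used = {colors[u] for u in adj[v] if u in colors}
        c = 0
        while c in used:
            c += 1
        colors[v] = c
    return colors


def is_valid_coloring(adj, colors):
    """True if no edge joins two vertices of the same color."""
    return all(colors[v] != colors[u] for v in adj for u in adj[v])


# Path graph 0 - 1 - 2 - 3 (hypothetical toy input).
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}

# Two visiting orders, standing in for two different thread schedules.
coloring_a = greedy_coloring(path, [0, 1, 2, 3])
coloring_b = greedy_coloring(path, [0, 3, 1, 2])

print(coloring_a)
print(coloring_b)
```

Both results are valid colorings, but they differ, and they do not even use the same number of colors. Anything built on top of the coloring, such as the aggregates, would then differ too.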

brian-kelley commented 4 years ago

@lucbv I was only improving the non-deterministic parallel and the sequential (host code) dist-2 colorings. The sequential version (including the device -> host -> device deep copies) is now faster than any of the old versions by far, so maybe we could just use that for these tests. I still have to get that checked in, though. I was waiting to solve a bug in one of the phases, because large-ish distributed problems (8 ranks, 250^3 brick3D) crash randomly in aggregation. Until that is fixed, I can't prove it's not a bug in my new coloring.

Since there seems to be demand, I think it would be doable to implement a reasonably fast parallel deterministic dist-2 coloring in terms of a triangular structure-only SPGEMM (the one used for triangle counting) followed by a deterministic dist-1 coloring. I also still have your PDF from 2018 with the dependency list algorithm, and tiebreaks using degree and LID.
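A rough sketch of that idea in plain Python, under several assumptions: it is not the KokkosKernels implementation, the function names are hypothetical, and it uses a plain symbolic A + A^2 instead of the triangular SPGEMM mentioned above. The dist-1 step is a Jones-Plassmann-style round scheme in which a vertex is colored only once it beats all of its uncolored neighbors on the (degree, LID) priority, so the outcome does not depend on the order in which vertices are processed within a round.

```python
def distance2_structure(adj):
    """Structure-only A + A^2 without the diagonal: link v to every vertex within two hops."""
    adj2 = {}
    for v, nbrs in adj.items():
        reach = set(nbrs)
        for u in nbrs:
            reach.update(adj[u])
        reach.discard(v)
        adj2[v] = sorted(reach)
    return adj2


def deterministic_coloring(adj):
    """Dist-1 coloring with rounds; ties are broken by (degree, LID), which is unique per vertex."""
    priority = {v: (len(adj[v]), v) for v in adj}
    colors = {}
    uncolored = set(adj)
    while uncolored:
        # Winners are local maxima among uncolored neighbors, so they form an independent set.
        winners = [
            v for v in uncolored
            if all(u == v or u not in uncolored or priority[u] < priority[v] for u in adj[v])
        ]
        new_colors = {}
        for v in winners:
            # Only colors fixed in earlier rounds are consulted, never those of this round.
            used = {colors[u] for u in adj[v] if u in colors}
            c = 0
            while c in used:
                c += 1
            new_colors[v] = c
        colors.update(new_colors)
        uncolored.difference_update(winners)
    return colors


def distance2_coloring(adj):
    """Dist-2 coloring of adj, obtained as a dist-1 coloring of its two-hop structure."""
    return deterministic_coloring(distance2_structure(adj))


# Path graph 0 - 1 - 2 - 3 - 4 (hypothetical toy input).
path5 = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
colors5 = distance2_coloring(path5)
print(colors5)
```

Because every decision in a round reads only the state left by the previous round, the rounds could in principle run in parallel and still give the same coloring each time, which is the property a gold file comparison needs.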

brian-kelley commented 4 years ago

Btw, my 2c is to not disable this test on GPU.

cgcgcg commented 4 years ago

Can this be closed?