trilinos / Trilinos

Primary repository for the Trilinos Project
https://trilinos.org/

Zoltan does not respect the cmake MPI_EXEC flags #1217

Closed bathmatt closed 6 years ago

bathmatt commented 7 years ago

I set MPI_EXEC_POST_NUMPROCS_FLAGS to "-map-by;socket:PE=8;--oversubscribe", and when I run all the tests I get Zoltan failures because the Zoltan test script doesn't respect these flags:

Running: "/home/projects/pwr8-rhel73-lsf/perl/5.22.1/bin/perl" "../ctest_zoltan.pl" "--np" "4" "--debug" "--mpiexec" "/home/projects/pwr8-rhel73-lsf/openmpi/1.10.4/gcc/5.4.0/cuda/8.0.44/bin/mpiexec" "--mpiexecarg" "-np" "--pkg" "Zoltan"

--------------------------------------------------------------------------------

CTEST_FULL_OUTPUT
--np4--debug--mpiexec/home/projects/pwr8-rhel73-lsf/openmpi/1.10.4/gcc/5.4.0/cuda/8.0.44/bin/mpiexec--mpiexecarg-np--pkgZoltan
DEBUG HOSTNAME ride16 ride1
DEBUG:  package Zoltan
 05:31:24 up 21 days, 18:57,  0 users,  load average: 12.14, 17.14, 10.52
DEBUG:  mpiexec /home/projects/pwr8-rhel73-lsf/openmpi/1.10.4/gcc/5.4.0/cuda/8.0.44/bin/mpiexec --mca mpi_yield_when_idle 1 -np
DEBUG  Dir /home/jenkins/workspace/Drekar-ride-cuda-opt-all-test/build_drekar_CUDA_RELEASE_CUDA/packages/zoltan/test/hg_diag500_4 dirname diag500_4
DEBUG  Outfilebase: ;  Dropbase: 
DEBUG  Running test 0 on zdrive.inp.phg
DEBUG  Test name:  phg
DEBUG  Archfilebase: diag500_4.phg.4.; Dropbase: diag500_4.phg.drops.4.
DEBUG Executing now:  /home/projects/pwr8-rhel73-lsf/openmpi/1.10.4/gcc/5.4.0/cuda/8.0.44/bin/mpiexec --mca mpi_yield_when_idle 1 -np 4 ../zdrive.exe zdrive.inp.phg 2>&1 | tee diag500_4.phg.4.outerr

kddevin commented 7 years ago

Hi, @bathmatt . You are correct, as the perl script used for testing in Zoltan was written largely outside the Trilinos build system; indeed, since Zoltan is downloadable separately from Trilinos, the script does not use any of the cmake variables.
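
As a rough illustration of what honoring the flag would involve (a sketch only, not how ctest_zoltan.pl works): the semicolon-separated CMake list would need to be split and placed after the process count on the mpiexec command line. The helper name `build_mpi_cmd` and its argument order are hypothetical; the command is printed rather than executed, so no MPI installation is needed.

```shell
#!/bin/sh
# Sketch only: build_mpi_cmd is a hypothetical helper, not part of Zoltan or TriBITS.
# Usage: build_mpi_cmd <mpiexec> <numprocs> <post-numprocs flags as a CMake list> <exe> [args...]
build_mpi_cmd() {
  mpiexec=$1
  np=$2
  post_flags=$3
  shift 3
  # CMake lists are semicolon-separated; turn them into space-separated flags.
  flags=$(printf '%s' "$post_flags" | tr ';' ' ')
  # Place the extra flags right after "-np <N>"; omit them entirely when empty.
  printf '%s\n' "$mpiexec -np $np${flags:+ $flags} $*"
}

# Flags taken from this issue; prints the command line instead of running it.
build_mpi_cmd mpiexec 4 "-map-by;socket:PE=8;--oversubscribe" ../zdrive.exe zdrive.inp.phg
```

With the flags from this issue, this prints `mpiexec -np 4 -map-by socket:PE=8 --oversubscribe ../zdrive.exe zdrive.inp.phg`.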

How important is modifying this script to run in this environment?
If it is important, by when do you hope that it is changed?
Do you need full testing, or would some basic sanity testing be sufficient in this environment?

Thanks.

bathmatt commented 7 years ago

I'm trying to get to a clean dashboard on the testbed platforms, and lots of Zoltan issues pop up. I can skip the Zoltan tests, but when something comes up with MueLu and Zoltan I'm not sure what to do about it. Skipping them also sets a bad precedent, given that "Trilinos should pass its tests on platforms we care about."

It isn't super important for Zoltan (Z1), but it is for Zoltan2 (Z2), which doesn't appear to have this issue.

kddevin commented 7 years ago

You are correct that Zoltan2 does not have this issue, since it was born and grew up in the Trilinos environment. Zoltan is different because it was imported into Trilinos long after its birth and it leads a dual life.

Do any Zoltan tests pass on this platform? I am guessing only the perl-script based tests (like the one you show above) fail; if I am wrong, we have a bigger issue with which to deal.

bathmatt commented 7 years ago

These pass:

Zoltan_ch_serial_zoltan_parallel 1.50408 passed
Zoltan_ch_simple3d_zoltan_parallel
Zoltan_test_get_callbacks_MPI_4

kddevin commented 7 years ago

Hmm, I was hoping to see the following:

1/57 Test #1: Zoltan_stressTestGRAPH_P_0_MPI_8 ........... Passed 0.70 sec
2/57 Test #2: Zoltan_stressTestGRAPH_P_1_MPI_8 ........... Passed 0.69 sec
3/57 Test #3: Zoltan_stressTestGRAPH_0_MPI_8 ............. Passed 0.99 sec
4/57 Test #4: Zoltan_stressTestGRAPH_1_MPI_8 ............. Passed 0.99 sec
5/57 Test #5: Zoltan_stressTestPHG_0_MPI_8 ............... Passed 0.69 sec
6/57 Test #6: Zoltan_stressTestPHG_1_MPI_8 ............... Passed 0.69 sec
7/57 Test #7: Zoltan_test_get_callbacks_MPI_4 ............ Passed 0.68 sec

If the "stressTest" tests did not pass, would you please send me the output from them?

kddevin commented 7 years ago

Or maybe you are configuring with -D MPI_EXEC_MAX_NUMPROCS:STRING=4 so that they aren't run?
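
To spell out the reasoning: the stressTest names above end in `_MPI_8`, so a build capped at 4 processes would leave them out, while `Zoltan_test_get_callbacks_MPI_4` still fits. A small sketch of that comparison, assuming the process count can be read from the `_MPI_<N>` suffix of the test name (the helper `would_run` is hypothetical and is not part of TriBITS or CTest):

```shell
#!/bin/sh
# Sketch only: would_run is a hypothetical helper for illustration.
# Usage: would_run <test name ending in _MPI_<N>> <max number of processes>
would_run() {
  test_name=$1
  max_procs=$2
  # Strip everything up to and including the last "_MPI_" to get the process count.
  np=${test_name##*_MPI_}
  if [ "$np" -le "$max_procs" ]; then
    echo run
  else
    echo skipped
  fi
}

# Compare two test names from this thread against a cap of 4 processes.
for t in Zoltan_stressTestGRAPH_P_0_MPI_8 Zoltan_test_get_callbacks_MPI_4; do
  printf '%s: %s\n' "$t" "$(would_run "$t" 4)"
done
```

Under that assumption, the 8-process stress test is reported as skipped and the 4-process callback test as run.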

bathmatt commented 7 years ago

They didn't run. Can you get to this? https://jenkins-srn.sandia.gov:8443/view/Drekar/job/Drekar-ride-cuda-opt-all-test/2/ctestResult/

bathmatt commented 7 years ago

Correct, 4 procs max.

kddevin commented 7 years ago

OK; let me think about this issue a bit. By when do you hope to have this issue resolved? Thanks. In the meantime, zoltan is exercised a bit through zoltan2 tests -- maybe a little peace of mind.

No, I can't access the jenkins page -- 404 error. No worries, though; your list is sufficient.

bathmatt commented 7 years ago

It's not a huge concern, as I don't think these are causing the failures in my tests. Other issues are higher priority, but if we get to a nearly clean dashboard these will move up the list.

kddevin commented 7 years ago

@bathmatt Is this issue still important for you? I am looking at old Zoltan issues today.

kddevin commented 6 years ago

This bug will be closed unfixed. If it becomes a blocker, please re-open.