trilinos / Trilinos

Primary repository for the Trilinos Project
https://trilinos.org/

Random mass test failures showing "Error initializing RM connection. Exiting" on 'vortex' #6861

Open bartlettroscoe opened 4 years ago

bartlettroscoe commented 4 years ago

CC: @jjellio

As shown in this query, we are getting random mass test failures on 'vortex'. As shown in that query, when it occurs in a given build, it impacts over a thousand tests, and which builds and which days it impacts appears to be random. When it occurs, the failing tests show output like:

Error: Remote JSM server is not responding on host vortex5902-19-2020 03:31:02:827 114114 main: Error initializing RM connection. Exiting.
bartlettroscoe commented 4 years ago

@jjellio, I am not sure how much detail we can put in this issue but at least this gets this on the board.

I am setting up filters to exclude tests that show this failure from the CDash summary emails:

That will allow me to start cleaning up the failing tests in these builds without letting these failures flood the emails and make them worthless.

bartlettroscoe commented 4 years ago

I just pushed the following TrilinosATDMStatus repo commit:

*** Base Git Repo: TrilinosATDMStatus
commit b3c45994e778b3d784498f20c8116c419f5a08ed
Author: Roscoe A. Bartlett <rabartl@sandia.gov>
Date:   Wed Feb 19 07:43:36 2020 -0700

    Filter out random mass 'JSM server not responding' errors (trilinos/Trilinos#6861)

    It seems when it occurs in a build these test failures showing:

       Error: Remote JSM server is not responding on host vortexXXX

    are massive, taking down hundreds to thousands of tests once they start.

    Adding these filters up front just filters them out.

    Note that filtering these tests out beforehand will result in tracked tests
    being listed as missing (twim) if they have this error.

    But this way, we can start to triage all of the builds on vortex.

    I also updated the builds filter to allow the build:

      Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-2019.06.24_static_dbg_cuda-aware-mpi

    now that we can filter out mass failures due to this "Remote JSM server" issue.

M       trilinos_atdm_builds_status.sh
M       trilinos_atdm_specialized_cleanup_builds_status.sh

I also updated the Jenkins job:

so that it submits to the 'Specialized' CDash group, which will let me clean up those tests.

Hopefully this will allow us to clean up the failing tests even while these random mass test failures showing Remote JSM server is not responding on host are occurring. Even if we only get results every other day, that should be enough to maintain these builds and get useful test results.

bartlettroscoe commented 4 years ago

@jjellio, I don't know if this is related, but as shown in this query, we are also seeing failing tests showing errors like:

[csmapi][error] recvmsg timed out. rc=-1
[csmapi][error] RECEIVE ERROR. rlen=-1
[csmapi][error] /home/ppsbld/workspace/PUBLIC_CAST_V1.6.x_ppc64LE_RH7.5_ProdBuild/csmnet/src/C/csm_network_local.c-673: Client-Daemon connection error. errno=11
csm_net_unix_Connect: Resource temporarily unavailable
[csmapi][error] csm_net_unix_Connect() failed: /run/csmd.sock
Error. Failed to initialize CSM library.
Error: It is only possible to use js commands within a job allocation unless CSM is running
02-19-2020 04:07:02:845 50896 main: Error initializing RM connection. Exiting.

And as shown in this query we are seeing random failures showing:

Warning: PAMI CUDA HOOK disabled

What is that?

I will filter all of these out of the CDash summary email filter.

jjellio commented 4 years ago

Just a warning: because this is a single-process run, it is disabling the 'cuda hooks' (my 2019 issue).

We can see that from the full output thanks to the patch we pushed through:

AFTER: export TPETRA_ASSUME_CUDA_AWARE_MPI=; jsrun  '-M -disable_gpu_hooks'
WARNING, you have not set TPETRA_ASSUME_CUDA_AWARE_MPI=0 or 1, defaulting to TPETRA_ASSUME_CUDA_AWARE_MPI=0
BEFORE: jsrun  '-p' '1' '--rs_per_socket' '4' '/vscratch1/jenkins/vortex-slave/workspace/Trilinos-atdm-ats2-gnu-7.3.1-spmpi-2019.06.24_serial_static_opt/SRC_AND_BUILD/BUILD/packages/rol/example/PDE-OPT/helmholtz/ROL_example_PDE-OPT_helmholtz_example_02.exe' 'PrintItAll'
AFTER: export TPETRA_ASSUME_CUDA_AWARE_MPI=; jsrun  '-M -disable_gpu_hooks' '-p' '1' '--rs_per_socket' '4' '/vscratch1/jenkins/vortex-slave/workspace/Trilinos-atdm-ats2-gnu-7.3.1-spmpi-2019.06.24_serial_static_opt/SRC_AND_BUILD/BUILD/packages/rol/example/PDE-OPT/helmholtz/ROL_example_PDE-OPT_helmholtz_example_02.exe' 'PrintItAll'
out_file=4dd1a321b4bcc5c1c294a7bea8279523.out
Warning: PAMI CUDA HOOK disabled
jjellio commented 4 years ago

If we don't disable the cuda hooks, and the process doesn't call MPI_Init first, then the entire test will fail. It is just a benign warning.
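
For illustration, here is a minimal sketch of the kind of wrapper logic implied by the BEFORE/AFTER lines above. This is not the actual patch; the single-rank detection and the REAL_JSRUN placeholder are assumptions made for this sketch only:

#!/bin/bash
# Hypothetical jsrun wrapper sketch: when only a single rank is requested,
# add '-M -disable_gpu_hooks' so a test that never calls MPI_Init does not
# trip over the PAMI CUDA hooks (it only prints the benign warning).
args=("$@")
np=""
for ((i = 0; i < ${#args[@]}; i++)); do
  if [ "${args[$i]}" = "-p" ]; then
    np="${args[$((i + 1))]}"
  fi
done
extra=()
if [ "$np" = "1" ]; then
  extra=(-M -disable_gpu_hooks)
fi
echo "AFTER: jsrun ${extra[*]} ${args[*]}"
# REAL_JSRUN is a hypothetical placeholder for the underlying jsrun binary.
exec "${REAL_JSRUN:-jsrun}" "${extra[@]}" "${args[@]}"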

bartlettroscoe commented 4 years ago

@jjellio, as shown here, what does the error:

Error: error in ptssup_mkcltsock_afunix()
02-14-2020 04:59:12:982 24104 main: Error initializing RM connection. Exiting.

mean?

jjellio commented 4 years ago

I'll be optimistic and assume it is an extraterrestrial offering of good will and happiness (I have no idea!)

As users, I don't think we can drill into the JSM/RM connection stuff. The admins are taking a careful look at the software stack to see if perhaps some component is missing. All of this CI / automated testing is going to tease out all kinds of errors (and the fact that we test the NP=1 case without MPI_Init is going to exercise functionality that I do not think many have used - but we need to do it; Trilinos' integrity is verified by both sequential and parallel unit tests).

I have a very good reproducer for one of the RM connection issues, and I passed on some example scripts that can reproduce the failure and demonstrate proper behavior. So hopefully we can get this hammered out. My build tools on Vortex use jsrun heavily to do on-node configures and compiles, so this has hindered me as well.

bartlettroscoe commented 4 years ago

NOTE: Issue #6875 is really a duplicate of this issue.

FYI: With the Trilinos PR testing system down with no ETA to fix, I manually merged the branch in PR #6876 to the 'atdm-nightly-manual-updates' branch just now in commit 47b673b so it will be in the 'atdm-nightly' branch tonight and we will see this running tomorrow.

Putting this Issue in review to see if this fixes the problem.

jjellio commented 4 years ago

This is not a duplicate of #6875. There should still be JSM RM connection issues (they are being patched by Spectrum), but the issue I raised should help with stability. It is also possible that our collaboration with LLNL actually fixed another issue, so I am curious how impactful the results of #6875 are. It would be nice if we made a huge dent in this broader RM connection problem.

bartlettroscoe commented 4 years ago

@jjellio, even after the update from PR #6876, we are still getting some errors shown here like:

WARNING, you have not set TPETRA_ASSUME_CUDA_AWARE_MPI=0 or 1, defaulting to TPETRA_ASSUME_CUDA_AWARE_MPI=0
BEFORE: jsrun  '-p' '4' '--rs_per_socket' '4' '/vscratch1/jenkins/vortex-slave/workspace/Trilinos-atdm-ats2-gnu-7.3.1-spmpi-2019.06.24_serial_static_opt/SRC_AND_BUILD/BUILD/packages/panzer/adapters-stk/example/MixedCurlLaplacianExample/PanzerAdaptersSTK_MixedCurlLaplacianExample.exe' '--use-tpetra' '--use-twod' '--cell=Quad' '--x-elements=16' '--y-elements=16' '--z-elements=4' '--basis-order=2'
AFTER: export TPETRA_ASSUME_CUDA_AWARE_MPI=; jsrun  '-p' '4' '--rs_per_socket' '4' '/vscratch1/jenkins/vortex-slave/workspace/Trilinos-atdm-ats2-gnu-7.3.1-spmpi-2019.06.24_serial_static_opt/SRC_AND_BUILD/BUILD/packages/panzer/adapters-stk/example/MixedCurlLaplacianExample/PanzerAdaptersSTK_MixedCurlLaplacianExample.exe' '--use-tpetra' '--use-twod' '--cell=Quad' '--x-elements=16' '--y-elements=16' '--z-elements=4' '--basis-order=2'
out_file=f5ab0fc965e1c8b955384b56a322d090.out
[vortex3:45451] OPAL ERROR: Unreachable in file ext3x_client.c at line 112
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[vortex3:45451] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!

What does the error OPAL ERROR: Unreachable in file ext3x_client.c mean? From:

It looks like MPI_Init() is trying to be run twice on the same MPI rank (even if done by two processes). How can anything we are doing cause this? Is this an LSF bug? Have you seen this before?

bartlettroscoe commented 4 years ago

FYI: I have changed the severity of this from ATDM Sev: Critical to ATDM Sev: Nonblocker. This is because I have updated the driver scripts that monitor the ATDM Trilinos builds on CDash to filter out tests that show these errors, as described above. Also, as one can see from looking at the ats2 builds over the last 2 weeks, this only occurs in about 1/3rd of the builds or less, so we are still getting most test results in 2/3rds of the builds. That is good enough to keep working on the cleanup of the remaining tests that fail due to other issues.

jjellio commented 4 years ago

Ross, this is good! It looks like a different type of error, so maybe we actually made some real progress on that elusive RM connection problem!

I am having a face to face w/LLNL folks, and I will ask them about this.

bartlettroscoe commented 4 years ago

Ross, this is good! It looks like a different type of error, so maybe we actually made some real progress on that elusive RM connection problem!

@jjellio, right. We don't seem to be seeing any more of the failures like:

Error: error in ptssup_mkcltsock_afunix()
02-14-2020 04:59:12:982 24104 main: Error initializing RM connection. Exiting.

However, just to be clear, we are still seeing mass random test failures showing Error: Remote JSM server is not responding on host on 'vortex'. For example, just from today you see:

showing a ton of these failing tests.

We are told that the March upgrade of 'vortex' may fix these.

jjellio commented 4 years ago

This may be related to using LD_PRELOAD in the wrapper, but when someone undoes that, it is going to cause the tests that allocate before MPI_Init to fail (so we decided to delay it).

It would be worthwhile to delete the LD_PRELOAD stuff and see if that resolves the MPI_init failures (by using LD_PRELOAD I can promise we are doing things outside the way it was intended)

bartlettroscoe commented 4 years ago

FYI: It was suggested in the email chain captured in CDOFA-94 that one workaround would be to kill the allocation and rerun the failing tests over and over again whenever we see the first test failure showing Error: Remote JSM server is not responding on host. That would make for a very complex implementation of the ctest -S driver, and I am not sure it would even be possible without extending ctest; I don't even want to think about something like that. An alternative approach would be to run ctest on the login node and get a new interactive bsub allocation for each individual test, which would resolve the problem. But boy would that be slow, as it can take several seconds (or longer) to get an allocation and some tests in Trilinos finish in a fraction of a second. Each build has around 2200 tests, so that would be about 2200 interactive bsub calls per build!
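
For scale, here is a rough sketch of what that kill-and-rerun idea would look like at the allocation level. This is purely illustrative: the queue name, node count, retry limit, and log handling are made up, and a real ctest -S driver implementation would need much more bookkeeping:

# Illustrative retry loop only (queue, node count, and retry limit are made up):
for attempt in 1 2 3; do
  bsub -K -q pbatch -nnodes 1 \
    ctest -j16 --output-log ctest_attempt_${attempt}.log
  if ! grep -q "Remote JSM server is not responding on host" \
       ctest_attempt_${attempt}.log; then
    break  # no JSM failures in this pass; keep these results
  fi
  echo "JSM failures detected on attempt ${attempt}; retrying in a fresh allocation"
  # A real implementation would switch to 'ctest --rerun-failed' on later
  # passes and would have to separate infrastructure failures from real
  # test failures, which is what makes this so complex.
done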

Here are some options for what to do:

jjellio commented 4 years ago

Ross, would it make sense to lower the cadence of testing on vortex? That is, run the test suite weekly and just keep rerunning it until it works. That would still let the tests run, but it would give a coarser granularity for figuring out the culprit if a test failed. I tend to think running (and rerunning over a week's time) would at least get the library tested - which may be a better alternative than waiting for the machine to work.

bartlettroscoe commented 4 years ago

@jjellio, all things considered, given the current state of things, in my opinion what we have now is fine and is the best we can do until they can make the system more stable. And for the most part, running the test suite "until it works" would never terminate because there are always at least some failing tests (just look at the rest of the ATDM Trilinos builds).

It is likely better to discuss this offline than to try to do this in detail over github issue comments.

bartlettroscoe commented 4 years ago

From the updated email thread documented in CDOFA-94, it seems that the upgrade of 'vortex' that would fix the problems with jsrun will not occur until April (or later?). The proposed workaround is to run fewer than 800 jsrun jobs in a single bsub allocation; they claim that should be robust. Therefore, I think we should trim down the Trilinos test suite we run on 'vortex' to just a few of the critical and actively developed packages like Kokkos, KokkosKernels, Tpetra, Zoltan2, Amesos2, SEACAS, MueLu, and Panzer. Adding up the number of tests for these packages for the build Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-2019.06.24_static_opt_cuda-aware-mpi on 2020-02-28 shown here gives 727 tests.

So I guess the plan needs to be to trim down the tests that we run on 'vortex'. I think we should still build all of the tests in all of the packages; we will just run a subset of them. To do that, I will need to make a small extension to TriBITS. I will do that when I get some time.
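
As a stopgap, something close to this can already be approximated with CTest's own label filtering, assuming the tests carry their package names as CTest labels (which TriBITS builds typically do). This is just a sketch, not the planned TriBITS extension:

# Build all of the tests as usual, then only run the ones from the selected
# packages (the label regex below assumes package-name labels on the tests):
ctest -j16 -L "Kokkos|KokkosKernels|Tpetra|Zoltan2|Amesos2|SEACAS|MueLu|Panzer"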

bartlettroscoe commented 4 years ago

The update from the admins documented in CDOFA-94 is that there is a 0.5% chance that any given jsrun invocation will fail, and once one does fail, all future jsrun invocations in that bsub node allocation will fail as well. The ETA for a fix is an upgrade of the system currently scheduled for April (which means May or later).

At this point, I am not sure what to do about this GitHub Issue. We can't close this issue because it is not really resolved. But there is not really anything we can do about it.

I think I should just leave this "In Review" and put the "Stalled" label on it. I think the setup we have now, using cdash_analyze_and_report.py to filter out these failures, is okay for the time being, but it might be nice to filter these failures out of the test history as well, since they inflate the number of failures shown there. That will complicate that Python code some, but it might be worth it.

This is the most extreme case of system issues that we have had to deal with in the last 2 years.

bartlettroscoe commented 4 years ago

CC: @jjellio

FYI: As shown in this query, today in the build Trilinos-atdm-ats2-gnu-7.3.1-spmpi-2019.06.24_serial_static_dbg we saw 1267 tests that failed with the error message:

Error: error in ptssup_mkcltsock_afunix()
03-19-2020 04:05:50:783 40194 main: Error initializing RM connection. Exiting.

That is a different error message than the one we had been seeing before, which looked like:

Error: Remote JSM server is not responding on host vortex5902-19-2020 03:31:02:827 114114 main: Error initializing RM connection. Exiting.

What is missing is the string Error: Remote JSM server is not responding on host, which I was using to filter out these mass jsrun failures.

I will update the CDash analysis queries to filter out based on Error initializing RM connection. Exiting instead of Error: Remote JSM server is not responding on host.
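
A quick check against the two log excerpts quoted above confirms that the new string matches both variants (the heredoc lines below are just those log messages):

grep -c "Error initializing RM connection. Exiting" <<'EOF'
Error: Remote JSM server is not responding on host vortex5902-19-2020 03:31:02:827 114114 main: Error initializing RM connection. Exiting.
Error: error in ptssup_mkcltsock_afunix()
03-19-2020 04:05:50:783 40194 main: Error initializing RM connection. Exiting.
EOF
# prints 2 (one match per failure variant)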

bartlettroscoe commented 4 years ago

There was an interesting manifestation of the problem. As shown in:

the update, configure, build, and test results were missing for this build due to the lrun command failing with:

05:09:58 + env CTEST_DO_TEST=FALSE lrun -n 1 /vscratch1/jenkins/vortex-slave/workspace/Trilinos-atdm-ats2-gnu-7.3.1-spmpi-2019.06.24_serial_static_dbg/Trilinos/cmake/ctest/drivers/atdm/ctest-s-driver.sh
05:10:00 Error: Remote JSM server is not responding on host vortex5903-22-2020 03:10:00:000 68525 main: Error initializing RM connection. Exiting.

Now, there were also mass jsrun failures when the tests ran, as shown here, but what is interesting is that 545 of the tests actually passed! And several of those were np=4 MPI tests.

What this suggests is that it is not true that once the first jsrun command fails, all of the following jsrun commands will fail. If that were the case, then after the first jsrun command (the one driving the update, configure, build, and test results) failed, all of the following jsrun commands should have failed as well.

What this also shows is that we need to update the cdash_analyze_and_report.py tool to look on CDash for any missing results, including update, configure, build, or test results. If any of those are missing for a given build on CDash, then the build should be listed as "missing" and the missing parts should be listed in the "Missing Status" field. For this build, that would mean listing "Update", "Configure", and "Build" in the "Missing Status" field.
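
A minimal sketch of that missing-results check is below; the build.json file and its top-level keys are hypothetical stand-ins for whatever per-build data the CDash queries in cdash_analyze_and_report.py actually return:

# Hypothetical: build.json holds the per-build data pulled down from CDash.
missing=""
for part in update configure build test; do
  if [ "$(jq -r "has(\"${part}\")" build.json)" != "true" ]; then
    missing="${missing} ${part}"
  fi
done
if [ -n "${missing}" ]; then
  echo "Missing Status:${missing}"
fi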

bartlettroscoe commented 4 years ago

FYI: They closed:

assuming this was fixed by the upgrade of 'vortex' last month. But this was not resolved, so I created the new issue:

I think we are going to be living with this problem for the foreseeable future (so we might as well further refine our processes to deal with this better).

ghost commented 4 years ago

Sorry, this is Bing. I see the same error on our testbed with the LSF resource management system.

I came across this discussion and it seems the issue is still alive. Any suggestion/conclusion on this?

thanks.

bartlettroscoe commented 4 years ago

I see the same error on our testbed with the LSF resource management system.

@lalalaxla, as far as I know, these errors are unique to the ATS-2 system and the jsrun driver.

Any suggestion/conclusion on this?

No, this is still ongoing, as you can see from the mass random test failures shown here. But we just filter these out, and they don't do too much damage to our ability to do testing on this system and stay on top of the builds.

There is supposed to be a system upgrade in the near future that should resolve these issues. (Fingers crossed.)

ghost commented 4 years ago

I see, thanks. Then my question is: how do we filter out the error?

I am reporting this error from an Oak Ridge testbed system Tundra. It is a single rack with 18 nodes similar to Summit, but with reduced hardware (POWER8 CPUs, ½ the bandwidth and memory per node but similar NVMe SSDs).

We actually see the same error on Summit (https://docs.olcf.ornl.gov/systems/summit_user_guide.html). The error is reported on this page, you can search "Remote JSM server" to locate it.

bartlettroscoe commented 4 years ago

Then, my question is how to filter out the error?

@lalalaxla, are you reporting results to CDash? If so, and if you have a very recent version of CDash, then you can use the new "Test Output" filter on the cdash/queryTests.php page to filter them out. See above. You can see an example of these filters in action in:

But, again, you will need a very recent version of CDash. (I can provide the info on a safe recent version.)

ghost commented 4 years ago

Roscoe,

thanks! I will check internally to see if we take the same steps and have the most updated CDash. Will get back to you soon.

Bing

bartlettroscoe commented 4 years ago

FYI: The big upgrade of the software env and LSF on 'vortex' that occurred over the last few days did NOT fix these mass random 'jsrun' failures. Today, as shown here, the build Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_dbg had 1286 mass failures, as shown in this query.

So we live on with these mass random failures. But we are filtering them out okay so they are not terribly damaging to the testing process for Trilinos.

bartlettroscoe commented 3 years ago

After the sysadmins changed 'vortex' to use a private launch node by default for 'bsub', it seems that all of the random jsrun failures are gone. See the evidence below. For more context and info, see ATDV-402.


As shown in the CDash queries:

it looks like whatever they did to update 'vortex', all of the mass random test failures have gone away (or at least we have not seen any mass failures for over 2 weeks, with the last mass failure on 2020-10-21). Those queries show that in the 2 weeks starting 2020-10-22 and going through 2020-11-04, there were 78 + 26 = 104 Trilinos-atdm-ats2 builds that ran tests, and not one of them had any mass random test failures!

I think this is pretty good evidence that this issue is resolved.

bartlettroscoe commented 3 years ago

Shoot, it looks like we had another case of mass random jsrun failures on 2020-11-06 shown at:

showing:

Site: vortex
Build Name: Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_dbg_cuda-aware-mpi
Conf Err / Warn / Time: 0 / 4 / 4m 9s
Build Err / Warn / Time: 0 / 0 / 1m 31s
Test Not Run / Fail / Pass: 0 / 2295 / 27
Test Time / Test Proc Time: 47m 23s / 3h 6m 1s
Start Test Time: Nov 06, 2020 - 17:31 MST
Labels: (31 labels)

with 2295 failing tests matching this criterion, shown at:

I need to reopen this issue. And I will bring back the filters for this :-(

bartlettroscoe commented 3 years ago

Related to:

github-actions[bot] commented 2 years ago

This issue has had no activity for 365 days and is marked for closure. It will be closed after an additional 30 days of inactivity. If you would like to keep this issue open please add a comment and/or remove the MARKED_FOR_CLOSURE label. If this issue should be kept open even with no activity beyond the time limits you can add the label DO_NOT_AUTOCLOSE. If it is ok for this issue to be closed, feel free to go ahead and close it. Please do not add any comments or change any labels or otherwise touch this issue unless your intention is to reset the inactivity counter for an additional year.

github-actions[bot] commented 2 years ago

This issue was closed due to inactivity for 395 days.

bartlettroscoe commented 2 years ago

CC: @trilinos/framework

This is still occurring and it is failing Trilinos PR builds now. See https://github.com/trilinos/Trilinos/pull/10648#issuecomment-1164470776.

bartlettroscoe commented 2 years ago

Reopening

e10harvey commented 2 years ago

The Framework team is in the process of migrating off of ats2.

bartlettroscoe commented 2 years ago

A new internal issue was opened for this on 6/6/2022.

The word is that this will never be fixed on ATS2 so we just need to live with this.

One could address this by grepping the *.xml files produced by ctest for "Error initializing RM connection. Exiting" and, if found, getting a new allocation and running the failed tests over again. (But that adds a lot of complexity to the testing process.) Better to just not run PR builds on this type of machine. (But we knew this was an issue with ATS2 over 2 years ago, as per this GitHub Issue. It is not that big of an issue for nightly testing, but it is more serious for PR testing.)
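
A minimal sketch of that grep-and-retry idea (the Testing/<tag>/Test.xml layout is standard ctest, but the node count and the single retry are assumptions):

# After the test pass, look for the jsrun failure signature in what ctest
# will submit to CDash; if found, rerun just the failed tests in a fresh
# allocation.
if grep -q "Error initializing RM connection. Exiting" Testing/*/Test.xml; then
  echo "jsrun/JSM failures detected; rerunning failed tests in a new allocation"
  bsub -K -nnodes 1 ctest --rerun-failed
fi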

bartlettroscoe commented 1 year ago

Note: this query, showing all PR builds on 'vortex' between 2022-08-01 and 2022-08-16, shows 129 PR builds on 'vortex', while this query, showing builds on 'vortex' with over 50 test failures, shows 7 builds with mass random failures due to this defect on the ATS-2 system where the jsrun command shows the "Error initializing RM connection" error. That is a failure probability of about 5% (7/129). That may not seem like much, but when you combine it with all of the other random failures going on in Trilinos, they add up.

For example, this was the only failure in the last PR iteration https://github.com/trilinos/Trilinos/pull/10827#issuecomment-1216214989 and it would have passed 100% if not for that error.

bartlettroscoe commented 1 year ago

I am going to go ahead and pin this issue because this is still taking out PR builds and people are still asking questions about this.