trilinos / Trilinos

Primary repository for the Trilinos Project
https://trilinos.org/
Other
1.17k stars 559 forks source link

Tpetra: RCP Thread-safety issues with parallel host backends without setting Teuchos_ENABLE_THREAD_SAFE=ON #11921

Open ndellingwood opened 1 year ago

ndellingwood commented 1 year ago

Bug Report

@trilinos/tpetra

This issue is to relay info from https://github.com/kokkos/kokkos/issues/6082 regarding some RCP thread-safety issues encountered testing Tpetra with Kokkos@develop branch with the OpenMP backend (a couple different testing configurations are posted in the issue)

The problem seems to occur with kernels that capture RCPs, creating multiple copies within OpenMP parallel regions which is not thread-safe without setting Teuchos_ENABLE_THREAD_SAFE=ON , e.g. https://github.com/trilinos/Trilinos/blob/cd2c0a141460b8c8ad028a6034e1333d137b3849/packages/tpetra/core/src/Tpetra_CrsGraph_def.hpp#L5893-L5969

A couple potential options to resolve this could be to set Teuchos_ENABLE_THREAD_SAFE=ON as default, or make sure no RCPs are passed to Kokkos' parallel_* kernels, or ?

Adding @masterleinad who helped triage this

jhux2 commented 1 year ago

@trilinos/teuchos

bartlettroscoe commented 1 year ago

One would have to do some profiling to see the impact of turning on thread safety for RCP, especially in these low-level routines. Otherwise, I don't have a strong opinion one way or the other.

masterleinad commented 1 year ago

Related to https://github.com/trilinos/Trilinos/issues/11931.

nmm0 commented 1 year ago

Even if Teuchos_ENABLE_THREAD_SAFE=ON has too much of a performance implication if enabled globally, it's still necessary to make sure that anything accessed in a parallel kernel is thread safe. So either in this case we need to make sure the RCP is not passed in to the kernel (so the refcount won't race) or to correctly guard RCP in this instance with some sort of lock.

On the Kokkos side, we can't guarantee that thread-unsafe code works.

csiefer2 commented 1 year ago

Would a configure-time check in Tpetra requiring Teuchos_ENABLE_THREAD_SAFE=ON if you're using a host-threaded backend work for people? @masterleinad @ndellingwood @bartlettroscoe

If someone can think of a code-based way to ensure no RCP capture by kernels, that could work too.

masterleinad commented 1 year ago

Would a configure-time check in Tpetra requiring Teuchos_ENABLE_THREAD_SAFE=ON if you're using a host-threaded backend work for people?

Sounds reasonable if it doesn't break users' configurations.

crtrott commented 1 year ago

We do need a decision on this: we do not promise not to make a copy of the functor per thread inside backends, and I think it would be a serious implementation constraint issue if we did.

csiefer2 commented 1 year ago

As per discussions with @crtrott the most reasonable path forward in the short term is:

Have Teuchos check to see if a threaded Kokkos backend is enabled and require Teuchos_ENABLE_THREAD_SAFE=ON in that case.

Longer term we should consider making Teuchos_ENABLE_THREAD_SAFE=ON the default and make the user turn if off manually if they're sure they don't need it.

@bartlettroscoe is the short term plan OK?

bartlettroscoe commented 1 year ago

@bartlettroscoe is the short term plan OK?

@csiefer2, that sounds reasonable to me.

I just wish Trilinos had a comprehensive and robust performance test suite so we could see the performance impact this will have (before we hear it from users if there is an issue with some algorithms). But note that I suspect that a good bit of this performance critical software could be refactored to use semi-persisting associations (using Teuchos::Ptr<T> instead of Teuchos::RCP<T>) and avoid passing RCPs into such code (see section "5.12.3 Performance tuning strategies, semi-persisting associations" in here).

csiefer2 commented 1 year ago

@bartlettroscoe I can do a PR and mark you as a reviewer.

Our current performance suite is really just designed to spot app-specific regressions, rather than absolutely everything. So, in theory, if this effects apps which are tested we should see performance differences there. You know, in theory ;)

bartlettroscoe commented 1 year ago

@bartlettroscoe I can do a PR and mark you as a reviewer.

👍

csiefer2 commented 1 year ago

@ndellingwood My code change to Teuchos merged. Does this meet your needs?

ndellingwood commented 1 year ago

@csiefer2 mostly, thanks for working on this! There were some changes to kokkos-kernels that need a matching PR to the kokkos-kernels repo (to avoid clobber with the next release), once that goes in we should be in good shape

I enabled Teuchos_ENABLE_THREAD_SAFE=ON in our nightly OpenMP builds to verify the setting would resolve the failures in https://github.com/kokkos/kokkos/issues/6082, so these changes should take care of it. Thanks!

ndellingwood commented 1 year ago

Matching patch with the changes from https://github.com/trilinos/Trilinos/commit/21e643d7566c5afb5b882641378e9237a1681b70 to kokkos-kernels is up https://github.com/kokkos/kokkos-kernels/pull/1854

bartlettroscoe commented 1 year ago

@csiefer2 mostly, thanks for working on this! There were some changes to kokkos-kernels that need a matching PR to the kokkos-kernels repo (to avoid clobber with the next release), once that goes in we should be in good shape

@ndellingwood, sorry I did not follow up on that right after I made the change and thanks for catching this. (Not sure why this was not caught in the testing of PR #11863).

ndellingwood commented 1 year ago

@bartlettroscoe no worries, thanks for the quick fix with https://github.com/trilinos/Trilinos/commit/21e643d7566c5afb5b882641378e9237a1681b70 👍

ndellingwood commented 1 year ago

https://github.com/kokkos/kokkos-kernels/pull/1854 is merged In an OpenMP nightly build (gcc/8.3 on Power9) with Kokkos Core+Kernels develop branches there was a performance test failure with TeuchosCore_RCP_PerformanceTests_basic_MPI_1 following merge of this PR. The build does not run unit tests with parallelism (i.e. ctest -j1). If the test is enabled by PR testing should it be disabled to avoid spurious failures? @bartlettroscoe

bartlettroscoe commented 1 year ago

In an OpenMP nightly build (gcc/8.3 on Power9) with Kokkos Core+Kernels develop branches there was a performance test failure with TeuchosCore_RCP_PerformanceTests_basic_MPI_1 following merge of this PR.

@ndellingwood, can you run make dashboard and post this to CDash?

ndellingwood commented 1 year ago

@ndellingwood, can you run make dashboard and post this to CDash?

@bartlettroscoe I don't have the nightly build configured to post to CDash, it's on my list of TODOs. Is it helpful if I point you to the Jenkins build for now? https://jenkins-son.sandia.gov/job/KokkosEco_Trilinos_Weaver_Gcc830_OpenMP_opt/251/consoleFull

github-actions[bot] commented 2 weeks ago

This issue has had no activity for 365 days and is marked for closure. It will be closed after an additional 30 days of inactivity. If you would like to keep this issue open please add a comment and/or remove the MARKED_FOR_CLOSURE label. If this issue should be kept open even with no activity beyond the time limits you can add the label DO_NOT_AUTOCLOSE. If it is ok for this issue to be closed, feel free to go ahead and close it. Please do not add any comments or change any labels or otherwise touch this issue unless your intention is to reset the inactivity counter for an additional year.