trilinos / Trilinos

Primary repository for the Trilinos Project
https://trilinos.org/
Other
1.21k stars 565 forks source link

ArrayView in DEBUG build gives (probably) erroneous RCP message #184

Closed ambrad closed 8 years ago

ambrad commented 8 years ago

Consider the following short program:

#include "Tpetra_Map.hpp"
#include "Teuchos_ArrayView.hpp"
#include "Tpetra_DefaultPlatform.hpp"
#include "Teuchos_GlobalMPISession.hpp"

typedef double ST;
typedef int GO;
typedef int LO;
typedef Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial> KokkosNode;
typedef Tpetra::Map<LO, GO, KokkosNode> Tpetra_Map;

int main (int argc, char** argv) {
  Teuchos::GlobalMPISession mpiSession(&argc, &argv);
  auto comm = Tpetra::DefaultPlatform::getDefaultPlatform().getComm();
  static const int N = 10;
#if 1
  // In a Trilinos DEBUG build, fails with RCP message.
  GO* ids = new GO[N];
  for (int i = 0; i < N; ++i) ids[i] = i;
  Teuchos::ArrayView<const GO> ids_av = Teuchos::arrayView(ids, N);
  Tpetra_Map m(N, ids_av, 0, comm);
  delete[] ids;
#else
  // OK (and better, anyway).
  Teuchos::Array<GO> ids(N);
  for (int i = 0; i < N; ++i) ids[i] = i;
  Tpetra_Map m(N, ids, 0, comm);
#endif
}

In the #if 1 branch, the program fails with:

terminate called after throwing an instance of 'Teuchos::DanglingReferenceError'
  what():  /home/ambradl/bigcode/tril/Trilinos/install-shared-dbg/include/Teuchos_RCPNode.hpp:605:

Throw number = 1

Throw test that evaluated to true: true

Error, an attempt has been made to dereference the underlying object
from a weak smart pointer object where the underling object has already
been deleted since the strong count has already gone to zero.

The #if 0 branch is fine. This appears to be a bug in ArrayView in a DEBUG build.

mhoemmen commented 8 years ago

As a temporary work-around, consider supplying the nondefault value of the ArrayView constructor's third argument, that tells it not to track memory. I think it's Teuchos::RCP_DISABLE_NODE_LOOKUP but try it to be sure. Change the line of code that calls Teuchos::arrayView to the following:

Teuchos::ArrayView<const GO> ids_av (ids, N, Teuchos::RCP_DISABLE_NODE_LOOKUP);
bartlettroscoe commented 8 years ago

As a temporary work-around, ...

No need for a hack. The first I heard of this was Andrew's email just earlier today. Give me a little time to fix this correctly.

I have produced a unit test in Teuchos that reproduces the problem but I am having trouble debugging the issue. I am trying to work on the machine hansen with the default env devpack/openmpi/1.10.0/gcc/4.8.4/cuda/7.5.18. I can't get BinUtils to enable so I can't generate a stack trace. When I run GDB on the code, it can't get it to print anything. This system is impossible to work on. The only way that I am going to be able to debug on this machine is to write print statements.

I have pushed this to the branch array-view-raw-const-184 in my GitHub fork of Trilinos.

I will move to an old machine at ORNL where I believe that BinUtils and GDB work. I have not been able to find a single usable machine at Sandia on which to do development yet.

mhoemmen commented 8 years ago

No need for a hack. The first I heard of this was Andrew's email just earlier today. Give me a little time to fix this correctly.

No hurry :-) Just wanted to document a possible work-around. There's some old Bugzilla issue (closed?) that addresses this too, but I can't find it at the moment (searches for RCP_ENABLE_NODE_LOOKUP or RCP_DISABLE_NODE_LOOKUP turn up nothing).

bartlettroscoe commented 8 years ago

I believe I have the fix. It is a oneline fix. Turns out the simple workaround would have been to write:

Teuchos::ArrayView<const GO>(ids, N);

instead of:

Teuchos::ArrayView<const GO> ids_av = Teuchos::arrayView(ids, N);

The problem with the latter was with the getConst() conversion function going from a ArrayView to an ArrayView which was incorrectly created a weak reference (where there was no need to do so).

I am running the checkin-test.py script now to fully test and push (only the --default-builds).

mhoemmen commented 8 years ago

Ah, that's right, I recall that you had fixed the RCP_DISABLE_NODE_LOOKUP thing for the ArrayView constructor a while back, but it looks like the fix didn't get propagated to the nonmember "constructor" Teuchos::arrayView (case is significant). Thanks Ross!

bartlettroscoe commented 8 years ago

This should now be resolved. However, note that the reason that this defect was not caught before was code like:

  GO* ids = new GO[N];
  for (int i = 0; i < N; ++i) ids[i] = i;
  Teuchos::ArrayView<const GO> ids_av = Teuchos::arrayView(ids, N);
  Tpetra_Map m(N, ids_av, 0, comm);
  delete[] ids;

is very bad practice. It will leak memory if a throw occurs and it is generally much slower than using:

  Teuchos::Array<GO> ids(N);
  for (int i = 0; i < N; ++i) ids[i] = i;
  Tpetra_Map m(N, ids, 0, comm);

The first block of code violates the C++ coding standard (by Sutter and Alexandrescu) #13 "Ensure resources are owned by objects. Use explicit RAII and smart pointers."

But regardless, if you find any other defects with these classes, please let me know ASAP so that tye can be fixed. Now that I am back at SNL, I have funding to fix this stuff.

bartlettroscoe commented 8 years ago

Ah, that's right, I recall that you had fixed the RCP_DISABLE_NODE_LOOKUP thing for the ArrayView constructor a while back, but it looks like the fix didn't get propagated to the nonmember "constructor" Teuchos::arrayView (case is significant)

Just for the record, this had nothing to do with the arrayView() function at all. The issue was the conversion from ArrayView<T> to ArrayView<const T>. That is were things got messed up (and we did not have a unit test for this use case). Basically, if you directly create an ArrayView object from a raw pointer, then there should be no operation that you can do with just ArrayView objects that should be able to create a dangling reference exception, period. It is only when you create an ArrayView view of a persisting storage object like Array or ArrayRCP that should be able to generate this error. If we find cases where this is not true, then we need to add unit tests and fix this.

mhoemmen commented 8 years ago

This should not be resolved.

You mean "this should be resolved" right? :-)

bartlettroscoe commented 8 years ago

Not => now

Common typos seem to make.

-Ross


From: Mark Hoemmen notifications@github.com Sent: Tuesday, March 8, 2016 8:28:35 PM To: trilinos/Trilinos Cc: Bartlett, Roscoe A Subject: [EXTERNAL] Re: [Trilinos] ArrayView in DEBUG build gives (probably) erroneous RCP message (#184)

This should not be resolved.

You mean "this should be resolved" right? :-)

Reply to this email directly or view it on GitHubhttps://github.com/trilinos/Trilinos/issues/184#issuecomment-194055861.

mhoemmen commented 8 years ago

No worries, just wanted to make sure :-)