trilinos / Trilinos

Primary repository for the Trilinos Project
https://trilinos.org/
Other
1.21k stars 566 forks source link

Phalanx: array out of bounds error in warning-free nightly build #12418

Closed ikalash closed 11 months ago

ikalash commented 1 year ago

Our warning-free nightly build on Blake for Albany has the following error:

/home/projects/albany/nightlyCDashTrilinosBlake/build-gcc/TrilinosDebugInstallGccNoWarn/include/Phalanx_DataLayout_MDALayout_Def.hpp:602:20: error: array subscript 1 is above array bounds of 'const Kokkos::Serial::size_type [1]' {aka 'const long unsigned int [1]'} [-Werror=array-bounds]

https://sems-cdash-son.sandia.gov/cdash/viewBuildError.php?buildid=53327

Could someone please have a look?

@trilinos/phalanx

rppawlo commented 1 year ago

@ikalash - the last change to phalanx src code was in July. I would need more information about how the data layouts are being generated in Albany. There was a big set of changes pushed to intrepid2 yesterday. Maybe those have broken the data layout generation in Albany? Just a guess.

mperego commented 1 year ago

This is a new build on Blake. The error was already there on Oct 13. I don't think is related to Intrepid2. Irina, could it be an Albany issue?

mperego commented 1 year ago

@rppawlo I think the issue is that this code https://github.com/trilinos/Trilinos/blob/develop/packages/phalanx/src/Phalanx_KokkosView_CreateView.hpp#L80-L83 is compiled also when the rank is 1, and that's why the compiler complains that we are accessing an array out of bounds when calling dl.dimension(1).

rppawlo commented 1 year ago

Can you send me instructions to reproduce on blake? Do the phalanx tests build and run?

mperego commented 1 year ago

I think you can use these modules and scripts (Irina, please check): https://github.com/sandialabs/Albany/blob/master/doc/dashboards/blake.sandia.gov/blake_gcc_modules.sh https://github.com/sandialabs/Albany/blob/master/doc/dashboards/blake.sandia.gov/do-cmake-trilinos-gcc-debug

I don't think we build Phalanx tests.

I'm not very familiar with constexpr, but it might be possible to declare rank() as a const expression in the MDALayout:

constexpr PHX::Device::size_type 
PHX::MDALayout<Tag0,Tag1,Tag2,Tag3,Tag4,Tag5,Tag6,Tag7>::rank() const
{ return Rank; }

and then use if constexpr(dl.rank() == 1) so that only the body of the correct condition gets compiled

rppawlo commented 1 year ago

@mperego - that was my first thought too. But I just used the albany build and all phalanx and panzer tests/examples build successfully and pass testing. I'd need to see the albany code and probably dump the runtime dag to see what is going on.

ikalash commented 1 year ago

Those should be the correct scripts to use. Thanks for looking at it @rppawlo . I don’t know if ohalan tests pass. I don’t think they are enabled in the nightlies.

ikalash commented 1 year ago

@rppawlo : any updates on this issue? Unfortunately it is still happening in our nightlies: https://sems-cdash-son.sandia.gov/cdash/viewBuildError.php?buildid=53488 .

rppawlo commented 1 year ago

@ikalash - I used the albany build scripts and all of the phalanx and panzer tests build and pass on blake. I suspect this is something in Albany or it could be a use case not covered under phalanx testing. I need to build albany to figure this out. I will try to do it next week. Haven't built albany in probably 5 years...

bartgol commented 1 year ago

@rppawlo I have a very small reproducer, which shows some hint toward what Mauro suggests (i.e., the compiler can capture some type at compile time rather than deferring until run time). I put all the files on blake, in /home/projects/albany/forRoger (src and bld folders). After you load all the modules according to the file linked by Mauro, you can run the do-cmake.sh script in bld, and build.

Now, regarding the source code. Originally, I put all the stuff in the main file, and got the error. What striked me was that the compiler should not know that the layout has 1 dimension, since it gets it from the (runtime) FieldTag obj. But since the layout is created in the same scope where createView is called, I wondered whether the compiler was smart enough to pass static info (the fact that the layout has rank 1) to the instantiated createView call, trying to compile away stuff (even though it's odd for a DEBUG build). So I moved the creation of the layout in a second cpp file. In main.cpp you can see

  auto dl = Teuchos::rcp(new PHX::MDALayout<Dummy>(0));   // Will get the error!
  //auto dl = get_layout ();                                // Will work!
  PHX::Tag<ScalarT> tag("hello world!", dl);

If you swap the comment between the first two lines, voila', it builds!

So this suggests Mauro was on the right track. I'm not entierly sure why the compiler can use compile-time info from MDALayout when accessing t.dataLayout().dimension(X) in createView, but that's what I think it's happening.

The fact that a small reproducer like this can show the behavior tells me this is not a compiler error, but maybe a "compiler feature" (annoying as it may be). If Mauro is right, using if constexpr when trying to access the layouts dimensions may fix it. I can try to do it myself tomorrow...

Let me know what you think.

ikalash commented 1 year ago

Thanks for looking into this, @bartgol ! I have a marginally-related question. I just had a look on blake into why the Trilinos build is being reported as failed on blake with the warning-free configure options, and I think it's because there are warnings encountered in several Trilinos packages (seacas, ml, aztecoo, zoltan). I am attaching the output of the build. Do you agree that this is the problem? CDash is reporting something different but I think it's not right: https://sems-cdash-son.sandia.gov/cdash/viewBuildError.php?buildid=53585 . If so and if Trilinos supports a warning-free build, should we option a separate issue on the warnings?

nightly_log_blakeTrilinosGccDebug.txt

rppawlo commented 1 year ago

Thanks @bartgol ! Thats great news! I'm a little worried that the if constexpr will work for all of phalanx. The rank() in the if statement in the view allocator is a runtime parameter in general. if constexpr have to be evaluated at compile time. It might work for the issue above where a compiletime parameter sneaks through, but in general I suspect it will not work, especially for DynRankViews (if I understand your discussion above). I'm close to having the refactor of the MDALayout for variadic templates complete. That could be a path forward as well.

bartgol commented 1 year ago

@rppawlo Yes, based on what I observed and assuming we're right about the compiler attempting to compile away code, I think the variadic templates should remove this issue.

Let me know if I can help in any way.

rppawlo commented 1 year ago

The variadic templates is done. I'm running some tests now - building against panzer with various compilers. I'll put up the PR shortly and test on blake if everything passes.

ikalash commented 1 year ago

@rppawlo thanks! I'm happy to test this once the PR is up to verify that Albany builds cleanly with the fixed.

rppawlo commented 1 year ago

@ikalash - I've got an albany build going on blake now. Will post results soon. I will not merge this right away - also want to test against drekar, charon and empire. It's a big change to a fundmental class, want to be very careful with it.

ikalash commented 1 year ago

@rppawlo : ok sounds good. Thanks again!

rppawlo commented 1 year ago

Well, that did not fix the issue on blake. I will have to rewrite the viewCreate routine to split into separate functions based on rank. Will try to get to this next week. I will still push the layout changes after further testing.

rppawlo commented 1 year ago

Well, this is funny. The refactor meant to fix the issue on blake is now causing the gnu PR build to fail with the same error in phalanx tests. At least this will be faster to fix now that we don't have to deal with horrific filesystem on blake or pull Albany into the build.

mcarlson801 commented 11 months ago

Has this been resolved? It looks like there is a different array-bounds warning-as-error showing up on our blake build now so I figured I'd double check if it is related to this issue before I dig into it.

ikalash commented 11 months ago

Thanks for checking @mcarlson801 . There are still phalanx-related errors in the blake warning free build unfortunately. It's possible they are different than before. Thanks for looking into it!

rppawlo commented 11 months ago

The phalanx issue was fixed on blake. There were other albany failures not associated with phalanx that still need to be addressed on that platform.

rppawlo commented 11 months ago

@ikalash - I did not see any more from phalanx when fixing the first issue. Can you post the new failure?

ikalash commented 11 months ago

Here is the error now: https://sems-cdash-son.sandia.gov/cdash/viewBuildError.php?buildid=54316 . It's actually now about Sacado.

ikalash commented 11 months ago

Should I close this issue and open a new one?

mcarlson801 commented 11 months ago

@ikalash I think the current error is actually a bug related to an Albany enthalpy evaluator. Just wanted to double check it wasn't related to this.

ikalash commented 11 months ago

Yes, the problem seems to be coming from the enthalpy evaluator. I agree that it seems unrelated to this issue. Maybe you could open a new issue if you are looking at this, and we can close this one?