nimble-dev / nimble

The base NIMBLE package for R
http://R-nimble.org
BSD 3-Clause "New" or "Revised" License
160 stars 24 forks source link

sporadic `test-waic` failure (possibly only on MacOS) #1368

Open paciorek opened 11 months ago

paciorek commented 11 months ago

We're seeing this test failure, possibly only on MacOS:

══ Failed ══════════════════════════════════════════════════════════════════════
── 1. Error ('test-waic.R:585:5'): standalone offline WAIC works ───────────────
Error in `FUN(X[[i]], ...)`: no such symbol new_modelValuesConf_UID_172 in package /private/var/folders/3s/vfzpb5r51gs6y328rmlgzm7c0000gn/T/RtmpSNASp7/nimble_generatedCode/P_12_waicClass_offline_12_12_21_42_16.so
Backtrace:

I was able to reproduce this (fairly reliably but not every time I ran the specific test) on my laptop. Here's what seems to be going on:

Here's what we expect our system2 call to R CMD SHLIB to do:

clang++ -arch arm64 -std=gnu++17 -I"/Library/Frameworks/R.framework/Resources/include" -DNDEBUG -DR_NO_REMAP   -DEIGEN_MPL2_ONLY=1 -I"/private/tmp/nim-102/nimble/include" -Wno-misleading-indentation -Wno-ignored-attributes -Wno-deprecated-declarations  -I/opt/R/arm64/include    -fPIC  -falign-functions=64 -Wall -g -O2  -c P_11_waicClass_offline.cpp -o P_11_waicClass_offline.o
clang++ -arch arm64 -std=gnu++17 -dynamiclib -Wl,-headerpad_max_install_names -undefined dynamic_lookup -L/Library/Frameworks/R.framework/Resources/lib -L/opt/R/arm64/lib -o P_11_waicClass_offline_12_12_14_04_13.so P_11_waicClass_offline.o -L/private/tmp/nim-102/nimble/CppCode -lnimble -L/Library/Frameworks/R.framework/Resources/lib -lRlapack -L/Library/Frameworks/R.framework/Resources/lib -lRblas -F/Library/Frameworks/R.framework/.. -framework R -Wl,-framework -Wl,CoreFoundation

As indicated above only the second clang++ call occurs in the cases when the error occurs, despite the SHLIBargs in cppProjectClass$compileFile being the same.

Not sure we want to go this route but using R CMD SHLIB --preclean seems to avoid the issue, though since it is sporadic, I can't 100% guarantee it. But in many repeated tries, it no longer had the error. Or perhaps we would generally want to use R CMD SHLIB --clean with the thought that we don't need the various .o files anyway so cleaning them up could make good sense generally.

I'm hoping that because this seems to occur with repeated compilations in this specific situation where the same .cpp/.h/.o file names are used repeatedly that it is not something users would encounter. I don't think we've gotten similar reports though I have a vague feeling we've seen "no such symbol" report(s) before.

Before I do anything more with this, I'll plan to discuss with @perrydv .

perrydv commented 10 months ago

@paciorek I have spent a while digging into this. I'm as baffled as you are. From a shell I am using symbols P_1_waicClass_offline.o | grep modelValues to see the relevant symbols in the latest compiled result. This also works with a .so file as input. I can see the state that occurs when, after our R CMD SHLIB call, the "new_modelValuesConf_UID_XX" symbol has XX (some number from our counter) that is one behind what is in the .h and .cpp files. Essentially this is just finding a way to look more directly, from outside of R, and see the mismatched state where the .o has evidently not been updated.

Also nimbleOptions(pauseAfterWritingFiles=TRUE) has been helpful in this digging.

I did reproduce this problem on 1.0.1 (current release). I am wondering if it has to do with changes in R or OS X tool chains, unlikely as that seems. I do have ccache running, but we are seeing this on your system and @danielturek 's system and the CI build systems, so unless they all have ccache running, that doesn't seem likely to be the culprit.

I like the "--preclean" option. It would be easy to insert. It would be a big change in that it would touch every compilation call we make. But it also makes sense.

What do you think?

paciorek commented 10 months ago

Yeah, I think your debugging/digging steps were similar to mine.

I'm a bit wary of making the --preclean option universal given we haven't heard any reports of this apart from our testing, though I agree it makes sense so maybe it would just be a good thing to do.

One thought is that we set up a nimble option to cause --preclean to be invoked but have it default to FALSE. For testing we can set it to TRUE (perhaps just for the problematic waic testing but perhaps more broadly) and it's in our back pocket in case we start seeing reports of the problem from users.

perrydv commented 10 months ago

Here is a sporadically reproducible nearly-minimal example entirely outside of nimble.

As file1.cpp:

#include "file1.h"

SEXP foo_1() {
  Rprintf("hello world\n");
  return R_NilValue;
}

As file1.h:

#define R_NO_REMAP
#include "R.h"
#include "Rinternals.h"
#include "Rdefines.h"

extern "C" SEXP foo_1();

As file2.cpp:

#include "file2.h"

SEXP foo_2() {
  return R_NilValue;
}

as file2.h

#include "R.h"
#include "Rinternals.h"
#include "Rdefines.h"

extern "C" SEXP foo_2();

Then from R, run:

system2("rm", "test*.*")

system2("cp", c("file1.cpp","test.cpp"))
system2("R", c("CMD SHLIB", "test.cpp", "-o test1.so"))
system2("symbols", c("test1.so", "| grep foo"))
dll1 <- dyn.load("test1.so", local=TRUE)
getNativeSymbolInfo("foo_1", dll1) # should be valid

system2("cp", c("file2.cpp","test.cpp"))
system2("R", c("CMD SHLIB", "test.cpp", "-o test2.so"))
system2("symbols", c("test2.so", "| grep foo"))
dll2 <- dyn.load("test2.so", local=TRUE)
getNativeSymbolInfo("foo_2", dll2) # should be valid, sometimes isn't
getNativeSymbolInfo("foo_1", dll2) # should not be valid, sometimes is

The .h and .cpp files might not be fully minimal yet, but close.

paciorek commented 10 months ago

Nice - extracting this out of the nimble context is a big step forward.

One thought here is that I could run this by our SCF sysadmin (Ryan Lovett).

I'm also going to try running on my new M2 Mac and see what happens.

paciorek commented 10 months ago

Ok, I've reproduced this on my M2 Mac.

If I insert a 3-second sleep before the file2.cpp compilation, everything seems fine. So it feels like a timestamp kind of thing.

paciorek commented 10 months ago

I think the following reproduces things using make.

Here's the Makefile:

test.o: test.cpp
    clang++ -arch arm64 -std=gnu++17 -I"/Library/Frameworks/R.framework/Resources/include" -DNDEBUG   -I/opt/R/arm64/include    -fPIC  -falign-functions=64 -Wall -g -O2  -c test.cpp -o test.o

test2.so: test.o
    clang++ -arch arm64 -std=gnu++17 -dynamiclib -Wl,-headerpad_max_install_names -undefined dynamic_lookup -L/Library/Frameworks/R.framework/Resources/lib -L/opt/R/arm64/lib -o test2.so test.o -F/Library/Frameworks/R.framework/.. -framework R -Wl,-framework -Wl,CoreFoundation

Here's the reprex:

rm test*.*

cp file1.cpp test.cpp
R CMD SHLIB test.cpp -o test1.so
symbols test1.so | grep foo

cp file2.cpp test.cpp
make test2.so
symbols test2.so | grep foo
paciorek commented 10 months ago

Actually, I no longer think the sleep changes things. However if I manually edit test.cpp instead of using cp then things seem fine.

Ryan is also flummoxed.

paciorek commented 10 months ago

And one more thing -- I can reproduce the problem on Linux.

paciorek commented 10 months ago

I think I've changed my mind and agree with the idea of using --preclean by default given we seem to have isolated this to strange behavior of make.

perrydv commented 10 months ago

Here's more experimental evidence that the issue has to do with timestamps. The following code runs 100 iterations of the copy-compile scheme I set up above. For each one, it records whether the second compilation was properly done (the .so contains foo_2) or was not done (the .so still contains foo_1). If the Sys.sleep(1) is commented-out, in about 1/3 of iterations (33/100), the result is wrong (compilation not done).

When I extract timestamps using stat, we see that, in all cases when the timestamp of test.cpp (after cp file2.cp test.cpp) is tied with that of the previous test.o, the compilation is not updated (wrong). On the other hand, in many (but not all) cases when those two timestamps do not match, the compilation is updated (right). I speculate that the imperfect match of the latter is due to a "quantum" problem where the steps of the commands take bits of time themselves so observing them is imperfect. But it could just as well be that I am not sure if I am looking at the right one of the available times. I can't find much easy information from search results on getting it below the precision of whole seconds.

When Sys.sleep(1) is uncommented below, the problem never occurs.

I am somewhat baffled that the timing mechanisms used for builds can have this kind of problem/phenomenon. It does seem that --preclean may be the intended way to ensure we force compilation.

stat_test_cpp <- character(100)
stat_test_o <- character(100)
found_foo1 <- logical(100)
found_foo2 <- logical(100)
for(i in 1:100) {
  system2("rm", "test*.*")
  dyn.unload("test1.so")
  dyn.unload("test2.so")

  system2("cp", c("file1.cpp","test.cpp"))
  system2("R", c("CMD SHLIB", "test.cpp", "-o test1.so"))
  dll1 <- dyn.load("test1.so", local=TRUE)

#  Sys.sleep(1.0)
  system2("cp", c("file2.cpp","test.cpp"))
  system2("R", c("CMD SHLIB", "test.cpp", "-o test2.so"))
  dll2 <- dyn.load("test2.so", local=TRUE)
  find_foo2 <- try(getNativeSymbolInfo("foo_2", dll2)) # should be valid, sometimes isn't
  find_foo1 <- try(getNativeSymbolInfo("foo_1", dll2)) # should not be valid, sometimes is

  found_foo2[i] <- !inherits(find_foo2, "try-error")
  found_foo1[i] <- !inherits(find_foo1, "try-error")

  system2("stat","test.cpp > stat_test_cpp.txt")
  stat_test_cpp[i] <- readLines("stat_test_cpp.txt")
  system2("stat","test.o > stat_test_o.txt")
  stat_test_o[i] <- readLines("stat_test_o.txt")
}
sum(found_foo1) # how many times did compilation NOT occur (wrong)
sum(found_foo2) # how many time did compilation occur (right)

# pull out "birthtime" timestamps and see when they match between the two stat calls
stat_test_cpp_split <- lapply(stat_test_cpp, \(x) strsplit(x, "\""))
stat_test_o_split <- lapply(stat_test_o, \(x) strsplit(x, "\""))
matches <- mapply(\(x1, x2) identical(x1[[1]][8], x2[[1]][8]), stat_test_cpp_split, stat_test_o_split)

cbind(matches, found_foo1) # align when the timestamps match and when foo_1 was found (indicating that compilation did not occur)
paciorek commented 10 months ago

I'm preparing a branch with --preclean where that is shielded by a nimbleOption but defaults to TRUE.

paciorek commented 9 months ago

Just a note that at least for Laplace, this adds an extra, unneeded compilation of nimbleCppADbaseClass.cpp that, for the crossed random effects example in test-ADlaplace.R adds an additional 20 seconds of C++ compilation time, taking the total time from something like 60 sec. to something like 80 sec on my Linux machine.

g++ -std=gnu++17 -I"/usr/share/R/include" -DNDEBUG -DR_NO_REMAP   -DEIGEN_MPL2_ONLY=1 -I"/tmp/nim-devel/nimble/include" -Wno-misleading-indentation -Wno-ignored-attributes -Wno-deprecated-declarations      -fpic  -g -O2 -ffile-prefix-map=/build/r-base-H0vbME/r-base-4.3.2=. -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2  -c nimbleCppADbaseClass.cpp -o nimbleCppADbaseClass.o

So at some point we might want to look more carefully at managing what is recompiled more directly.