Open paciorek opened 11 months ago
@paciorek I have spent a while digging into this. I'm as baffled as you are. From a shell I am using symbols P_1_waicClass_offline.o | grep modelValues
to see the relevant symbols in the latest compiled result. This also works with a .so file as input. I can see the state that occurs when, after our R CMD SHLIB
call, the "new_modelValuesConf_UID_XX" symbol has XX (some number from our counter) that is one behind what is in the .h and .cpp files. Essentially this is just finding a way to look more directly, from outside of R, and see the mismatched state where the .o has evidently not been updated.
Also nimbleOptions(pauseAfterWritingFiles=TRUE)
has been helpful in this digging.
I did reproduce this problem on 1.0.1 (current release). I am wondering if it has to do with changes in R or OS X tool chains, unlikely as that seems. I do have ccache running, but we are seeing this on your system and @danielturek 's system and the CI build systems, so unless they all have ccache running, that doesn't seem likely to be the culprit.
I like the "--preclean" option. It would be easy to insert. It would be a big change in that it would touch every compilation call we make. But it also makes sense.
What do you think?
Yeah, I think your debugging/digging steps were similar to mine.
I'm a bit wary of making the --preclean
option universal given we haven't heard any reports of this apart from our testing, though I agree it makes sense so maybe it would just be a good thing to do.
One thought is that we set up a nimble option to cause --preclean
to be invoked but have it default to FALSE. For testing we can set it to TRUE (perhaps just for the problematic waic testing but perhaps more broadly) and it's in our back pocket in case we start seeing reports of the problem from users.
Here is a sporadically reproducible nearly-minimal example entirely outside of nimble.
As file1.cpp:
#include "file1.h"
SEXP foo_1() {
Rprintf("hello world\n");
return R_NilValue;
}
As file1.h:
#define R_NO_REMAP
#include "R.h"
#include "Rinternals.h"
#include "Rdefines.h"
extern "C" SEXP foo_1();
As file2.cpp:
#include "file2.h"
SEXP foo_2() {
return R_NilValue;
}
as file2.h
#include "R.h"
#include "Rinternals.h"
#include "Rdefines.h"
extern "C" SEXP foo_2();
Then from R, run:
system2("rm", "test*.*")
system2("cp", c("file1.cpp","test.cpp"))
system2("R", c("CMD SHLIB", "test.cpp", "-o test1.so"))
system2("symbols", c("test1.so", "| grep foo"))
dll1 <- dyn.load("test1.so", local=TRUE)
getNativeSymbolInfo("foo_1", dll1) # should be valid
system2("cp", c("file2.cpp","test.cpp"))
system2("R", c("CMD SHLIB", "test.cpp", "-o test2.so"))
system2("symbols", c("test2.so", "| grep foo"))
dll2 <- dyn.load("test2.so", local=TRUE)
getNativeSymbolInfo("foo_2", dll2) # should be valid, sometimes isn't
getNativeSymbolInfo("foo_1", dll2) # should not be valid, sometimes is
The .h and .cpp files might not be fully minimal yet, but close.
Nice - extracting this out of the nimble context is a big step forward.
One thought here is that I could run this by our SCF sysadmin (Ryan Lovett).
I'm also going to try running on my new M2 Mac and see what happens.
Ok, I've reproduced this on my M2 Mac.
If I insert a 3-second sleep before the file2.cpp compilation, everything seems fine. So it feels like a timestamp kind of thing.
I think the following reproduces things using make.
Here's the Makefile:
test.o: test.cpp
clang++ -arch arm64 -std=gnu++17 -I"/Library/Frameworks/R.framework/Resources/include" -DNDEBUG -I/opt/R/arm64/include -fPIC -falign-functions=64 -Wall -g -O2 -c test.cpp -o test.o
test2.so: test.o
clang++ -arch arm64 -std=gnu++17 -dynamiclib -Wl,-headerpad_max_install_names -undefined dynamic_lookup -L/Library/Frameworks/R.framework/Resources/lib -L/opt/R/arm64/lib -o test2.so test.o -F/Library/Frameworks/R.framework/.. -framework R -Wl,-framework -Wl,CoreFoundation
Here's the reprex:
rm test*.*
cp file1.cpp test.cpp
R CMD SHLIB test.cpp -o test1.so
symbols test1.so | grep foo
cp file2.cpp test.cpp
make test2.so
symbols test2.so | grep foo
Actually, I no longer think the sleep
changes things.
However if I manually edit test.cpp
instead of using cp
then things seem fine.
Ryan is also flummoxed.
And one more thing -- I can reproduce the problem on Linux.
I think I've changed my mind and agree with the idea of using --preclean
by default given we seem to have isolated this to strange behavior of make
.
Here's more experimental evidence that the issue has to do with timestamps. The following code runs 100 iterations of the copy-compile scheme I set up above. For each one, it records whether the second compilation was properly done (the .so contains foo_2) or was not done (the .so still contains foo_1). If the Sys.sleep(1)
is commented-out, in about 1/3 of iterations (33/100), the result is wrong (compilation not done).
When I extract timestamps using stat
, we see that, in all cases when the timestamp of test.cpp (after cp file2.cp test.cpp
) is tied with that of the previous test.o, the compilation is not updated (wrong). On the other hand, in many (but not all) cases when those two timestamps do not match, the compilation is updated (right). I speculate that the imperfect match of the latter is due to a "quantum" problem where the steps of the commands take bits of time themselves so observing them is imperfect. But it could just as well be that I am not sure if I am looking at the right one of the available times. I can't find much easy information from search results on getting it below the precision of whole seconds.
When Sys.sleep(1)
is uncommented below, the problem never occurs.
I am somewhat baffled that the timing mechanisms used for builds can have this kind of problem/phenomenon. It does seem that --preclean
may be the intended way to ensure we force compilation.
stat_test_cpp <- character(100)
stat_test_o <- character(100)
found_foo1 <- logical(100)
found_foo2 <- logical(100)
for(i in 1:100) {
system2("rm", "test*.*")
dyn.unload("test1.so")
dyn.unload("test2.so")
system2("cp", c("file1.cpp","test.cpp"))
system2("R", c("CMD SHLIB", "test.cpp", "-o test1.so"))
dll1 <- dyn.load("test1.so", local=TRUE)
# Sys.sleep(1.0)
system2("cp", c("file2.cpp","test.cpp"))
system2("R", c("CMD SHLIB", "test.cpp", "-o test2.so"))
dll2 <- dyn.load("test2.so", local=TRUE)
find_foo2 <- try(getNativeSymbolInfo("foo_2", dll2)) # should be valid, sometimes isn't
find_foo1 <- try(getNativeSymbolInfo("foo_1", dll2)) # should not be valid, sometimes is
found_foo2[i] <- !inherits(find_foo2, "try-error")
found_foo1[i] <- !inherits(find_foo1, "try-error")
system2("stat","test.cpp > stat_test_cpp.txt")
stat_test_cpp[i] <- readLines("stat_test_cpp.txt")
system2("stat","test.o > stat_test_o.txt")
stat_test_o[i] <- readLines("stat_test_o.txt")
}
sum(found_foo1) # how many times did compilation NOT occur (wrong)
sum(found_foo2) # how many time did compilation occur (right)
# pull out "birthtime" timestamps and see when they match between the two stat calls
stat_test_cpp_split <- lapply(stat_test_cpp, \(x) strsplit(x, "\""))
stat_test_o_split <- lapply(stat_test_o, \(x) strsplit(x, "\""))
matches <- mapply(\(x1, x2) identical(x1[[1]][8], x2[[1]][8]), stat_test_cpp_split, stat_test_o_split)
cbind(matches, found_foo1) # align when the timestamps match and when foo_1 was found (indicating that compilation did not occur)
I'm preparing a branch with --preclean
where that is shielded by a nimbleOption but defaults to TRUE.
Just a note that at least for Laplace, this adds an extra, unneeded compilation of nimbleCppADbaseClass.cpp that, for the crossed random effects example in test-ADlaplace.R
adds an additional 20 seconds of C++ compilation time, taking the total time from something like 60 sec. to something like 80 sec on my Linux machine.
g++ -std=gnu++17 -I"/usr/share/R/include" -DNDEBUG -DR_NO_REMAP -DEIGEN_MPL2_ONLY=1 -I"/tmp/nim-devel/nimble/include" -Wno-misleading-indentation -Wno-ignored-attributes -Wno-deprecated-declarations -fpic -g -O2 -ffile-prefix-map=/build/r-base-H0vbME/r-base-4.3.2=. -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -c nimbleCppADbaseClass.cpp -o nimbleCppADbaseClass.o
So at some point we might want to look more carefully at managing what is recompiled more directly.
We're seeing this test failure, possibly only on MacOS:
I was able to reproduce this (fairly reliably but not every time I ran the specific test) on my laptop. Here's what seems to be going on:
calculateWAIC
in lines 562-572, often line 567 or line 572.172
label is in the .cpp/.h file. It is also172
in the interface function in R. However the symbol in the .so file is an "old" one, with a label like170
from one of the previous compilations involved in the repeated calls tocalculateWAIC
.calculateWAIC
fails, theP_<ID>_waicClass_offline.o
file is not being regenerated. Only theclang++
call that creates the .so file from the .o file is invoked, not theclang++
call that creates the .o file from the .cpp file. As a result the old170
symbol is in the .so file. [see below for more details]R CMD SHLIB
, which apparently usesmake
(I haven't looked into details).make
that causes it to think that the .cpp/.h files haven't changed and therefore that the .o file doesn't need to be regenerated, even thoughls -l
clearly shows that the .o file is older than the .cpp/.h files.Here's what we expect our
system2
call toR CMD SHLIB
to do:As indicated above only the second clang++ call occurs in the cases when the error occurs, despite the
SHLIBargs
incppProjectClass$compileFile
being the same.Not sure we want to go this route but using
R CMD SHLIB --preclean
seems to avoid the issue, though since it is sporadic, I can't 100% guarantee it. But in many repeated tries, it no longer had the error. Or perhaps we would generally want to useR CMD SHLIB --clean
with the thought that we don't need the various .o files anyway so cleaning them up could make good sense generally.I'm hoping that because this seems to occur with repeated compilations in this specific situation where the same .cpp/.h/.o file names are used repeatedly that it is not something users would encounter. I don't think we've gotten similar reports though I have a vague feeling we've seen "no such symbol" report(s) before.
Before I do anything more with this, I'll plan to discuss with @perrydv .