Open chris-smith-zocdoc opened 5 years ago
@chris-smith-zocdoc Thanks a lot for your very detailed investigation!
ping @imatiach-msft
also adding @eisber
@chris-smith-zocdoc Interesting. I was able to build the program locally, but for me it only ever succeeds, with logs like the ones below (note I'm on an Ubuntu machine):
11:51:44 PM: Executing task 'program.main()'...

> Task :compileJava UP-TO-DATE
> Task :processResources NO-SOURCE
> Task :classes UP-TO-DATE

> Task :program.main()
BEGIN PROGRAM
[LightGBM] [Warning] There are no meaningful features, as all feature values are constant.
Done

BUILD SUCCESSFUL in 1s
2 actionable tasks: 1 executed, 1 up-to-date
11:51:45 PM: Task execution finished 'program.main()'.
I tried putting the sparse dataset generation code in a loop but it seems to pass:
for (int j = 0; j < 10000; j++) {
    int numRow = 1000;
    int numCols = 79;
    SparseVector[] rows = new SparseVector[numRow];
    for (int i = 0; i < numRow; i++) {
        rows[i] = new SparseVector(numCols, new int[0], new double[0]);
    }
    SWIGTYPE_p_void dataset = generateSparseDataset(rows);
    System.out.println("Done");
}
output:

12:09:30 AM: Executing task 'program.main()'...

> Task :compileJava
> Task :processResources NO-SOURCE
> Task :classes

> Task :program.main()
BEGIN PROGRAM
[LightGBM] [Warning] There are no meaningful features, as all feature values are constant.
<10000 similar lines>
[LightGBM] [Warning] There are no meaningful features, as all feature values are constant.
Done

BUILD SUCCESSFUL in 48s
2 actionable tasks: 2 executed
12:10:18 AM: Task execution finished 'program.main()'.
One curious thing is that the crash on your cluster points to Booster create, and not to the sparse dataset generation code. Even more specifically, it seems to point to the DCGCalculator Init method, namely this frame: "/lib_lightgbm.so(_ZN8LightGBM13DCGCalculator4InitERKSt6vectorIdSaIdEE+0x568)[0x7f112343e3d8]":
/lib/x86_64-linux-gnu/libc.so.6(+0x777e5)[0x7f118a6c07e5]
/lib/x86_64-linux-gnu/libc.so.6(+0x8037a)[0x7f118a6c937a]
/lib/x86_64-linux-gnu/libc.so.6(+0x7e9dc)[0x7f118a6c79dc]
/lib/x86_64-linux-gnu/libc.so.6(cfree+0x4c)[0x7f118a6cd53c]
/lib/x86_64-linux-gnu/libc.so.6(+0x81cde)[0x7f118a6cacde]
/local_disk0/tmp/mml-natives5905437768242259451/lib_lightgbm.so(_ZN8LightGBM13DCGCalculator4InitERKSt6vectorIdSaIdEE+0x568)[0x7f112343e3d8]
/lib/x86_64-linux-gnu/libc.so.6(__libc_malloc+0x54)[0x7f118a6cd184]
/local_disk0/tmp/mml-natives5905437768242259451/lib_lightgbm.so(_ZN8LightGBM17ObjectiveFunction23CreateObjectiveFunctionERKSsRKNS_6ConfigE+0x8e3)[0x7f11234a5123]
/databricks/python/lib/libstdc++.so.6(_Znwm+0x16)[0x7f118ad6f097]
/local_disk0/tmp/mml-natives5905437768242259451/lib_lightgbm.so(_ZN8LightGBM7Booster25CreateObjectiveAndMetricsEv+0x21)[0x7f1123323051]
/local_disk0/tmp/mml-natives5905437768242259451/lib_lightgbm.so(_ZN8LightGBM13DCGCalculator4InitERKSt6vectorIdSaIdEE+0x463)[0x7f112343e2d3]
/local_disk0/tmp/mml-natives5905437768242259451/lib_lightgbm.so(LGBM_BoosterCreate+0x163)[0x7f1123314e13]
/local_disk0/tmp/mml-natives5905437768242259451/lib_lightgbm.so(_ZN8LightGBM17ObjectiveFunction23CreateObjectiveFunctionERKSsRKNS_6ConfigE+0x8e3)[0x7f11234a5123]
/local_disk0/tmp/mml-natives8293617996259236758/lib_lightgbm_swig.so(Java_com_microsoft_ml_lightgbm_lightgbmlibJNI_LGBM_1BoosterCreate+0x3f)[0x7f112309be7f]
[0x7f11739f1407]
specifically this method: https://github.com/microsoft/LightGBM/blob/master/src/metric/dcg_calculator.cpp#L43. I wonder if there is something wrong with the DCG calculation; it's hard to tell without a repro.
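For context, that Init roughly does the following (paraphrased from the standard DCG formulation; the exact LightGBM source may differ, and kMaxPosition is an assumed constant here). Both loops resize std::vectors, which would explain the malloc/free frames under Init in the stack above:

// Rough paraphrase of DCGCalculator::Init, NOT the exact LightGBM source:
// copy the configured label gains and precompute positional discounts
// 1/log2(i + 2). The vector resizes are where allocation happens, so
// heap corruption from earlier calls could surface right here.
#include <cmath>
#include <vector>

static std::vector<double> label_gain_;
static std::vector<double> discount_;
static const int kMaxPosition = 10000;  // assumed constant

void Init(const std::vector<double>& input_label_gain) {
    label_gain_.assign(input_label_gain.begin(), input_label_gain.end());
    discount_.resize(kMaxPosition);
    for (int i = 0; i < kMaxPosition; ++i) {
        discount_[i] = 1.0 / std::log2(2.0 + i);
    }
}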
@imatiach-msft I'm not too concerned about that particular stack for the cluster crash yet; I think it may be the result of existing heap corruption from previous calls to LGBM_DatasetCreateFromCSRSpark (note: I'm training multiple models in a loop on my cluster).
I've also noticed that repeated calls to generateSparseDataset from the same JVM instance don't reproduce the crash. To reproduce it I need to loop over the actual program invocation, so we get a fresh JVM each time:
#!/bin/bash
count=0
x=0
while [ $x -eq 0 ]
do
    echo "-------------------------------------------------------"
    count=$(( $count + 1 ))
    echo "run $count"
    args="-XX:+UnlockDiagnosticVMOptions -Xcheck:jni -verbose:jni -Xint -Xdiag -XX:+LogCompilation -verbose:gc -Djava.compiler=NONE"
    java $args -classpath [classpath] program
    x=$?
done
echo "failed after $count attempts"
I'm running with -Xcheck:jni because it makes the problem easier to spot; without it we get several different types of undefined behavior. The other options I've enabled don't seem to affect the crash behavior; I'm just including them for completeness.
Here's an example crash without setting -Xcheck:jni. In this one we see malloc failures because jenv->GetArrayLength(indices) returned a negative number:
[LightGBM] [Warning] GetArrayLength is less than zero, this is going to end poorly
[LightGBM] [Warning] GetIntArrayElements
java(23481,0x7000055ba000) malloc: can't allocate region
*** mach_vm_map(size=18446744072484265984) failed (error code=3)
java(23481,0x7000055ba000) malloc: *** set a breakpoint in malloc_error_break to debug
[LightGBM] [Warning] GetDoubleArrayElements
[LightGBM] [Warning] GetObjectArrayElement
[LightGBM] [Warning] CallObjectMethod index
[LightGBM] [Warning]
Calling class is: java.lang.String
[LightGBM] [Warning]
WTF ITS A STRING
[LightGBM] [Warning] {"type":0,"size":79,"indices":[],"values":[]}
[LightGBM] [Warning] CallObjectMethod values
[LightGBM] [Warning] GetArrayLength; -1902086440
[LightGBM] [Warning] GetArrayLength: -306321787
[LightGBM] [Warning] GetArrayLength is less than zero, this is going to end poorly
[LightGBM] [Warning] GetIntArrayElements
java(23481,0x7000055ba000) malloc: can't allocate region
*** mach_vm_map(size=18446744072484265984) failed (error code=3)
java(23481,0x7000055ba000) malloc: *** set a breakpoint in malloc_error_break to debug
[LightGBM] [Warning] GetDoubleArrayElements
[LightGBM] [Warning] return -1
Exception in thread "main" java.lang.Exception: Dataset create call failed in LightGBM with error: allocator<T>::allocate(size_t n) 'n' exceeds maximum supported size
at program.validate(program.java:12)
at program.generateSparseDataset(program.java:104)
at program.main(program.java:138)
failed after 65 attempts
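For reference, here's a minimal sketch (a hypothetical helper, not the actual lightgbmlib.i code) of how a negative length from GetArrayLength turns into those near-2^64 allocation requests like mach_vm_map(size=18446744072484265984):

// Hypothetical sketch: when GetArrayLength returns a negative jsize
// (because 'indices' was never a real int[]), converting it to the
// unsigned size_t an allocator expects wraps it to a value just under
// 2^64, which is what malloc then tries to reserve.
#include <jni.h>
#include <algorithm>
#include <vector>

void copyIndices(JNIEnv* jenv, jintArray indices) {
    jsize len = jenv->GetArrayLength(indices);  // garbage if 'indices' is not really an int[]
    if (len < 0) {
        // Without this guard, the vector below would request
        // (size_t)len * sizeof(jint) bytes, i.e. nearly 2^64.
        return;
    }
    std::vector<jint> buf(static_cast<size_t>(len));
    jint* elems = jenv->GetIntArrayElements(indices, nullptr);
    if (elems != nullptr) {
        std::copy(elems, elems + len, buf.begin());
        jenv->ReleaseIntArrayElements(indices, elems, JNI_ABORT);
    }
}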
Here's a different one with a large positive number
BEGIN PROGRAM
class org.apache.spark.mllib.linalg.SparseVector
[LightGBM] [Warning] FindClass
[LightGBM] [Warning] GetMethodID index
[LightGBM] [Warning] GetMethodID values
[LightGBM] [Warning] reserve rows: 10
[LightGBM] [Warning] num cols: 79
[LightGBM] [Warning] GetObjectArrayElement
[LightGBM] [Warning] CallObjectMethod index
[LightGBM] [Warning] Call to sparseVectorIndices failed
Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: You provided 212361388 indices and values, which exceeds the specified vector size 79.
at scala.Predef$.require(Predef.scala:281)
at org.apache.spark.mllib.linalg.SparseVector.<init>(Vectors.scala:763)
at org.apache.spark.mllib.linalg.SparseVector.toSparseWithSize(Vectors.scala:850)
at com.microsoft.ml.lightgbm.lightgbmlibJNI.LGBM_DatasetCreateFromCSRSpark(Native Method)
at com.microsoft.ml.lightgbm.lightgbmlib.LGBM_DatasetCreateFromCSRSpark(lightgbmlib.java:286)
at program.generateSparseDataset(program.java:104)
at program.main(program.java:138)
I'll try to get this setup on Linux to rule out any local environmental issues also.
@chris-smith-zocdoc I think this is the issue here: https://github.com/microsoft/LightGBM/pull/2364. The error above looks very similar.
The error is coming from this code:

private static SWIGTYPE_p_int64_t intToPtr(int value) {
    SWIGTYPE_p_long longPtr = lightgbmlib.new_longp();
    lightgbmlib.longp_assign(longPtr, value);
    return lightgbmlib.long_to_int64_t_ptr(longPtr);
}

long is a smaller type than int64_t on some systems, so after the cast the upper bits are random garbage and the size becomes a huge value. It's reproducible on almost every invocation on Windows, but on Linux it works fine.
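A quick way to see the mismatch (an illustrative standalone sketch, not the SWIG-generated code):

// On LP64 Linux sizeof(long) == 8, but on LLP64 Windows sizeof(long) == 4,
// so reading a value stored as 'long' back through an int64_t* picks up
// 4 bytes of neighboring memory as the upper half of the result.
#include <cstdint>
#include <cstdio>

int main() {
    std::printf("sizeof(long)    = %zu\n", sizeof(long));      // 8 on LP64, 4 on LLP64
    std::printf("sizeof(int64_t) = %zu\n", sizeof(int64_t));   // always 8

    long slots[2] = {79, 0x7f11beef};  // slots[1] stands in for adjacent garbage
    // With a 32-bit long this 64-bit read spans both slots and yields a
    // huge number instead of 79. (It also violates strict aliasing; it
    // is only here to demonstrate the bug, not as a pattern to copy.)
    int64_t read = *reinterpret_cast<int64_t*>(&slots[0]);
    std::printf("read back: %lld\n", static_cast<long long>(read));
    return 0;
}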
I was just debugging this last night.
There's a corresponding PR on the mmlspark side that I need to send out; I will post it tonight. But with the PR I sent (#2364) you can pass the long value directly and remove the intToPtr method in your code; hopefully then you shouldn't see this issue anymore.
I tried with that change but am still seeing the same behavior. The new method, without the intToPtr call, is:
private static SWIGTYPE_p_void generateSparseDataset(SparseVector[] sparseRows) throws Exception {
    // https://github.com/Azure/mmlspark/blob/360f2f7d8116a931bf373874cd558c43d7d98973/src/main/scala/com/microsoft/ml/spark/lightgbm/LightGBMUtils.scala#L240
    int numCols = sparseRows[0].size();
    SWIGTYPE_p_p_void datasetOutPtr = lightgbmlib.voidpp_handle();
    String datasetParams = "max_bin=255 is_pre_partition=True";
    // Generate the dataset for features
    validate(lightgbmlib.LGBM_DatasetCreateFromCSRSpark(
        sparseRows,
        sparseRows.length,
        numCols,
        datasetParams,
        null,
        datasetOutPtr),
        "Dataset create");
    return lightgbmlib.voidpp_value(datasetOutPtr);
}
num_col never appeared directly as the problem in my output. The failure is the result of auto indices = (jintArray) jenv->CallObjectMethod(objSparseVec, sparseVectorIndices); returning something that isn't an array but some other object.
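A defensive check along these lines (a hypothetical sketch, not the actual fix) would catch the bad object before the cast; it mirrors the validation -Xcheck:jni performs when it aborts with "Non-array passed to JNI array operations":

// Verify the jobject returned by CallObjectMethod really is an int[]
// before treating it as a jintArray.
#include <jni.h>

jintArray getIndicesChecked(JNIEnv* jenv, jobject sparseVec, jmethodID sparseVectorIndices) {
    jobject obj = jenv->CallObjectMethod(sparseVec, sparseVectorIndices);
    if (obj == nullptr || jenv->ExceptionCheck()) {
        return nullptr;  // the call itself failed or threw
    }
    jclass intArrayClass = jenv->FindClass("[I");  // class object for int[]
    if (intArrayClass == nullptr || !jenv->IsInstanceOf(obj, intArrayClass)) {
        // This is the state observed above: a String or a Class came
        // back instead of the int[] we asked for.
        return nullptr;
    }
    return static_cast<jintArray>(obj);
}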
On Linux I'm able to reproduce the issue with -Xcheck:jni but not without it, which is interesting. One other difference is that the failure occurs calling GetDoubleArrayElements instead of GetIntArrayElements or GetArrayLength.
root@188164040d84:/src/lightgbm-demo# java -version
openjdk version "1.8.0_222"
OpenJDK Runtime Environment (build 1.8.0_222-8u222-b10-1ubuntu1~16.04.1-b10)
OpenJDK 64-Bit Server VM (build 25.222-b10, mixed mode)
[LightGBM] [Warning] CallObjectMethod values
[Dynamic-linking native method java.io.FileInputStream.available0 ... JNI]
[Dynamic-linking native method sun.reflect.ConstantPool.getUTF8At0 ... JNI]
[Dynamic-linking native method sun.reflect.NativeMethodAccessorImpl.invoke0 ... JNI]
[Dynamic-linking native method java.lang.reflect.Proxy.defineClass0 ... JNI]
[Dynamic-linking native method java.util.TimeZone.getSystemTimeZoneID ... JNI]
[LightGBM] [Warning] GetArrayLength; -1073186848
[LightGBM] [Warning] GetArrayLength: 0
[LightGBM] [Warning] GetIntArrayElements
[LightGBM] [Warning] GetDoubleArrayElements
FATAL ERROR in native method: Non-array passed to JNI array operations
at com.microsoft.ml.lightgbm.lightgbmlibJNI.LGBM_DatasetCreateFromCSRSpark(Native Method)
at com.microsoft.ml.lightgbm.lightgbmlib.LGBM_DatasetCreateFromCSRSpark(lightgbmlib.java:286)
at program.generateSparseDataset(program.java:26)
at program.main(program.java:60)
@chris-smith-zocdoc Thanks for the repro; I will take a look this weekend to see if I can get to the bottom of those failures. I suspect it is something similar to the Windows issue, where the sizes of some types differ across systems - at least based on the large and negative values in the stack traces above and in the malloc errors.
I'm seeing the same error with -Xcheck:jni on Linux; not sure how to fix it yet though.
FATAL ERROR in native method: Non-array passed to JNI array operations
at com.microsoft.ml.lightgbm.lightgbmlibJNI.LGBM_DatasetCreateFromCSRSpark(Native Method)
at com.microsoft.ml.lightgbm.lightgbmlib.LGBM_DatasetCreateFromCSRSpark(lightgbmlib.java:286)
at program.generateSparseDataset(program.java:28)
at program.main(program.java:60)
@imatiach-msft Have you tried reproducing this again after adding the JNI checks (https://github.com/microsoft/LightGBM/issues/2360)? It would be interesting to see if we get any better errors.
@chris-smith-zocdoc Are you still facing this error?
On the latest master of both LightGBM + mmlspark I am no longer seeing crashes. I'll re-open if they appear again https://github.com/microsoft/LightGBM/issues/2392#issuecomment-540001969
I've been using dense vectors to avoid this method for my current work, I'll try it again later and report back
@chris-smith-zocdoc Any news?
I'm still seeing this bug https://github.com/microsoft/LightGBM/issues/2392#issuecomment-545192286 so it'd be hard to tell them apart at this point.
@imatiach-msft Is this fixed? Is there anything I can help with?
I'm seeing JVM crashes in our Spark cluster which I believe are caused by LGBM_DatasetCreateFromCSRSpark.
https://github.com/microsoft/LightGBM/issues/2360 indicated some issues in that method, so I set up a local stress test that just invokes a tiny Java application in a loop. Doing this I was quite easily (> 1 out of 100 runs) able to elicit a JVM crash. It's not exactly what I'm seeing in our Spark cluster, though I think the two are probably related.
Under the same local stress test, calling LGBM_DatasetCreateFromMat does not lead to crashes.
I added some additional logging to LGBM_DatasetCreateFromCSRSpark and recompiled locally to help diagnose where the issue is. I added log lines right before each method call in lightgbmlib.i.
Example logs from a good run
Crashing execution
Everything looks similar to the good run, except we end up with a crash. There are extra Dynamic-linking native lines most times, but I'm not sure whether they are cause or effect.
So the JVM is telling us that indices is not a valid array, but it is still a valid jobject, which allows us to inspect it. What I'm observing here makes no sense to me, however: the type of the object varies on each crash; sometimes it is a string, sometimes it is a class.
When it's a string, it tends to be {"type":0,"size":79,"indices":[],"values":[]}, which is the result of calling .toJson() on the sparse vector. When it's a class, it tends to be org.apache.spark.ml.linalg.SparseVector, which is the class we were supposed to call the method on. All this seems to indicate that we're somehow calling the wrong method here.
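One way that can happen (a hypothetical illustration, not a confirmed diagnosis): JNI does not verify that a jmethodID belongs to the receiver's actual class, so an ID resolved against the wrong SparseVector class (note Spark has both org.apache.spark.ml.linalg and org.apache.spark.mllib.linalg variants) silently dispatches to some other method:

// Resolving the method against the object's *runtime* class avoids the
// mismatch; an ID cached from a hard-coded class name does not. With a
// mismatched jmethodID, CallObjectMethod can return an arbitrary object,
// which would explain getting back a String (toJson?) or a Class.
#include <jni.h>

jobject callIndices(JNIEnv* jenv, jobject vec) {
    jclass actualClass = jenv->GetObjectClass(vec);
    jmethodID indicesId = jenv->GetMethodID(actualClass, "indices", "()[I");
    if (indicesId == nullptr) {
        jenv->ExceptionClear();
        return nullptr;  // the runtime class has no indices() method
    }
    return jenv->CallObjectMethod(vec, indicesId);
}

Looking up the ID through GetObjectClass on every call is slower than caching a jmethodID, but it can't pair an ID with the wrong class.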
Environment
Spark cluster is described in https://github.com/microsoft/LightGBM/issues/2360. For this local test: OSX 10.14.6, using a local build of LightGBM and lightgbmlib.jar.
Program to reproduce
Gradle file
Example crash from our Spark cluster