microsoft / LightGBM

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
https://lightgbm.readthedocs.io/en/latest/
MIT License

JVM crash caused by LGBM_DatasetCreateFromCSRSpark #2363

Open chris-smith-zocdoc opened 5 years ago

chris-smith-zocdoc commented 5 years ago

I'm seeing JVM crashes in our Spark cluster, which I believe are caused by LGBM_DatasetCreateFromCSRSpark.

https://github.com/microsoft/LightGBM/issues/2360 indicated some issues in that method, so I set up a local stress test that just invokes a tiny Java application in a loop. Doing this I was quite easily (> 1 out of 100 times) able to elicit a JVM crash. It's not exactly what I'm seeing in our Spark cluster, but I think they are probably related.

Under the same local stress test, calling LGBM_DatasetCreateFromMat does not lead to crashes.

I added some additional logging to LGBM_DatasetCreateFromCSRSpark and recompiled locally to help diagnose where the issue is. I added a log line in lightgbmlib.i right before each JNI call.

Example logs from a good run:

BEGIN PROGRAM
[Dynamic-linking native method java.lang.Package.getSystemPackage0 ... JNI]
[Dynamic-linking native method sun.reflect.NativeMethodAccessorImpl.invoke0 ... JNI]
[Dynamic-linking native method java.lang.Class.isInstance ... JNI]
[Dynamic-linking native method java.security.AccessController.doPrivileged ... JNI]
[Dynamic-linking native method java.lang.System.identityHashCode ... JNI]
[Dynamic-linking native method com.microsoft.ml.lightgbm.lightgbmlibJNI.voidpp_handle ... JNI]
[Dynamic-linking native method com.microsoft.ml.lightgbm.lightgbmlibJNI.new_longp ... JNI]
[Dynamic-linking native method com.microsoft.ml.lightgbm.lightgbmlibJNI.longp_assign ... JNI]
[Dynamic-linking native method com.microsoft.ml.lightgbm.lightgbmlibJNI.long_to_int64_t_ptr ... JNI]
[Dynamic-linking native method com.microsoft.ml.lightgbm.lightgbmlibJNI.LGBM_DatasetCreateFromCSRSpark ... JNI]
[LightGBM] [Warning] FindClass
[LightGBM] [Warning] GetMethodID index
[LightGBM] [Warning] GetMethodID values
[LightGBM] [Warning] reserve rows: 10
[LightGBM] [Warning] num cols: 79

[LightGBM] [Warning] GetObjectArrayElement
[LightGBM] [Warning] CallObjectMethod index
[LightGBM] [Warning] CallObjectMethod values
[LightGBM] [Warning] GetArrayLength; -1077856400
[LightGBM] [Warning] GetArrayLength: 0
[LightGBM] [Warning] GetIntArrayElements
[LightGBM] [Warning] GetDoubleArrayElements

[ repeated 10 times]

[LightGBM] [Warning] There are no meaningful features, as all feature values are constant.
[LightGBM] [Warning] return 0
[Dynamic-linking native method com.microsoft.ml.lightgbm.lightgbmlibJNI.voidpp_value ... JNI]
Done

Crashing execution

Everything looks similar to the good run, except we end up with a crash. Most runs show extra Dynamic-linking native lines, but I'm not sure whether they are cause or effect.

[snip]

[LightGBM] [Warning] FindClass
[LightGBM] [Warning] GetMethodID index
[LightGBM] [Warning] GetMethodID values
[LightGBM] [Warning] reserve rows: 10
[LightGBM] [Warning] num cols: 79

[LightGBM] [Warning] GetObjectArrayElement
[LightGBM] [Warning] CallObjectMethod index
[Dynamic-linking native method sun.reflect.ConstantPool.getUTF8At0 ... JNI]
[Dynamic-linking native method java.lang.reflect.Proxy.defineClass0 ... JNI]
[Dynamic-linking native method java.util.TimeZone.getSystemTimeZoneID ... JNI]
[LightGBM] [Warning] CallObjectMethod values
[LightGBM] [Warning] GetArrayLength; 1045488784
FATAL ERROR in native method: Non-array passed to JNI array operations
    at com.microsoft.ml.lightgbm.lightgbmlibJNI.LGBM_DatasetCreateFromCSRSpark(Native Method)
    at com.microsoft.ml.lightgbm.lightgbmlib.LGBM_DatasetCreateFromCSRSpark(lightgbmlib.java:286)
    at program.generateSparseDataset(program.java:104)
    at program.main(program.java:136)

So the JVM is telling us that indices is not a valid array, but it is still a valid jobject, which allows us to inspect it. What I'm observing here makes no sense to me, however. The type of the object varies on each crash; sometimes it is a string, sometimes it is a class.

When it's a string, it tends to be {"type":0,"size":79,"indices":[],"values":[]}, which is the result of calling .toJson() on the sparse vector.

When it's a class, it tends to be org.apache.spark.ml.linalg.SparseVector, which is the class whose method we were supposed to call.

All this seems to indicate that we're somehow calling the wrong method here.
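The FindClass / GetMethodID / CallObjectMethod sequence in the logs is the native analogue of Java reflection. As a rough illustration (using a hypothetical FakeSparseVector stand-in, not the real Spark class), invoking the right versus the wrong method handle looks like this; the difference is that JNI's C-style (jintArray) cast would not catch the second case:

    import java.lang.reflect.Method;

    // Hypothetical stand-in for org.apache.spark.ml.linalg.SparseVector:
    // the JNI wrapper looks up "indices" and "values" by name and invokes them.
    class FakeSparseVector {
        int[] indices() { return new int[] {1, 5}; }
        double[] values() { return new double[] {0.5, 2.0}; }
        public String toJson() { return "{\"type\":0,\"size\":79,\"indices\":[],\"values\":[]}"; }
    }

    public class ReflectionSketch {
        public static void main(String[] args) throws Exception {
            FakeSparseVector vec = new FakeSparseVector();

            // Roughly what FindClass + GetMethodID + CallObjectMethod do:
            Method indicesMethod = FakeSparseVector.class.getDeclaredMethod("indices");
            Object result = indicesMethod.invoke(vec);
            System.out.println(result instanceof int[]);  // true in the good case

            // If the wrong method handle ends up being invoked (e.g. toJson),
            // the returned object is a String, not an int[]. Java reflection
            // makes the mismatch visible; JNI just reinterprets the jobject.
            Method wrongMethod = FakeSparseVector.class.getDeclaredMethod("toJson");
            Object wrong = wrongMethod.invoke(vec);
            System.out.println(wrong instanceof int[]);  // false: it's a String
        }
    }

That would be consistent with the symptoms above: whatever object the wrong call returns (a String from toJson, or a Class) is then fed to the array operations.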

Environment

Spark cluster is described in https://github.com/microsoft/LightGBM/issues/2360

For this local test: OS X 10.14.6, using a local build of LightGBM and lightgbmlib.jar.

Program to reproduce

import com.microsoft.ml.lightgbm.*;
import org.apache.commons.lang.ArrayUtils;
import org.apache.spark.mllib.linalg.SparseVector;
import java.util.ArrayList;

public class program {

    private static void validate(int result, String component) throws Exception {
        if (result == -1) {
            throw new Exception(component + " call failed in LightGBM with error: " + lightgbmlib.LGBM_GetLastError());
        }
    }

    private static SWIGTYPE_p_int64_t intToPtr(int value) {
        SWIGTYPE_p_long longPtr = lightgbmlib.new_longp();
        lightgbmlib.longp_assign(longPtr, value);
        return lightgbmlib.long_to_int64_t_ptr(longPtr);
    }

    private static SWIGTYPE_p_void generateSparseDataset(SparseVector[] sparseRows) throws Exception {
        // https://github.com/Azure/mmlspark/blob/360f2f7d8116a931bf373874cd558c43d7d98973/src/main/scala/com/microsoft/ml/spark/lightgbm/LightGBMUtils.scala#L240
        int numCols = sparseRows[0].size();

        SWIGTYPE_p_p_void datasetOutPtr = lightgbmlib.voidpp_handle();
        String datasetParams = "max_bin=255 is_pre_partition=True";

        // Generate the dataset for features
        validate(lightgbmlib.LGBM_DatasetCreateFromCSRSpark(
                sparseRows,
                sparseRows.length,
                intToPtr(numCols),
                datasetParams,
                null,
                datasetOutPtr),
                "Dataset create");

        return lightgbmlib.voidpp_value(datasetOutPtr);
    }

    public static void main(String[] args) throws Exception {
        System.out.println("BEGIN PROGRAM");

        try {
            System.load("/full_path_to_lib/lib_lightgbm.dylib");
            System.load("/full_path_to_lib/lib_lightgbm_swig.jnilib");
        } catch (UnsatisfiedLinkError e) {
            System.err.println(e.getMessage());
            e.printStackTrace();
            return;
        }

        int numRow = 10;
        int numCols = 79;

        SparseVector[] rows = new SparseVector[numRow];
        for (int i = 0; i < numRow; i++) {
            rows[i] = new SparseVector(numCols, new int[0], new double[0]);
        }

        SWIGTYPE_p_void dataset = generateSparseDataset(rows);

        System.out.println("Done");
    }
}

Gradle file

plugins {
    java
}

group = "whatever"
version = "1.0-SNAPSHOT"

repositories {
    mavenCentral()
}

dependencies {
    implementation(files("libs/lightgbmlib.jar"))
    implementation("org.apache.spark", "spark-mllib_2.12", "2.4.3")
    testCompile("junit", "junit", "4.12")
}

configure<JavaPluginConvention> {
    sourceCompatibility = JavaVersion.VERSION_1_8
}

Example crash from our spark cluster

19/08/26 21:19:22 INFO LightGBMRanker: LightGBM worker listening on: 12407
19/08/26 21:19:34 INFO LightGBMRanker: LightGBM worker generating sparse dataset with 418231 rows and 79 columns
19/08/26 21:19:34 INFO LightGBMRanker: LightGBM worker generating sparse dataset with 418052 rows and 79 columns
19/08/26 21:19:34 INFO LightGBMRanker: LightGBM worker generating sparse dataset with 420478 rows and 79 columns
19/08/26 21:19:34 INFO LightGBMRanker: LightGBM worker generating sparse dataset with 423064 rows and 79 columns
*** Error in `/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java': double free or corruption (!prev): 0x0000000002c65ab0 ***
*** Error in `/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java': corrupted size vs. prev_size: 0x00007f10e2b234d0 ***
======= Backtrace: =========
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x777e5)[0x7f118a6c07e5]
/lib/x86_64-linux-gnu/libc.so.6(+0x777e5)[0x7f118a6c07e5]
/lib/x86_64-linux-gnu/libc.so.6(+0x8037a)[0x7f118a6c937a]
/lib/x86_64-linux-gnu/libc.so.6(+0x7e9dc)[0x7f118a6c79dc]
/lib/x86_64-linux-gnu/libc.so.6(cfree+0x4c)[0x7f118a6cd53c]
/lib/x86_64-linux-gnu/libc.so.6(+0x81cde)[0x7f118a6cacde]
/local_disk0/tmp/mml-natives5905437768242259451/lib_lightgbm.so(_ZN8LightGBM13DCGCalculator4InitERKSt6vectorIdSaIdEE+0x568)[0x7f112343e3d8]
/lib/x86_64-linux-gnu/libc.so.6(__libc_malloc+0x54)[0x7f118a6cd184]
/local_disk0/tmp/mml-natives5905437768242259451/lib_lightgbm.so(_ZN8LightGBM17ObjectiveFunction23CreateObjectiveFunctionERKSsRKNS_6ConfigE+0x8e3)[0x7f11234a5123]
/databricks/python/lib/libstdc++.so.6(_Znwm+0x16)[0x7f118ad6f097]
/local_disk0/tmp/mml-natives5905437768242259451/lib_lightgbm.so(_ZN8LightGBM7Booster25CreateObjectiveAndMetricsEv+0x21)[0x7f1123323051]
/local_disk0/tmp/mml-natives5905437768242259451/lib_lightgbm.so(_ZN8LightGBM13DCGCalculator4InitERKSt6vectorIdSaIdEE+0x463)[0x7f112343e2d3]
/local_disk0/tmp/mml-natives5905437768242259451/lib_lightgbm.so(LGBM_BoosterCreate+0x163)[0x7f1123314e13]
/local_disk0/tmp/mml-natives5905437768242259451/lib_lightgbm.so(_ZN8LightGBM17ObjectiveFunction23CreateObjectiveFunctionERKSsRKNS_6ConfigE+0x8e3)[0x7f11234a5123]
/local_disk0/tmp/mml-natives8293617996259236758/lib_lightgbm_swig.so(Java_com_microsoft_ml_lightgbm_lightgbmlibJNI_LGBM_1BoosterCreate+0x3f)[0x7f112309be7f]
[0x7f11739f1407]
======= Memory map: ========
StrikerRUS commented 5 years ago

@chris-smith-zocdoc Thanks a lot for your very detailed investigation!

ping @imatiach-msft

imatiach-msft commented 5 years ago

also adding @eisber

imatiach-msft commented 5 years ago

@chris-smith-zocdoc interesting, I was able to build and run the program locally, but for me it only ever succeeds, with logs like the ones below (note I'm on an Ubuntu machine):

11:51:44 PM: Executing task 'program.main()'...

Task :compileJava UP-TO-DATE
Task :processResources NO-SOURCE
Task :classes UP-TO-DATE

Task :program.main()
BEGIN PROGRAM
[LightGBM] [Warning] There are no meaningful features, as all feature values are constant.
Done

BUILD SUCCESSFUL in 1s
2 actionable tasks: 1 executed, 1 up-to-date
11:51:45 PM: Task execution finished 'program.main()'.

imatiach-msft commented 5 years ago

I tried putting the sparse dataset generation code in a loop, but it still seems to pass:

        for (int j = 0; j < 10000; j++) {
            int numRow = 1000;
            int numCols = 79;

            SparseVector[] rows = new SparseVector[numRow];
            for (int i = 0; i < numRow; i++) {
                rows[i] = new SparseVector(numCols, new int[0], new double[0]);
            }

            SWIGTYPE_p_void dataset = generateSparseDataset(rows);

            System.out.println("Done");
        }

output:

12:09:30 AM: Executing task 'program.main()'...

Task :compileJava
Task :processResources NO-SOURCE
Task :classes

Task :program.main()
BEGIN PROGRAM
[LightGBM] [Warning] There are no meaningful features, as all feature values are constant.

<10000 similar lines>

[LightGBM] [Warning] There are no meaningful features, as all feature values are constant.
Done

BUILD SUCCESSFUL in 48s
2 actionable tasks: 2 executed
12:10:18 AM: Task execution finished 'program.main()'.

imatiach-msft commented 5 years ago

One curious thing is that the crash on your cluster points to booster creation, not to the sparse dataset generation code. Even more specifically, it seems to point to the DCGCalculator Init method, on this line: "/lib_lightgbm.so(_ZN8LightGBM13DCGCalculator4InitERKSt6vectorIdSaIdEE+0x568)[0x7f112343e3d8]":

/lib/x86_64-linux-gnu/libc.so.6(+0x777e5)[0x7f118a6c07e5]
/lib/x86_64-linux-gnu/libc.so.6(+0x777e5)[0x7f118a6c07e5]
/lib/x86_64-linux-gnu/libc.so.6(+0x8037a)[0x7f118a6c937a]
/lib/x86_64-linux-gnu/libc.so.6(+0x7e9dc)[0x7f118a6c79dc]
/lib/x86_64-linux-gnu/libc.so.6(cfree+0x4c)[0x7f118a6cd53c]
/lib/x86_64-linux-gnu/libc.so.6(+0x81cde)[0x7f118a6cacde]
/local_disk0/tmp/mml-natives5905437768242259451/lib_lightgbm.so(_ZN8LightGBM13DCGCalculator4InitERKSt6vectorIdSaIdEE+0x568)[0x7f112343e3d8]
/lib/x86_64-linux-gnu/libc.so.6(__libc_malloc+0x54)[0x7f118a6cd184]
/local_disk0/tmp/mml-natives5905437768242259451/lib_lightgbm.so(_ZN8LightGBM17ObjectiveFunction23CreateObjectiveFunctionERKSsRKNS_6ConfigE+0x8e3)[0x7f11234a5123]
/databricks/python/lib/libstdc++.so.6(_Znwm+0x16)[0x7f118ad6f097]
/local_disk0/tmp/mml-natives5905437768242259451/lib_lightgbm.so(_ZN8LightGBM7Booster25CreateObjectiveAndMetricsEv+0x21)[0x7f1123323051]
/local_disk0/tmp/mml-natives5905437768242259451/lib_lightgbm.so(_ZN8LightGBM13DCGCalculator4InitERKSt6vectorIdSaIdEE+0x463)[0x7f112343e2d3]
/local_disk0/tmp/mml-natives5905437768242259451/lib_lightgbm.so(LGBM_BoosterCreate+0x163)[0x7f1123314e13]
/local_disk0/tmp/mml-natives5905437768242259451/lib_lightgbm.so(_ZN8LightGBM17ObjectiveFunction23CreateObjectiveFunctionERKSsRKNS_6ConfigE+0x8e3)[0x7f11234a5123]
/local_disk0/tmp/mml-natives8293617996259236758/lib_lightgbm_swig.so(Java_com_microsoft_ml_lightgbm_lightgbmlibJNI_LGBM_1BoosterCreate+0x3f)[0x7f112309be7f]
[0x7f11739f1407]

Specifically this method: https://github.com/microsoft/LightGBM/blob/master/src/metric/dcg_calculator.cpp#L43. I wonder if there is something wrong with the DCG calculation; hard to tell without a repro.

chris-smith-zocdoc commented 5 years ago

@imatiach-msft I'm not too concerned about that particular stack for the cluster crash yet; I think it may be the result of existing heap corruption from previous calls to LGBM_DatasetCreateFromCSRSpark (note: I'm training multiple models in a loop on my cluster).

I've also noticed that repeated calls to generateSparseDataset from the same JVM instance don't reproduce the crash. To reproduce it I need to loop over the actual program invocation (so we get a fresh JVM each time):

#!/bin/bash

count=0
x=0
while [ $x -eq 0 ]
do
  echo "-------------------------------------------------------"
  count=$(( $count + 1 ))
  echo "run $count"

  args="-XX:+UnlockDiagnosticVMOptions -Xcheck:jni -verbose:jni -Xint -Xdiag -XX:+LogCompilation -verbose:gc -Djava.compiler=NONE"

  java $args -classpath [classpath] program
  x=$?
done

echo "failed after $count attempts"

I'm running with -Xcheck:jni because it makes the problem easier to spot; without it we get several different types of undefined behavior. The other options I've enabled don't seem to affect the crash behavior; I'm just including them for completeness.

Here's an example crash without setting -Xcheck:jni. In this one we see malloc failures because jenv->GetArrayLength(indices) returned a negative number:

[LightGBM] [Warning] GetArrayLength is less than zero, this is going to end poorly
[LightGBM] [Warning] GetIntArrayElements
java(23481,0x7000055ba000) malloc: can't allocate region
*** mach_vm_map(size=18446744072484265984) failed (error code=3)
java(23481,0x7000055ba000) malloc: *** set a breakpoint in malloc_error_break to debug
[LightGBM] [Warning] GetDoubleArrayElements
[LightGBM] [Warning] GetObjectArrayElement
[LightGBM] [Warning] CallObjectMethod index
[LightGBM] [Warning]
Calling class is: java.lang.String

[LightGBM] [Warning]
WTF ITS A STRING

[LightGBM] [Warning] {"type":0,"size":79,"indices":[],"values":[]}
[LightGBM] [Warning] CallObjectMethod values
[LightGBM] [Warning] GetArrayLength; -1902086440
[LightGBM] [Warning] GetArrayLength: -306321787
[LightGBM] [Warning] GetArrayLength is less than zero, this is going to end poorly
[LightGBM] [Warning] GetIntArrayElements
java(23481,0x7000055ba000) malloc: can't allocate region
*** mach_vm_map(size=18446744072484265984) failed (error code=3)
java(23481,0x7000055ba000) malloc: *** set a breakpoint in malloc_error_break to debug
[LightGBM] [Warning] GetDoubleArrayElements
[LightGBM] [Warning] return -1
Exception in thread "main" java.lang.Exception: Dataset create call failed in LightGBM with error: allocator<T>::allocate(size_t n) 'n' exceeds maximum supported size
    at program.validate(program.java:12)
    at program.generateSparseDataset(program.java:104)
    at program.main(program.java:138)
failed after 65 attempts

Here's a different one, with a large positive number:

BEGIN PROGRAM
class org.apache.spark.mllib.linalg.SparseVector
[LightGBM] [Warning] FindClass
[LightGBM] [Warning] GetMethodID index
[LightGBM] [Warning] GetMethodID values
[LightGBM] [Warning] reserve rows: 10
[LightGBM] [Warning] num cols: 79
[LightGBM] [Warning] GetObjectArrayElement
[LightGBM] [Warning] CallObjectMethod index
[LightGBM] [Warning] Call to sparseVectorIndices failed
Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: You provided 212361388 indices and values, which exceeds the specified vector size 79.
    at scala.Predef$.require(Predef.scala:281)
    at org.apache.spark.mllib.linalg.SparseVector.<init>(Vectors.scala:763)
    at org.apache.spark.mllib.linalg.SparseVector.toSparseWithSize(Vectors.scala:850)
    at com.microsoft.ml.lightgbm.lightgbmlibJNI.LGBM_DatasetCreateFromCSRSpark(Native Method)
    at com.microsoft.ml.lightgbm.lightgbmlib.LGBM_DatasetCreateFromCSRSpark(lightgbmlib.java:286)
    at program.generateSparseDataset(program.java:104)
    at program.main(program.java:138)

I'll try to get this set up on Linux to rule out any local environmental issues as well.

imatiach-msft commented 5 years ago

@chris-smith-zocdoc I think this is the issue fixed here: https://github.com/microsoft/LightGBM/pull/2364. The error above looks very similar.

The error is coming from this code:

    private static SWIGTYPE_p_int64_t intToPtr(int value) {
        SWIGTYPE_p_long longPtr = lightgbmlib.new_longp();
        lightgbmlib.longp_assign(longPtr, value);
        return lightgbmlib.long_to_int64_t_ptr(longPtr);
    }

On some systems long is smaller than int64_t, so when casting, the upper bits become random and the size becomes a large value. It's reproducible on almost every invocation on Windows, but on Linux it works fine.
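To illustrate the width mismatch (a pure-Java sketch, not LightGBM code): if the helper writes only 4 bytes into an 8-byte slot that the native side later reads as a full int64_t, the upper 4 bytes keep whatever garbage was already there, which would produce exactly the kind of huge or negative sizes seen in the logs above:

    import java.nio.ByteBuffer;
    import java.nio.ByteOrder;

    public class WidthMismatch {
        public static void main(String[] args) {
            // An 8-byte slot, as int64_t occupies in native memory.
            ByteBuffer slot = ByteBuffer.allocate(8).order(ByteOrder.LITTLE_ENDIAN);

            // Simulate stale garbage already sitting in the slot.
            slot.putLong(0, 0xDEADBEEF00000000L);

            // A longp_assign through a 4-byte "long" (as on Windows/MSVC)
            // only overwrites the low 4 bytes.
            slot.putInt(0, 79);  // numCols = 79

            // The native code then reads the full 8 bytes.
            long seenByNative = slot.getLong(0);
            System.out.println(Long.toHexString(seenByNative));  // deadbeef0000004f
            System.out.println(seenByNative == 79);              // false
        }
    }

Since the garbage upper bits can set the sign bit, the value read on the native side can also come out negative, matching the negative GetArrayLength values in the crash logs.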

I was just debugging this last night.

imatiach-msft commented 5 years ago

There's a corresponding PR on the mmlspark side that I need to send out; will post it tonight. But with the PR I sent to LightGBM (#2364) you can pass the long value directly and remove the intToPtr method in your code. Hopefully you won't see this issue anymore.

chris-smith-zocdoc commented 5 years ago

I tried with that change but am still seeing the same behavior. The new method, without the intToPtr call, is:

private static SWIGTYPE_p_void generateSparseDataset(SparseVector[] sparseRows) throws Exception {
        // https://github.com/Azure/mmlspark/blob/360f2f7d8116a931bf373874cd558c43d7d98973/src/main/scala/com/microsoft/ml/spark/lightgbm/LightGBMUtils.scala#L240
        int numCols = sparseRows[0].size();

        SWIGTYPE_p_p_void datasetOutPtr = lightgbmlib.voidpp_handle();
        String datasetParams = "max_bin=255 is_pre_partition=True";

        // Generate the dataset for features
        validate(lightgbmlib.LGBM_DatasetCreateFromCSRSpark(
                sparseRows,
                sparseRows.length,
                numCols,
                datasetParams,
                null,
                datasetOutPtr),
                "Dataset create");

        return lightgbmlib.voidpp_value(datasetOutPtr);
    }

num_cols never appeared directly as the problem in my output. It's the result of auto indices = (jintArray) jenv->CallObjectMethod(objSparseVec, sparseVectorIndices); that isn't an array but some other object.
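That (jintArray) cast in the wrapper is a C-style cast, so it is completely unchecked: it just reinterprets whatever jobject came back. A small sketch (with a hypothetical callObjectMethod stand-in) of why the equivalent mistake surfaces immediately in Java but only blows up later in JNI:

    public class UncheckedCastSketch {
        // Stand-in for jenv->CallObjectMethod returning the wrong object:
        // here, the vector's JSON string instead of its indices array.
        static Object callObjectMethod() {
            return "{\"type\":0,\"size\":79,\"indices\":[],\"values\":[]}";
        }

        public static void main(String[] args) {
            Object result = callObjectMethod();
            try {
                int[] indices = (int[]) result;  // JNI would not check this cast
                System.out.println(indices.length);
            } catch (ClassCastException e) {
                // Java's checked cast fails right away; JNI instead carries the
                // bogus reference into GetArrayLength/GetIntArrayElements.
                System.out.println("not an int[]: " + result.getClass().getName());
            }
        }
    }

In JNI the bogus reference survives the cast, so the failure only shows up later in GetArrayLength (garbage or negative lengths) or, with -Xcheck:jni, as "Non-array passed to JNI array operations".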

chris-smith-zocdoc commented 5 years ago

On Linux I'm able to reproduce the issue with -Xcheck:jni but not without it, which is interesting.

One other difference is that the failure occurs when calling GetDoubleArrayElements instead of GetIntArrayElements or GetArrayLength.

root@188164040d84:/src/lightgbm-demo# java -version
openjdk version "1.8.0_222"
OpenJDK Runtime Environment (build 1.8.0_222-8u222-b10-1ubuntu1~16.04.1-b10)
OpenJDK 64-Bit Server VM (build 25.222-b10, mixed mode)
[LightGBM] [Warning] CallObjectMethod values
[Dynamic-linking native method java.io.FileInputStream.available0 ... JNI]
[Dynamic-linking native method sun.reflect.ConstantPool.getUTF8At0 ... JNI]
[Dynamic-linking native method sun.reflect.NativeMethodAccessorImpl.invoke0 ... JNI]
[Dynamic-linking native method java.lang.reflect.Proxy.defineClass0 ... JNI]
[Dynamic-linking native method java.util.TimeZone.getSystemTimeZoneID ... JNI]
[LightGBM] [Warning] GetArrayLength; -1073186848
[LightGBM] [Warning] GetArrayLength: 0
[LightGBM] [Warning] GetIntArrayElements
[LightGBM] [Warning] GetDoubleArrayElements
FATAL ERROR in native method: Non-array passed to JNI array operations
        at com.microsoft.ml.lightgbm.lightgbmlibJNI.LGBM_DatasetCreateFromCSRSpark(Native Method)
        at com.microsoft.ml.lightgbm.lightgbmlib.LGBM_DatasetCreateFromCSRSpark(lightgbmlib.java:286)
        at program.generateSparseDataset(program.java:26)
        at program.main(program.java:60)
imatiach-msft commented 5 years ago

@chris-smith-zocdoc thanks for the repro, will take a look this weekend to see if I can get to the bottom of those failures. I suspect it is probably something similar to the windows issue where the sizes for some types are different on different systems - at least based on seeing large and negative values in the stack traces above and in the malloc errors.

imatiach-msft commented 5 years ago

I'm seeing the same error with -Xcheck:jni on Linux; not sure how to fix it yet though.

FATAL ERROR in native method: Non-array passed to JNI array operations
    at com.microsoft.ml.lightgbm.lightgbmlibJNI.LGBM_DatasetCreateFromCSRSpark(Native Method)
    at com.microsoft.ml.lightgbm.lightgbmlib.LGBM_DatasetCreateFromCSRSpark(lightgbmlib.java:286)
    at program.generateSparseDataset(program.java:28)
    at program.main(program.java:60)
chris-smith-zocdoc commented 4 years ago

@imatiach-msft Have you tried reproducing this again after adding the JNI checks? https://github.com/microsoft/LightGBM/issues/2360

It would be interesting to see whether we get any better errors.

StrikerRUS commented 4 years ago

@chris-smith-zocdoc Are you still facing this error?

Earlier you said: "On the latest master of both LightGBM + mmlspark I am no longer seeing crashes. I'll re-open if they appear again" (https://github.com/microsoft/LightGBM/issues/2392#issuecomment-540001969)

chris-smith-zocdoc commented 4 years ago

I've been using dense vectors to avoid this method for my current work. I'll try it again later and report back.

StrikerRUS commented 4 years ago

@chris-smith-zocdoc Any news?

chris-smith-zocdoc commented 4 years ago

I'm still seeing this bug (https://github.com/microsoft/LightGBM/issues/2392#issuecomment-545192286), so it'd be hard to tell them apart at this point.

guolinke commented 4 years ago

@imatiach-msft is this fixed? Is there anything I can help with?