Mikkomario closed this 3 months ago
This is probably related to the data schema. I'm not sure how this code works:
val (data, testingData, context) = collectTrainingData(...)
So it is impossible for me to reproduce the issue, even with your data file. Nominal values are stored as integers (with a string representation). Note that nominal values start at 0. I suspect that your data values are not in the range [0, k), where k is the number of levels of the nominal variable.
Your suspicion is probably correct. The Javadoc on the relationship between levels and values was brief, so I didn't realize I had to make that distinction. Thank you for your insights.
I just confirmed that you were correct. The cause of this error seems to be that the nominal values were not assigned correctly. It might be useful to mention this requirement in the Javadoc, or alternatively to add some sort of check (for example, throwing an IllegalArgumentException). Thank you very much for your help, @haifengl.
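For anyone hitting the same error, the kind of check suggested above could be sketched as follows. This is only an illustration: `NominalCheck` and `requireInRange` are hypothetical names, not part of SMILE, and the sketch assumes the nominal column has already been extracted as a plain sequence of Int codes.

```scala
// Hypothetical helper, not part of SMILE: verifies that every nominal
// code lies in [0, levels) before the data is handed to training.
object NominalCheck {
  /** Throws IllegalArgumentException for the first code outside [0, levels). */
  def requireInRange(column: String, codes: Seq[Int], levels: Int): Unit =
    codes.zipWithIndex.foreach { case (code, row) =>
      if (code < 0 || code >= levels)
        throw new IllegalArgumentException(
          s"Column '$column', row $row: nominal value $code is outside [0, $levels)")
    }
}
```

With three levels, `NominalCheck.requireInRange("airline", Seq(0, 2, 1), 3)` returns normally, whereas codes such as `Seq(17, 42)` with two levels fail fast with a descriptive message instead of an array index error deep inside the tree code.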
Describe the bug
When training a Gradient Boosted Regression Tree model using gmb(...), the function throws, as it attempts to access a non-existent index in an array.

Expected behavior
I expected the function to run normally and to complete the training. In case the input data is faulty, I'd have expected the function to throw an IllegalArgumentException or something similar.
Actual behavior
gmb throws an exception. Here's the stack trace:
The stack trace refers to this line in RegressionTree.java:
trueCount[idx] += samples[o];
Code snippet
Here's the code that produced this error. I removed additional prints, etc. The training data was read from a local database. I've attached a CSV file generated using Write(df, ...).

Here's the StructType used, as a String, for reference:
Here's code that performs the same function with the same data, but with a manually created StructType instance and a DataFrame read from a CSV file. However, this doesn't reproduce the error.
Could the issue be related to the DataFrame instance, somehow? df.toString and df.summary yield the same results in both versions. However, in the original code, the nominal values (airline, destination, aircraftType) are represented with integers. In the CSV-based code they are represented with Strings only.
I used this code to construct the original NominalScale instances:
These use a StoredCode class, which maps a String code to an Int database row id.
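Since database row ids are generally not a contiguous range starting at 0, one way to satisfy the [0, k) requirement discussed in the comments is to re-index the ids before building the nominal columns. The following is a minimal sketch under that assumption; `DenseCodes`, `indexById` and `encode` are hypothetical names, and the ids are treated as plain Int values rather than StoredCode instances.

```scala
// Hypothetical helper, not part of SMILE or of the original code:
// re-indexes arbitrary database row ids to dense codes 0 .. k-1.
object DenseCodes {
  /** Maps each distinct id to an index in [0, k), in ascending id order. */
  def indexById(ids: Seq[Int]): Map[Int, Int] =
    ids.distinct.sorted.zipWithIndex.toMap

  /** Rewrites a column of ids to dense indices; throws if an id is unknown. */
  def encode(ids: Seq[Int], index: Map[Int, Int]): Seq[Int] =
    ids.map(index)
}
```

For example, the ids `Seq(104, 17, 104, 523)` yield the mapping 17 -> 0, 104 -> 1, 523 -> 2, so the column is encoded as `Seq(1, 0, 1, 2)`. The level names given to the scale would then presumably need to be listed in the same ascending-id order so that level i corresponds to code i.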
Input data
The DataFrame instance (df) used in the above code is attached as a separate CSV file: dataframe-censored.csv
Additional context
Java version: 1.8.0_402 from OpenJDK
Scala version: 2.13.14
SMILE version: 3.1.1
OS: Linux