arijeetm1 opened this issue 4 years ago
Based on the line number in the stack trace:
at com.microsoft.ml.spark.lightgbm.TrainUtils$$anonfun$3.apply(TrainUtils.scala:29)
it looks like this error is due to this line:
val labels = rows.map(row => row.getDouble(schema.fieldIndex(columnParams.labelColumn)))
It looks like your label column doesn't contain doubles. I think this is already fixed in the latest master by the castColumns method in LightGBMBase.scala.
Interestingly, that would actually violate the schema of your dataset above:
StructType(List(StructField(label,DoubleType,true),StructField(features,VectorUDT,true)))
so I'm not quite sure what is happening there.
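Until that fix is picked up, a caller-side workaround is to cast the label column to doubles explicitly before training. A minimal PySpark sketch, assuming a DataFrame named data with the schema quoted above:

```python
from pyspark.sql.functions import col
from pyspark.sql.types import DoubleType

# Force the label column to DoubleType so that the Scala side's
# row.getDouble(...) call receives java.lang.Double values.
data = data.withColumn("label", col("label").cast(DoubleType()))
data.printSchema()  # label should now be reported as: double
```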
I came across the same exception when featureFraction was an Integer, but that shouldn't be the root cause in this case. Maybe some float parameter doesn't allow an Integer to be passed in?
Py4JJavaError Traceback (most recent call last)
@hebo-yang oh, this is very interesting. Based on that stack trace, it looks like some param is being converted to an int instead of a double in Java from Python, but the Scala code expects it to be a double-valued param. I think this is something in the pyspark bindings, then. I will try to run with the params you specified on a different dataset to see if I can reproduce.
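For background: Py4J boxes Python numbers by their Python type, so an int-valued hyperparameter arrives on the JVM as java.lang.Integer, which a Scala parameter expecting Double then fails to cast. A sketch of a defensive caller-side guard (as_float_params is a hypothetical helper, not part of mmlspark):

```python
def as_float_params(**params):
    """Coerce int-valued hyperparameters to float so they cross the
    Py4J bridge as java.lang.Double rather than java.lang.Integer."""
    return {
        k: float(v) if isinstance(v, int) and not isinstance(v, bool) else v
        for k, v in params.items()
    }

# ints such as alpha=1 become 1.0 before they ever reach the JVM
params = as_float_params(alpha=1, lambdaL2=1, featureFraction=0.8)
```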
@imatiach-msft Thanks! Were you able to repro this please?
Finally figured out my error... omg, the Scala code expects parameters like alpha and lambdaL2 to be floats. Mine happened to be int values, hence the error.
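In other words, passing Python float literals for these parameters avoids the failed cast. A sketch, assuming the mmlspark pyspark bindings (the exact import path may vary by version):

```python
from mmlspark.lightgbm import LightGBMRegressor

# Affected versions throw ClassCastException when these are ints:
# LightGBMRegressor(labelCol="label", featuresCol="features", alpha=1, lambdaL2=1)

# Passing float literals means the JVM receives java.lang.Double:
model = LightGBMRegressor(labelCol="label", featuresCol="features",
                          alpha=1.0, lambdaL2=1.0)
```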
Describe the bug: java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.Double during training.
To Reproduce
data.schema: StructType(List(StructField(label,DoubleType,true),StructField(features,VectorUDT,true)))
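A minimal sketch of a dataset matching that schema (values are illustrative only):

```python
from pyspark.ml.linalg import Vectors
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Matches the reported schema: label is DoubleType, features is VectorUDT.
data = spark.createDataFrame(
    [(0.0, Vectors.dense([1.0, 2.0])),
     (1.0, Vectors.dense([3.0, 4.0]))],
    ["label", "features"],
)
```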
Expected behavior: Training completes without throwing a ClassCastException.
Additional context: We could rule out this code, where getDouble is called on the label: https://github.com/Azure/mmlspark/blob/master/src/main/scala/com/microsoft/ml/spark/lightgbm/TrainUtils.scala#L98, since the value there is of type vector.
Looking through the Spark SQL query plan gives a possible explanation involving the labels: the label is cast to int during projection and then cast back to double during deserialization, which could be related to this issue.
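One way to check that theory is to print the extended query plan and look for an int cast on the label column. A minimal sketch, assuming the DataFrame from the repro above is named data:

```python
# Look for an unexpected cast of the label to int in the Project node;
# the deserializer would then try to cast it back to double.
data.explain(extended=True)
```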