microsoft / SynapseML

Simple and Distributed Machine Learning
http://aka.ms/spark
MIT License
5.05k stars 830 forks

Featurizer should provide option to pass through missing values as Double.NaN instead of removing rows (currently the default) #304

Open ekaterina-sereda-rf opened 6 years ago

ekaterina-sereda-rf commented 6 years ago

Hi! Using LightGBM I ran into another problem. I'm not sure if it is a bug or a feature :) but our data has a lot of empty values, so we previously stored features as sparse vectors, and that worked fine with our previous library. But when I tried the featurizer you provide, I noticed that it skips all rows in which any feature is null. You can see it in the example in the attachment. So is it possible to use a sparse feature vector for LightGBM training?

https://gist.github.com/ekaterina-sereda-rf/929183b9bcbbf5baf15eec3e81329992

imatiach-msft commented 6 years ago

@ekaterina-sereda-rf thanks for bringing this issue up -- yes, the featurizer skips all rows with missing values in the featurizeData method. You can replace the missing values with zeros (so the rows are kept) using the CleanMissingData transform we have: https://github.com/Azure/mmlspark/blob/master/src/clean-missing-data/src/main/scala/CleanMissingData.scala

Here's an example pyspark notebook that uses it: https://github.com/Azure/mmlspark/blob/master/notebooks/samples/104%20-%20Price%20Prediction%20Regression%20Auto%20Imports.ipynb

Or you could use Spark's operations directly to remove missing values from columns (for example, .na.drop or something similar). Would that resolve the issue? Otherwise, what would you prefer the featurizer to do -- should it take an extra parameter to clean out missing values, clean them by default, or simply error out on missing values?

troszok commented 6 years ago

@imatiach-msft thanks for the fast response. I work together with @ekaterina-sereda-rf on this. I think the way MMLSpark/LightGBM handles sparse data could be more explicit.

We got the following errors, probably because only ~3% of our rows are free of NAs, so after LightGBMUtil.featurize some of the partitions became empty:

Caused by: java.util.NoSuchElementException: next on empty iterator
at scala.collection.Iterator$$anon$2.next(Iterator.scala:39)
at scala.collection.Iterator$$anon$2.next(Iterator.scala:37)
at scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:63)
at scala.collection.IterableLike$class.head(IterableLike.scala:107)
at scala.collection.mutable.ArrayOps$ofRef.scala$collection$IndexedSeqOptimized$$super$head(ArrayOps.scala:186)
at scala.collection.IndexedSeqOptimized$class.head(IndexedSeqOptimized.scala:126)
at scala.collection.mutable.ArrayOps$ofRef.head(ArrayOps.scala:186)
at com.microsoft.ml.spark.TrainUtils$.translate(TrainUtils.scala:27)
at com.microsoft.ml.spark.TrainUtils$.trainLightGBM(TrainUtils.scala:203)
at com.microsoft.ml.spark.LightGBMRegressor$$anonfun$1.apply(LightGBMRegressor.scala:77)
at com.microsoft.ml.spark.LightGBMRegressor$$anonfun$1.apply(LightGBMRegressor.scala:77)
at org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$6.apply(objects.scala:196)
at org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$6.apply(objects.scala:193)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

Is there any particular reason why NaNs/nulls are not supported in LightGBM/MMLSpark? I thought LightGBM supports them: https://github.com/Microsoft/LightGBM/blob/master/docs/Advanced-Topics.rst

We still want to train on the whole dataset (so we do not want to drop anything), and in our data there is a big difference between a missing value and 0, so we want to differentiate the two cases.

imatiach-msft commented 6 years ago

@troszok I see -- yes, based on that documentation we can pass nulls to LightGBM directly as Double.NaN. It makes sense that the algorithm should be able to handle missing values. In fact, if you don't use my featurizer you can do this yourself: just build the features column with Spark's VectorAssembler directly. Note that the LightGBM learner has the same interface as other Spark ML learners; the LightGBMUtil.featurize method is just a convenience utility and is not required. It would be nice to add an option to the featurizer to treat missing values as NaNs. The LightGBM learner itself shouldn't have a problem, as long as all null values are replaced with Double.NaN.
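The null-to-NaN replacement described above can be illustrated in plain Python. The row layout here is hypothetical; in Spark you would do the equivalent replacement on the DataFrame (e.g. with na.fill or a UDF) before assembling the features column with VectorAssembler:

```python
import math

# Hypothetical rows where missing entries are represented as None (null)
rows = [
    {"f0": 1.0, "f1": None},
    {"f0": None, "f1": 3.5},
    {"f0": 2.0, "f1": 4.0},
]

def nulls_to_nan(row):
    # Replace null (None) entries with NaN so a learner that understands
    # NaN-as-missing (like LightGBM) can still use the row, instead of
    # the row being dropped and 0 being conflated with "missing".
    return {k: (float("nan") if v is None else v) for k, v in row.items()}

cleaned = [nulls_to_nan(r) for r in rows]
```

After this transformation all rows are kept, and a missing value stays distinguishable from an actual 0.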

troszok commented 6 years ago

@imatiach-msft great! Thank you for the explanation -- from all the examples I had assumed this was the expected way to use LightGBM in MMLSpark and that these were inherent limitations. I think we can close this issue now.

imatiach-msft commented 6 years ago

@troszok this sounds like a limitation in the featurizer we provide: https://github.com/Azure/mmlspark/blob/master/src/featurize/src/main/scala/Featurize.scala Perhaps we need an additional option there, so let's keep the issue open for that. It isn't related to the LightGBM learner itself, though. I'll update the title.