locationtech / rasterframes

Geospatial Raster support for Spark DataFrames
http://rasterframes.io
Apache License 2.0

How to resolve the error “ML algorithm was given empty dataset.”? Why is the array “[0, 1, 8, 9, 10]” in the code “rf_local_is_in('scl', [0, 1, 8, 9, 10])”? #542

Closed. JenniferYingyiWu2020 closed this issue 3 years ago.

JenniferYingyiWu2020 commented 3 years ago

Hi, I need to construct my own data set for supervised machine learning, but when I run the code the error “org.apache.spark.SparkException: ML algorithm was given empty dataset.” occurs. Before this, the supervised machine learning code ran successfully on the data set of eleven bands of 60 meter resolution Sentinel-2 imagery (https://sentinel.esa.int/web/sentinel/user-guides/sentinel-2-msi/resolutions/spatial). Because I want to run the code on my own data set, I prepared an image set of 11 “.tiff” bands. I also generated an “SCL.tif” (scene classification (SCL) data) after carefully reading the Scene Classification (SC) section of the Level-2A Algorithm (https://sentinel.esa.int/web/sentinel/technical-guides/sentinel-2-msi/level-2a/algorithm). However, the following error occurred (see the attached screenshot):

    Moreover, I have noticed the code “rf_local_is_in('scl', [0, 1, 8, 9, 10])”. Could you please tell me why the integer array is “[0, 1, 8, 9, 10]”? If I use my own data set, how should I define this array?


My SCL.tif is shown in the attached screenshot.

    So, could you please give me some suggestions on how to resolve the error “ML algorithm was given empty dataset.”? Also, could you please explain why the array is “[0, 1, 8, 9, 10]” in the code “rf_local_is_in('scl', [0, 1, 8, 9, 10])”? Thanks!
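
For background on the second question: the integers in the array are the standard Sentinel-2 Level-2A scene classification (SCL) codes for pixels that are normally excluded from training: 0 = no data, 1 = saturated or defective, 8 = cloud medium probability, 9 = cloud high probability, 10 = thin cirrus. Below is a minimal sketch of how such a mask is commonly built with RasterFrames; it is not the asker's pipeline, and the file paths and band column names are illustrative assumptions. With a custom SCL-like layer, the array should list whichever class codes mark unusable pixels in that layer.

```python
# Sketch only: paths and column names are placeholders, not the asker's data.
from pyspark.sql.functions import lit
from pyrasterframes.utils import create_rf_spark_session
from pyrasterframes.rasterfunctions import rf_local_is_in, rf_mask_by_value

spark = create_rf_spark_session()

# Read one image band and the matching SCL layer as tile columns.
catalog = spark.createDataFrame(
    [('file:///data/B02.tif', 'file:///data/SCL.tif')], ['b02', 'scl'])
df = spark.read.raster(catalog, catalog_col_names=['b02', 'scl'])

# SCL classes 0 (no data), 1 (saturated/defective), 8 and 9 (cloud medium/
# high probability) and 10 (thin cirrus) are flagged as unusable pixels.
unusable = rf_local_is_in('scl', [0, 1, 8, 9, 10])

# Where the flag is 1 the band cell becomes NoData, so it is dropped later
# (e.g. by a NoDataFilter stage) before model fitting.
df = df.withColumn('b02_masked', rf_mask_by_value('b02', unusable, lit(1)))
```
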
JenniferYingyiWu2020 commented 3 years ago

Hi, my own image data set for supervised machine learning is at https://github.com/JenniferYingyiWu2020/rasterframes-GeoTIFFs/tree/main/image-dataset/20200613clip. I have also modified the supervised machine learning code (https://github.com/JenniferYingyiWu2020/rasterframes-GeoTIFFs/blob/main/machine-learning/supervised_machine_learning.py). The error log from running it is at https://github.com/JenniferYingyiWu2020/rasterframes-GeoTIFFs/blob/main/error-logs/ML_algorithm_given_empty_dataset.log. So, could you please give me suggestions on how to resolve the error "ML algorithm was given empty dataset."? Thanks!
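
One hedged way to narrow this down is to check how many usable cells and rows actually survive before pipeline.fit is called. The following is a diagnostic sketch, not part of the asker's script; the masked band column name 'b02_masked' and the model_input variable are illustrative assumptions based on the linked code.

```python
# Diagnostic sketch: check whether any usable (non-NoData) cells remain after
# masking, and whether the DataFrame handed to the pipeline has any rows.
from pyspark.sql.functions import sum as sql_sum
from pyrasterframes.rasterfunctions import rf_data_cells, rf_no_data_cells

df.select(
    sql_sum(rf_data_cells('b02_masked')).alias('usable_cells'),
    sql_sum(rf_no_data_cells('b02_masked')).alias('masked_cells'),
).show()

# Zero rows here guarantees the "empty dataset" error. Even with some rows,
# a NoDataFilter stage inside the pipeline can still drop every exploded
# cell (e.g. if the mask or the label tiles are all NoData), which produces
# the same error at fit time.
print('rows passed to pipeline.fit:', model_input.count())
```
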

JenniferYingyiWu2020 commented 3 years ago

Hi, when the supervised machine learning script (https://github.com/JenniferYingyiWu2020/rasterframes-GeoTIFFs/blob/main/machine-learning/supervised_machine_learning.py) runs on my own data set (https://github.com/JenniferYingyiWu2020/rasterframes-GeoTIFFs/tree/main/image-dataset/20200613clip), the error "ML algorithm was given empty dataset" still occurs, and I am quite confused by it. Could you please give me some suggestions? Thanks! I searched for the error "ML algorithm was given empty dataset." on Google and found the exception message in https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/Classifier.scala.
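
Since the exception message was traced to Spark's own Classifier.scala, it is worth noting that Spark ML throws it whenever a classifier receives no usable labelled rows: either the training DataFrame has zero rows, or every value in the label column is null. The following is a minimal, hedged reproduction that is independent of RasterFrames; it is a sketch, not the asker's code.

```python
# Minimal reproduction sketch (not the asker's script): fitting a Spark ML
# classifier on a zero-row DataFrame raises the same exception, which shows
# the message comes from Spark ML's check on the training data itself.
from pyspark.sql import SparkSession
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.linalg import VectorUDT
from pyspark.sql.types import StructType, StructField, DoubleType

spark = SparkSession.builder.master('local[*]').getOrCreate()

schema = StructType([
    StructField('features', VectorUDT()),
    StructField('label', DoubleType()),
])
empty = spark.createDataFrame([], schema)

# Raises: org.apache.spark.SparkException: ML algorithm was given empty dataset.
DecisionTreeClassifier(labelCol='label', featuresCol='features').fit(empty)
```

In a RasterFrames pipeline this situation typically arises when, after masking and NoData filtering, no cells with a non-null label remain, for example because the label raster does not overlap the imagery or because the mask removed every cell.
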

JenniferYingyiWu2020 commented 3 years ago

One more point to add here: when execution reaches the line "model = pipeline.fit(model_input)", the error stack is as follows:

    ERROR Instrumentation: org.apache.spark.SparkException: ML algorithm was given empty dataset.
        at org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:113)
        at org.apache.spark.ml.classification.DecisionTreeClassifier$$anonfun$train$1.apply(DecisionTreeClassifier.scala:106)
        at org.apache.spark.ml.classification.DecisionTreeClassifier$$anonfun$train$1.apply(DecisionTreeClassifier.scala:101)
        at org.apache.spark.ml.util.Instrumentation$$anonfun$11.apply(Instrumentation.scala:185)
        at scala.util.Try$.apply(Try.scala:192)
        at org.apache.spark.ml.util.Instrumentation$.instrumented(Instrumentation.scala:185)
        at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:101)
        at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:46)
        at org.apache.spark.ml.Predictor.fit(Predictor.scala:118)
        at org.apache.spark.ml.Predictor.fit(Predictor.scala:82)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:282)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:238)
        at java.lang.Thread.run(Thread.java:748)

JenniferYingyiWu2020 commented 3 years ago

I have resolved the issue through my own efforts, so I will close it.

dcooper46 commented 1 month ago

What effort did you take to resolve the issue? What was the root cause?