akshaybhatt14495 opened this issue 6 years ago
Hi, if you look at the example: https://github.com/saurfang/spark-knn/blob/master/spark-knn-examples/src/main/scala/com/github/saurfang/spark/ml/knn/examples/MNIST.scala
For the KNNClassifier object it sets two column names, i.e. features and prediction:
.setFeaturesCol("pcaFeatures") .setPredictionCol("predicted")
These seem to be missing in your case.
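If it helps, the relevant configuration from the linked MNIST example looks roughly like this (the topTreeSize value below is hypothetical; the example derives it from the row count):

```scala
// Sketch after the linked MNIST example: KNNClassifier is told explicitly
// which column to read features from and which column to write predictions to.
import org.apache.spark.ml.classification.KNNClassifier

val knn = new KNNClassifier()
  .setTopTreeSize(50)             // hypothetical; the example computes this from training.count()
  .setFeaturesCol("pcaFeatures")  // input column name, as in the MNIST example
  .setPredictionCol("predicted")  // output column name, as in the MNIST example
  .setK(10)
```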
On Tue, Jan 9, 2018 at 6:35 PM, akshaybhatt14495 wrote:
I followed what was there:

val training = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt").toDF()
val knn = new KNNClassifier()
  .setTopTreeSize(training.count().toInt / 500)
  .setK(10)

First error: topTreeSize is an invalid 0 (since the total count of training samples is 100). Say we manually set the tree size to 1; it then throws an exception while running knn.fit(training):
java.util.NoSuchElementException: Failed to find a default value for inputCols
  at org.apache.spark.ml.param.Params$$anonfun$getOrDefault$2.apply(params.scala:652)
  at org.apache.spark.ml.param.Params$$anonfun$getOrDefault$2.apply(params.scala:652)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.ml.param.Params$class.getOrDefault(params.scala:651)
  at org.apache.spark.ml.PipelineStage.getOrDefault(Pipeline.scala:42)
  at org.apache.spark.ml.param.Params$class.$(params.scala:658)
  at org.apache.spark.ml.PipelineStage.$(Pipeline.scala:42)
  at org.apache.spark.ml.knn.KNN.fit(KNN.scala:383)
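The topTreeSize of 0 can be illustrated in plain Scala (the row count of 100 is from the report above; the clamp is only a suggested workaround, not the library's behavior):

```scala
// Row count from the report: sample_libsvm_data.txt has 100 rows.
val count = 100L

// Integer division truncates toward zero, so 100 / 500 == 0,
// which is an invalid tree size.
val naive = (count / 500).toInt

// One workaround: clamp to at least 1 before passing it to setTopTreeSize.
val topTreeSize = math.max(1, naive)

println(naive)       // 0
println(topTreeSize) // 1
```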
@kaushikacharya thanks for the response. Actually I need the k nearest neighbors (KNN), so for that do we need classification labels in the dataset (i.e. the first entry in each case as 0 or 1)?
@kaushikacharya I'm talking about KNN.scala
Got another error running knn.fit(training):
Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: Column features must be of type org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 but was actually org.apache.spark.mllib.linalg.VectorUDT@f71b0bce.
  at scala.Predef$.require(Predef.scala:224)
  at org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:42)
  at org.apache.spark.ml.PredictorParams$class.validateAndTransformSchema(Predictor.scala:51)
  at org.apache.spark.ml.classification.Classifier.org$apache$spark$ml$classification$ClassifierParams$$super$validateAndTransformSchema(Classifier.scala:58)
Which spark version are you using?
These might be helpful for resolving the ml vs mllib error:
https://stackoverflow.com/questions/38901123/how-convert-ml-vectorudt-features-from-mllib-to-ml-type
https://spark.apache.org/docs/2.1.0/ml-migration-guides.html "While most pipeline components support backward compatibility for loading, some existing DataFrames and pipelines in Spark versions prior to 2.0, that contain vector or matrix columns, may need to be migrated to the new spark.ml vector and matrix types. Utilities for converting DataFrame columns from spark.mllib.linalg to spark.ml.linalg types (and vice versa) can be found in spark.mllib.util.MLUtils."
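A hedged sketch of the conversion the migration guide describes, assuming `training` is the DataFrame from loadLibSVMFile(...).toDF() whose "features" column holds old-style mllib vectors:

```scala
// Convert the old spark.mllib vector column to the new spark.ml vector type
// that spark.ml predictors (including KNNClassifier) expect.
import org.apache.spark.mllib.util.MLUtils

// Note the direction: convertVectorColumnsToML goes mllib -> ml.
val converted = MLUtils.convertVectorColumnsToML(training, "features")
```

convertVectorColumnsFromML goes the other way (ml -> mllib), which is not what a spark.ml predictor needs.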
@kaushikacharya the Spark version is 2.2.0
Have a look at https://github.com/saurfang/spark-knn/blob/master/project/Dependencies.scala:
val sparktest = "org.apache.spark" %% "spark-core" % "2.1.0" % "test" classifier "tests"
Also, in build.sbt you can see commonSettings, which is defined in Common.scala. This mentions:
sparkVersion := "2.1.0",
My understanding is that this repository is built against Spark 2.1.0. Your using 2.2.0 could be the reason for the errors you are facing.
I changed my version and am now working with Spark 2.1.0, but I still got the same error.
Ok, I used the MLUtils function convertVectorColumnsFromML(training, "features"), and then got a new error for the sample data given in sample_libsvm_data.txt:
java.lang.IllegalArgumentException: requirement failed: Sampling fraction (1.01) must be on interval [0, 1]
  at scala.Predef$.require(Predef.scala:224)
  at org.apache.spark.util.random.BernoulliSampler.<init>(RandomSampler.scala:147)
You are facing the same issue as: https://github.com/saurfang/spark-knn/issues/21
Your error says that: Sampling fraction (1.01) must be on interval [0, 1]
sampling fraction needs to be <= 1
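As a rough, hypothetical illustration of how the fraction can end up above 1 (assuming the sampler's fraction scales with the requested sample size relative to the row count; the exact formula lives in KNN.scala):

```scala
// Hypothetical numbers: 100 training rows, and a requested sample
// slightly larger than the dataset (101 rows' worth).
val count = 100.0
val requested = 101.0

// A fraction above 1.0 violates BernoulliSampler's requirement [0, 1].
val fraction = requested / count

println(fraction) // 1.01
```

This is why tiny datasets like sample_libsvm_data.txt trip the requirement: the tree-building sample can exceed the data itself.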
I would suggest first trying it on the MNIST data (mnist.bz2) from https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/. Put this data in your data folder and run the MNIST Scala example.