saurfang / spark-knn

k-Nearest Neighbors algorithm on Spark
Apache License 2.0

knn.fit(training) throws an exception #32

Open akshaybhatt14495 opened 6 years ago

akshaybhatt14495 commented 6 years ago

I followed the instructions as given:

val training = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt").toDF()
val knn = new KNNClassifier()
  .setTopTreeSize(training.count().toInt / 500)
  .setK(10)

First error: TopTreeSize is invalid (0), since the training sample has only 100 rows. If we manually set TopTreeSize to 1, knn.fit(training) then throws this exception:

java.util.NoSuchElementException: Failed to find a default value for inputCols
    at org.apache.spark.ml.param.Params$$anonfun$getOrDefault$2.apply(params.scala:652)
    at org.apache.spark.ml.param.Params$$anonfun$getOrDefault$2.apply(params.scala:652)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.ml.param.Params$class.getOrDefault(params.scala:651)
    at org.apache.spark.ml.PipelineStage.getOrDefault(Pipeline.scala:42)
    at org.apache.spark.ml.param.Params$class.$(params.scala:658)
    at org.apache.spark.ml.PipelineStage.$(Pipeline.scala:42)
    at org.apache.spark.ml.knn.KNN.fit(KNN.scala:383)
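The "TopTreeSize is invalid 0" part of the report comes down to integer division: with 100 rows, `100 / 500` evaluates to 0. A minimal sketch of one way to guard against that is shown below. The helper name `topTreeSizeFor` and its `rowsPerUnit` parameter are hypothetical and are not part of spark-knn; the floor of 1 is only an assumption that matches the manual workaround described above.

```scala
object TopTreeSizeExample {
  // Hypothetical helper (not part of spark-knn): divides the row count by
  // rowsPerUnit, as the snippet in this thread does, but never returns less than 1.
  def topTreeSizeFor(rowCount: Long, rowsPerUnit: Int = 500): Int =
    math.max(1, (rowCount / rowsPerUnit).toInt)

  def main(args: Array[String]): Unit = {
    println(100 / 500)               // integer division truncates to 0
    println(topTreeSizeFor(100L))    // floor of 1 applies for a 100-row dataset
    println(topTreeSizeFor(60000L))  // 60000 / 500 = 120
  }
}
```

With such a helper, the setter call would read `.setTopTreeSize(TopTreeSizeExample.topTreeSizeFor(training.count()))`. Whether a top tree of size 1 is a sensible choice for a tiny dataset is a separate question.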

kaushikacharya commented 6 years ago

Hi, if you look at the example: https://github.com/saurfang/spark-knn/blob/master/spark-knn-examples/src/main/scala/com/github/saurfang/spark/ml/knn/examples/MNIST.scala

For the KNNClassifier object, it sets two column names, the features column and the prediction column:

.setFeaturesCol("pcaFeatures")
.setPredictionCol("predicted")

These seem to be missing in your case.


akshaybhatt14495 commented 6 years ago

@kaushikacharya thanks for the response. I actually need the k nearest neighbors (KNN) themselves, so for that do we need class labels in the dataset (i.e., the first entry of each row being 0 or 1)?

akshaybhatt14495 commented 6 years ago

@kaushikacharya I'm talking about KNN.scala.

akshaybhatt14495 commented 6 years ago

I got another error when running knn.fit(training):

Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: Column features must be of type org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 but was actually org.apache.spark.mllib.linalg.VectorUDT@f71b0bce.
    at scala.Predef$.require(Predef.scala:224)
    at org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:42)
    at org.apache.spark.ml.PredictorParams$class.validateAndTransformSchema(Predictor.scala:51)
    at org.apache.spark.ml.classification.Classifier.org$apache$spark$ml$classification$ClassifierParams$$super$validateAndTransformSchema(Classifier.scala:58)

kaushikacharya commented 6 years ago

Which spark version are you using?

These might be helpful for resolving the ml vs mllib error:

https://stackoverflow.com/questions/38901123/how-convert-ml-vectorudt-features-from-mllib-to-ml-type

https://spark.apache.org/docs/2.1.0/ml-migration-guides.html "While most pipeline components support backward compatibility for loading, some existing DataFrames and pipelines in Spark versions prior to 2.0, that contain vector or matrix columns, may need to be migrated to the new spark.ml vector and matrix types. Utilities for converting DataFrame columns from spark.mllib.linalg to spark.ml.linalg types (and vice versa) can be found in spark.mllib.util.MLUtils."
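The exception above reports that the `features` column holds `spark.mllib` vectors where `spark.ml` vectors are expected, so the conversion needed is from mllib to ml. A minimal sketch of the two usual routes follows, assuming Spark 2.x on the classpath. The object and method names (`VectorMigration`, `toMlVectors`, `loadLibSvm`) are illustrative only; `MLUtils.convertVectorColumnsToML` and the `libsvm` data source are the Spark APIs being exercised.

```scala
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.sql.{DataFrame, SparkSession}

object VectorMigration {
  // Route 1: convert existing spark.mllib vector columns to spark.ml vectors.
  // If no column names are given, Spark converts every mllib vector column it finds.
  def toMlVectors(df: DataFrame, cols: String*): DataFrame =
    MLUtils.convertVectorColumnsToML(df, cols: _*)

  // Route 2: read the libsvm file through the DataFrame reader, which yields
  // a "label" column and a "features" column of spark.ml vectors directly.
  def loadLibSvm(spark: SparkSession, path: String): DataFrame =
    spark.read.format("libsvm").load(path)
}
```

For the DataFrame in this thread, that would be `VectorMigration.toMlVectors(training, "features")`. Note that `convertVectorColumnsFromML` goes in the opposite direction (ml to mllib).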


akshaybhatt14495 commented 6 years ago

@kaushikacharya spark version is 2.2.0

kaushikacharya commented 6 years ago

Have a look at https://github.com/saurfang/spark-knn/blob/master/project/Dependencies.scala:

val sparktest = "org.apache.spark" %% "spark-core" % "2.1.0" % "test" classifier "tests"

Also, in build.sbt you can see commonSettings, which is defined in Common.scala. It sets:

sparkVersion := "2.1.0",

My understanding is that this repository targets Spark 2.1.0. Your use of 2.2.0 could be the reason for the errors you are facing.

akshaybhatt14495 commented 6 years ago

I changed my version and am now working with Spark 2.1.0, but I still got the same error.

akshaybhatt14495 commented 6 years ago

OK, I used the MLUtils function convertVectorColumnsFromML(training, "features") and then got a new error for the sample data in sample_libsvm_data.txt:

java.lang.IllegalArgumentException: requirement failed: Sampling fraction (1.01) must be on interval [0, 1]
    at scala.Predef$.require(Predef.scala:224)
    at org.apache.spark.util.random.BernoulliSampler.<init>(RandomSampler.scala:147)
    at org.apache.spark.rdd.RDD$$anonfun$sample$2.apply(RDD.scala:496)
    at org.apache.spark.rdd.RDD$$anonfun$sample$2.apply(RDD.scala:491)

kaushikacharya commented 6 years ago

You are facing the same issue as: https://github.com/saurfang/spark-knn/issues/21

Your error says that: Sampling fraction (1.01) must be on interval [0, 1]

The sampling fraction needs to be <= 1.

I would suggest first running on the MNIST data (mnist.bz2) from https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/. Put this data in your data folder and run the MNIST Scala example.
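To illustrate the constraint in the error message: a fraction passed to `RDD.sample` without replacement has to lie in [0, 1], so a fraction computed as "requested sample size / row count" overshoots as soon as the request exceeds the data. The sketch below is a hypothetical helper, not code from spark-knn, and it does not claim to reproduce how the library arrives at 1.01. It only shows the clamping arithmetic under that assumption.

```scala
object SamplingFractionExample {
  // Hypothetical helper (not part of spark-knn): turns a requested sample size
  // into a fraction that RDD.sample(withReplacement = false, ...) will accept.
  def safeFraction(sampleSize: Long, totalRows: Long): Double = {
    require(sampleSize >= 0, "sampleSize must be non-negative")
    require(totalRows > 0, "totalRows must be positive")
    math.min(1.0, sampleSize.toDouble / totalRows)
  }

  def main(args: Array[String]): Unit = {
    println(101.0 / 100)             // unclamped: 1.01, outside [0, 1]
    println(safeFraction(101L, 100L)) // clamped to 1.0
    println(safeFraction(50L, 100L))  // 0.5, already valid
  }
}
```

On a dataset as small as the 100-row sample file, a larger dataset such as MNIST avoids the problem without any clamping.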
