linkedin / isolation-forest

A Spark/Scala implementation of the isolation forest unsupervised outlier detection algorithm with support for exporting in ONNX format.

Unable to save model #33

Closed DnyaneshPatil23 closed 1 year ago

DnyaneshPatil23 commented 2 years ago

I am using Spark 3.1 and Scala 2.12, with the following isolation-forest artifact from Maven:

com.linkedin.isolation-forest isolation-forest_3.0.0_2.12

Recently I started getting the error below:

java.lang.NoClassDefFoundError: org/json4s/JsonAssoc$
    at com.linkedin.relevance.isolationforest.IsolationForestModelReadWrite$IsolationForestModelWriter.saveImpl(IsolationForestModelReadWrite.scala:239)
    at org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:168)

Below is our code.

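import com.linkedin.relevance.isolationforest.IsolationForest
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

def generateAnomalyScoreUsingIsolationForest(spark: SparkSession, year: String, month: String, day: String): Unit = {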

    spark.conf.set("spark.sql.legacy.replaceDatabricksSparkAvro.enabled", "true")

    val model_path = f"/iforest_$year%s_$month%s_$day%s.model"
    val data_path = f"/anomalyScores_$year%s_$month%s_$day%s.parquet/"

    val df_final_table = spark.sql("select * from AppFeatures_v2")
    val cols = df_final_table.columns
    val labelCol = cols.slice(0,1).mkString("")

    val assembler = new VectorAssembler().setInputCols(cols.slice(1, cols.length)).setOutputCol("features")

    val data = assembler.transform(df_final_table).select(col("features"), col(labelCol).as("label"))

    val contamination = 0.002
    val max_samples = 0.3
    val max_features = 0.4
    val num_estimator = 1000

    val isolationForest = (new IsolationForest()
            .setNumEstimators(num_estimator)
            .setBootstrap(false)
            .setMaxSamples(max_samples)
            .setMaxFeatures(max_features)
            .setFeaturesCol("features")
            .setPredictionCol("predictedLabel")
            .setScoreCol("outlierScore")
            .setContamination(contamination)
            .setContaminationError(0.01 * contamination)
            .setRandomSeed(21))

    val isolationForestModel = isolationForest.fit(data)

    val dataWithScores = isolationForestModel.transform(data)

    // The save call on the next line is where the NoClassDefFoundError is thrown
    isolationForestModel.write.overwrite().save("/iforest_latest.model")
    isolationForestModel.write.overwrite().save(model_path)

    dataWithScores.select("label", "predictedLabel","outlierScore").write.mode("overwrite").option("overwriteSchema", "true").parquet(data_path)
}

It was working until a couple of weeks ago. Can anyone help solve this problem?

jverbus commented 2 years ago

You mentioned that it was working until several weeks ago. Has anything changed on your side (e.g., Spark / Scala versions used on your cluster)?

Several JSON-related issues with model I/O have been reported and resolved in prior tickets. I'd suggest looking through those to see whether any are relevant.

There are isolation-forest artifacts built for Spark 3.1.1 and Scala 2.12 (Maven Central). I'd suggest using a version that matches your environment.
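
One way to check what the environment actually provides is to ask the JVM where it loads the relevant classes from. The sketch below is illustrative only and is not part of the isolation-forest project: `ClasspathCheck` and `locate` are hypothetical names, and the idea that a missing or mismatched json4s jar is involved is an assumption based on the class named in the stack trace above.

```scala
object ClasspathCheck {
  // Reports where the JVM would load a class from, without initializing it.
  // Hypothetical helper for illustration; not part of any library.
  def locate(className: String): String =
    try {
      val cls = Class.forName(className, false, getClass.getClassLoader)
      Option(cls.getProtectionDomain.getCodeSource)
        .flatMap(cs => Option(cs.getLocation))
        .map(_.toString)
        .getOrElse("(unknown source)")
    } catch {
      case _: ClassNotFoundException => "NOT FOUND on classpath"
      case e: LinkageError           => s"linkage error: ${e.getMessage}"
    }

  def main(args: Array[String]): Unit = {
    println(s"Scala version: ${scala.util.Properties.versionNumberString}")
    // Class names taken from the stack trace in this issue.
    val names = Seq(
      "org.json4s.JsonAssoc$",
      "org.apache.spark.ml.util.MLWriter",
      "com.linkedin.relevance.isolationforest.IsolationForestModelReadWrite$IsolationForestModelWriter"
    )
    names.foreach(n => println(s"$n -> ${locate(n)}"))
  }
}
```

Running `ClasspathCheck.main(Array.empty)` on the cluster (for example, pasted into spark-shell) prints the jar each class resolves to. If `org.json4s.JsonAssoc$` is reported as not found, or resolves to a json4s jar other than the one shipped with the cluster's Spark, that would be consistent with a dependency mismatch introduced by an environment change.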

jverbus commented 1 year ago

Closing this as there have been no replies for several months.