linkedin / isolation-forest

A Spark/Scala implementation of the isolation forest unsupervised outlier detection algorithm with support for exporting in ONNX format.

The library gives an error when writing a model using Spark 2.4 #1

Closed bhushanbalki closed 5 years ago

bhushanbalki commented 5 years ago

First of all, thanks for making the isolation forest library open source. We would like to use it with Spark 2.4.0, but our Spark 2.4 job fails with a json4s-related error while writing the model to HDFS:

```
java.lang.NoSuchMethodError: org.json4s.jackson.JsonMethods$.parse$default$3()Z
Caused by: java.lang.NoSuchMethodError: org.json4s.jackson.JsonMethods$.parse(Lorg/json4s/JsonInput;Z)Lorg/json4s/JsonAST$JValue;
```

We understand the breaking change is that Spark 2.4.0 moved to json4s version 3.5.3, while your library builds against Spark 2.3, which uses json4s version 3.2.11.

We tried building the isolation forest library against Spark 2.4, but the build fails. Can you help us make this library compatible with Spark 2.4.0? We understand the Scala code needs updating. Can you help us with it?

jverbus commented 5 years ago

Thanks for your interest in the library!

I believe the 2.4 builds are failing because, as of 2.4.0, Databricks donated their spark-avro library to Apache Spark:

https://github.com/databricks/spark-avro

This support is now built-in.

https://spark.apache.org/docs/2.4.0/sql-data-sources-avro.html

I was able to get the isolation-forest library to build successfully by changing the dependencies in the module-level build.gradle as follows:

```groovy
dependencies {
    compile("com.chuusai:shapeless_2.11:2.3.2")
//    compile("com.databricks:spark-avro_2.11:4.0.0")
    compile("org.apache.spark:spark-avro_2.11:2.4.0")
    compile("org.apache.spark:spark-core_2.11:2.4.0")
    compile("org.apache.spark:spark-mllib_2.11:2.4.0")
    compile("org.apache.spark:spark-sql_2.11:2.4.0")
    compile("org.scalatest:scalatest_2.11:2.2.6")
    compile("org.testng:testng:6.8.8")
}
```
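For anyone building against Spark 2.4 with sbt instead of Gradle, the equivalent dependency swap might look roughly like this. This is a sketch, not part of the library's official build: the coordinates are taken from the Gradle block above, and the `Test` scoping of the test frameworks is an assumption.

```scala
// build.sbt — hypothetical sbt equivalent of the Gradle dependencies above.
// The key change is replacing com.databricks:spark-avro with the
// spark-avro module that ships with Apache Spark 2.4.
libraryDependencies ++= Seq(
  "com.chuusai" %% "shapeless" % "2.3.2",
  // "com.databricks" %% "spark-avro" % "4.0.0",  // no longer needed on Spark 2.4
  "org.apache.spark" %% "spark-avro" % "2.4.0",
  "org.apache.spark" %% "spark-core" % "2.4.0",
  "org.apache.spark" %% "spark-mllib" % "2.4.0",
  "org.apache.spark" %% "spark-sql" % "2.4.0",
  "org.scalatest" %% "scalatest" % "2.2.6" % Test,  // scoping assumed
  "org.testng" % "testng" % "6.8.8" % Test          // scoping assumed
)
```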

Please let me know if this works for you.

jverbus commented 5 years ago

@bhushanbalki : Did this work for you?

bhushanbalki commented 5 years ago

Yes, it works. Thanks for your input!

fabiofabris commented 4 years ago

Hi! Thanks for this thread. I'm facing the same issue. Just to be sure: when I comment out `compile("com.databricks:spark-avro_2.11:4.0.0")`, 5 unit tests fail, all with the error:

```
org.apache.spark.sql.AnalysisException: Failed to find data source: com.databricks.spark.avro. Please find an Avro package at http://spark.apache.org/third-party-projects.html;
```

I guess this is fine, since Spark 2.4 ships a built-in replacement for com.databricks.spark.avro. Is that correct?
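A note on that `AnalysisException`: with the built-in data source, the format name changes from `com.databricks.spark.avro` to plain `avro`, and Spark 2.4 also documents a legacy flag that maps the old name onto the built-in implementation, which can help when the old format string is hard-coded in library or test code. A minimal sketch, assuming a running `SparkSession` named `spark` and a DataFrame `df`:

```scala
// Spark 2.4+: the built-in Avro data source is registered under the
// short name "avro" rather than "com.databricks.spark.avro".
df.write.format("avro").save("/tmp/out")
val restored = spark.read.format("avro").load("/tmp/out")

// Compatibility option documented in the Spark 2.4 Avro guide: maps the
// old "com.databricks.spark.avro" name onto the built-in implementation.
spark.conf.set("spark.sql.legacy.replaceDatabricksSparkAvro.enabled", "true")
```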

jverbus commented 4 years ago

@fabiofabris: You need to not only comment out the `compile("com.databricks:spark-avro_2.11:4.0.0")` dependency but also add the `compile("org.apache.spark:spark-avro_2.11:2.4.0")` dependency. Please make sure your dependencies are exactly as shown here: https://github.com/linkedin/isolation-forest/issues/1#issuecomment-521930608

prarshah commented 4 years ago

I followed the instructions mentioned in #1, but the error persists.