hortonworks-spark / spark-atlas-connector

A Spark Atlas connector to track data lineage in Apache Atlas
Apache License 2.0

java.lang.NoClassDefFoundError: org/apache/atlas/ApplicationProperties #299

Open terriblegirl opened 4 years ago

terriblegirl commented 4 years ago

Spark version 2.4.5, Atlas version 2.0.0. Building with Maven via mvn package -DskipTests succeeds! (screenshot) I copied 1100-spark_model.json to /models/1000-Hadoop.

Then I execute:

spark-shell --jars spark-atlas-connector_2.11-0.1.0-SNAPSHOT.jar \
  --conf spark.extraListeners=com.hortonworks.spark.atlas.SparkAtlasEventTracker \
  --conf spark.sql.queryExecutionListeners=com.hortonworks.spark.atlas.SparkAtlasEventTracker \
  --conf spark.sql.streaming.streamingQueryListeners=com.hortonworks.spark.atlas.SparkAtlasStreamingQueryEventTracker

(screenshot) But I compiled successfully! Why does it say java.lang.NoClassDefFoundError: org/apache/atlas/ApplicationProperties? What can I do?

shivsood commented 3 years ago

Looks like you missed supplying the application properties file.

dhineshns commented 3 years ago

any updates on this?

YanXiangSong commented 3 years ago

This is due to a missing jar. You can use spark-atlas-connector-assembly-0.1.0-SNAPSHOT.jar from the spark-atlas-connector-assembly/target directory. But with that one I'm running into this instead:

java.lang.ClassCastException: java.util.ArrayList cannot be cast to java.lang.String
	at com.hortonworks.spark.atlas.AtlasClientConf.get(AtlasClientConf.scala:50)
	at com.hortonworks.spark.atlas.types.AtlasEntityUtils$class.clusterName(AtlasEntityUtils.scala:29)
	at com.hortonworks.spark.atlas.sql.CommandsHarvester$.clusterName(CommandsHarvester.scala:45)
	at com.hortonworks.spark.atlas.types.AtlasEntityUtils$class.tableToEntity(AtlasEntityUtils.scala:60)
	at com.hortonworks.spark.atlas.sql.CommandsHarvester$.tableToEntity(CommandsHarvester.scala:45)
	at com.hortonworks.spark.atlas.sql.CommandsHarvester$InsertIntoHiveTableHarvester$.harvest(CommandsHarvester.scala:56)
	at com.hortonworks.spark.atlas.sql.SparkExecutionPlanProcessor$$anonfun$2.apply(SparkExecutionPlanProcessor.scala:126)
	at com.hortonworks.spark.atlas.sql.SparkExecutionPlanProcessor$$anonfun$2.apply(SparkExecutionPlanProcessor.scala:89)
	at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
	at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
	at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
	at com.hortonworks.spark.atlas.sql.SparkExecutionPlanProcessor.process(SparkExecutionPlanProcessor.scala:89)
	at com.hortonworks.spark.atlas.sql.SparkExecutionPlanProcessor.process(SparkExecutionPlanProcessor.scala:63)
	at com.hortonworks.spark.atlas.AbstractEventProcessor$$anonfun$eventProcess$1.apply(AbstractEventProcessor.scala:72)
	at com.hortonworks.spark.atlas.AbstractEventProcessor$$anonfun$eventProcess$1.apply(AbstractEventProcessor.scala:71)
	at scala.Option.foreach(Option.scala:257)
	at com.hortonworks.spark.atlas.AbstractEventProcessor.eventProcess(AbstractEventProcessor.scala:71)
	at com.hortonworks.spark.atlas.AbstractEventProcessor$$anon$1.run(AbstractEventProcessor.scala:38)

kennydataml commented 3 years ago

Looks like you missed supplying the application properties file.

This is partially correct. As per the README, atlas-application.properties needs to be discoverable by Spark, i.e. it needs to be on the classpath (in cluster mode, use --files to ship it to the executors).

You also need to either

  1. provide the Apache Atlas jars (atlas-intg, plus many other jar dependencies) to spark-submit, or
  2. use the fat jar under spark-atlas-connector-assembly/target (a launch sketch follows below).
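
For option 2, a minimal sketch of what kennydataml describes, assuming the assembly jar has been built and atlas-application.properties sits in the launch directory; the jar, properties, and application paths are placeholders, and the listener --conf settings from the original command still apply:

spark-submit \
  --jars spark-atlas-connector-assembly-0.1.0-SNAPSHOT.jar \
  --files atlas-application.properties \
  your-application.jar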

NOTE: I am trying to make this work in Azure Databricks, which requires an init script.

I am only using RestAtlasClient.scala. This leverages AtlasClientConf.scala, which uses ApplicationProperties.java. Take a look at ApplicationProperties.java in the Atlas repo: if ATLAS_CONFIGURATION_DIRECTORY_PROPERTY == null, it searches the classpath using ApplicationProperties.class.getClassLoader(), which seems completely useless here because that class lives under the webapp section of Atlas.
Does that mean there is an assumption that Spark workloads run on the same VM as the Atlas web app? This is unclear to me.

If you look at the static fields of the ApplicationProperties class, you can see that ATLAS_CONFIGURATION_DIRECTORY_PROPERTY is the Java system property "atlas.conf". This Stack Overflow post has a comment showing that if you set System.setProperty("atlas.conf", "<path to your properties>") in your Spark job, then it will work.

Spark Conf

extra class path (not working)

I've tried setting the following spark conf options during spark-submit:
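
(The exact conf options are not preserved in this copy of the comment; judging by the section title, they were presumably along these lines, with the folder path as a placeholder:)

--conf spark.driver.extraClassPath=/path/to/properties-folder/ \
--conf spark.executor.extraClassPath=/path/to/properties-folder/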

I tried multiple variations of folder paths, using the name of the file, not using the name of the file, using local:/folderpath, etc.
This does not work.
Log output:

21/03/30 18:54:46 INFO ApplicationProperties: Looking for atlas-application.properties in classpath
21/03/30 18:54:46 INFO ApplicationProperties: Looking for /atlas-application.properties in classpath
21/03/30 18:54:46 INFO ApplicationProperties: Loading atlas-application.properties from null

Summarized error:

ERROR SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: Exception when registering SparkListener
...
Caused by: org.apache.atlas.AtlasException: Failed to load application properties
...
Caused by: org.apache.commons.configuration.ConfigurationException: Cannot locate configuration source null

We can see that the url variable is null.

extra java options (working)

I then tried setting Java system properties, specifically atlas.conf. There are two ways to do this:

  1. using spark-defaults.conf. The default Spark properties file is $SPARK_HOME/conf/spark-defaults.conf (see the snippet after this list)
  2. --conf "spark.driver.extraJavaOptions=-Datlas.conf=path/to/properties-folder/" 
    --conf "spark.executor.extraJavaOptions=-Datlas.conf=path/to/properties-folder/"

I opted for using --conf which worked successfully.
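
Putting the pieces from this thread together for the original spark-shell invocation, a working launch might look like the sketch below; the assembly jar name and the properties folder path are placeholders, and in cluster mode you would still need to ship the properties file and point the executor-side path at it:

spark-shell --jars spark-atlas-connector-assembly-0.1.0-SNAPSHOT.jar \
  --conf "spark.driver.extraJavaOptions=-Datlas.conf=/path/to/properties-folder/" \
  --conf spark.extraListeners=com.hortonworks.spark.atlas.SparkAtlasEventTracker \
  --conf spark.sql.queryExecutionListeners=com.hortonworks.spark.atlas.SparkAtlasEventTracker \
  --conf spark.sql.streaming.streamingQueryListeners=com.hortonworks.spark.atlas.SparkAtlasStreamingQueryEventTracker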

Modified source code

I also tried setting the system property (tied to an environment variable) inside the class constructor of AtlasClientConf and in object AtlasClientConf. This didn't work either. Setting the Java system property through the Spark conf is the solution.