microsoft / Purview-ADB-Lineage-Solution-Accelerator

A connector to ingest Azure Databricks lineage into Microsoft Purview
MIT License

Lineage missing for notebooks | #193

Closed · Kishor-Radhakrishnan closed this issue 1 year ago

Kishor-Radhakrishnan commented 1 year ago

We are missing lineage info for a few notebooks. I am pasting the error from log4j below and attaching the full logs.

23/04/04 13:00:14 WARN RddExecutionContext: Unable to access job conf from RDD
java.lang.NoSuchFieldException: Field is not instance of HadoopMapRedWriteConfigUtil
    at io.openlineage.spark.agent.lifecycle.RddExecutionContext.lambda$setActiveJob$0(RddExecutionContext.java:117)
    at java.util.Optional.orElseThrow(Optional.java:290)
    at io.openlineage.spark.agent.lifecycle.RddExecutionContext.setActiveJob(RddExecutionContext.java:115)
    at java.util.Optional.ifPresent(Optional.java:159)
    at io.openlineage.spark.agent.OpenLineageSparkListener.lambda$onJobStart$9(OpenLineageSparkListener.java:168)
    at java.util.Optional.ifPresent(Optional.java:159)
    at io.openlineage.spark.agent.OpenLineageSparkListener.onJobStart(OpenLineageSparkListener.java:165)
    at org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:37)
    at org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28)
    at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
    at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
    at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:118)
    at org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:102)
    at org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:105)
    at org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:105)
    at scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23)
    at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
    at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:100)
    at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.$anonfun$run$1(AsyncEventQueue.scala:96)
    at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1623)
    at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.run(AsyncEventQueue.scala:96)
23/04/04 13:00:14 INFO RddExecutionContext: Found job conf from RDD Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-rbf-default.xml, hdfs-site.xml, hdfs-rbf-site.xml
23/04/04 13:00:14 INFO RddExecutionContext: Found output path null from RDD MapPartitionsRDD[16] at $anonfun$execute$3 at FrameProfiler.scala:80
23/04/04 13:00:14 INFO RddExecutionContext: RDDs are empty: skipping sending OpenLineage event

log4j-active (4).txt

hmoazam commented 1 year ago

Thank you for sharing the logs! Looking at them, the source of the error appears to be the function app you're pointing to. For example, on line 4307 of the attached log4j output, you'll see the error message:

ERROR EventEmitter: Could not emit lineage w/ exception java.net.UnknownHostException: functionappkae2.azurewebsites.net: Name or service not known
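
To confirm which endpoint the listener is trying to emit to, one option is to print the OpenLineage-related Spark properties from a notebook cell on the affected cluster. This is a minimal sketch: the property names below are assumptions based on the accelerator's documented setup and may differ between OpenLineage listener versions, and `spark` is the session Databricks predefines in notebooks.

```python
# Sketch: print OpenLineage-related Spark properties to see which host the
# listener is configured to send lineage events to. Property names are
# assumptions and may vary by OpenLineage version.
conf = spark.sparkContext.getConf()  # `spark` is predefined in Databricks notebooks
for key in ("spark.extraListeners",
            "spark.openlineage.host",
            "spark.openlineage.namespace",
            "spark.openlineage.version"):
    print(f"{key} = {conf.get(key, '<not set>')}")
```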

Can you ensure your function app is running and accessible from the Databricks workspace? Please see the following section in our troubleshooting guide for details on what to check.
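
As a quick way to test that from inside the workspace, here is a hedged sketch you could run in a notebook cell on the same cluster. The hostname is taken from the UnknownHostException in your log; substitute your own function app's hostname.

```python
import socket
import urllib.request
import urllib.error

# Hostname copied from the error in the attached log; replace with your function app's hostname.
host = "functionappkae2.azurewebsites.net"

# 1) Can the cluster resolve the name at all? (UnknownHostException means this step fails.)
try:
    print("Resolved:", socket.gethostbyname(host))
except socket.gaierror as e:
    print("DNS resolution failed:", e)

# 2) Can the cluster reach the app over HTTPS? Any HTTP response (even 401/404)
#    means the endpoint is network-reachable; a timeout or connection error is not.
try:
    with urllib.request.urlopen(f"https://{host}", timeout=10) as resp:
        print("HTTP status:", resp.status)
except urllib.error.HTTPError as e:
    print("Reachable, HTTP status:", e.code)
except Exception as e:
    print("Not reachable:", e)
```

If the name does not resolve or the request times out, the fix is on the networking side (function app stopped, DNS, firewall, or VNet configuration) rather than in the notebooks themselves.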

Kishor-Radhakrishnan commented 1 year ago

We have identified the issue. Please close this issue.