paiqo / Databricks-VSCode

VSCode extension to work with Databricks
GNU General Public License v3.0

Notebooks lose context after a couple of seconds #132

Open edgarcutar opened 1 year ago

edgarcutar commented 1 year ago

Something is weird about the notebooks session using a cluster kernel. After a couple of seconds without running any cell, all stored objects and even modules imported are lost.

Is this expected behavior? If not, how would you be able to work around it?

In the screenshot below I ran both cells within 30 seconds and still got this error.

[screenshot]

If I run the first two cells immediately and, 30 seconds later, try to run the next one, again I have an issue.

[screenshot]
gbrueckl commented 1 year ago

This is definitely not expected; the session should live much longer than 30 seconds. One possibility is that your session is being cancelled due to high load on the cluster you are using, but that is just a guess.

Can you also check the Output window for paiqo.databricks-vscode?

gbrueckl commented 1 year ago

I just ran some quick tests and it works as expected in my environment. If sessions regularly expired after a couple of seconds, more people would have complained by now, so I think it must be related to something specific to your setup - can you share some more information here?

Which connection manager are you using? Any specifics about the cluster?

edgarcutar commented 1 year ago

This is definitely not expected; the session should live much longer than 30 seconds. One possibility is that your session is being cancelled due to high load on the cluster you are using, but that is just a guess.

That might be the case; however, even when I start a single cluster and I'm the only one using it, I still have the same issue. It's worth noting that this never happens if I use the notebook inside the Databricks UI.

Can you also check the Output window for paiqo.databricks-vscode?

Something here is fishy

When I run some code, initially things seem to work as expected:

Executing python:
from pyspark.sql.functions import col
from pyspark.sql import Window
import pyspark.sql.functions as f
POST 1.2/commands/execute --> {"clusterId":"1125-170227-90ljrz5n","language":"python","contextId":"9070729930757480042","command":"from pyspark.sql.functions import col\nfrom pyspark.sql import Window\nimport pyspark.sql.functions as f"}
Response: {"id":"c5bd596b-2e57-4225-8566-df295fb1b690"}
GET 1.2/commands/status --> {"clusterId":"1125-170227-90ljrz5n","contextId":"9070729930757480042","commandId":"c5bd596b-2e57-4225-8566-df295fb1b690"}
Response: {"id":"c5bd596b-2e57-4225-8566-df295fb1b690","status":"Finished","results":{"resultType":"text","data":""}}

After some seconds, this happens:

GET 2.0/clusters/list --> null

After this, a ridiculously long response appears that finishes with the lines below. I uploaded the full thing here.

POST 1.2/contexts/destroy  --> {"clusterId":"1125-170227-90ljrz5n","contextId":"492590762552339676"}
Response: {"id":"492590762552339676"}

After this, all context is lost.
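(For reference, the calls in the log above are the legacy Databricks 1.2 Command Execution API. Below is a minimal sketch of that lifecycle in TypeScript - assuming hypothetical DATABRICKS_HOST / DATABRICKS_TOKEN environment variables, and not the extension's actual code. The key point is that once contexts/destroy runs, every variable and import held by that execution context is gone.)

// Sketch of the 1.2 Command Execution API flow visible in the log above.
// DATABRICKS_HOST / DATABRICKS_TOKEN are assumed environment variables.
const host = process.env.DATABRICKS_HOST!;
const headers = {
  Authorization: `Bearer ${process.env.DATABRICKS_TOKEN}`,
  "Content-Type": "application/json",
};

async function api(method: "GET" | "POST", path: string, body?: object): Promise<any> {
  const res = await fetch(`${host}/api/${path}`, {
    method,
    headers,
    body: method === "POST" && body ? JSON.stringify(body) : undefined,
  });
  return res.json();
}

async function runCommand(clusterId: string, command: string): Promise<void> {
  // 1) Create an execution context - it holds all Python state on the driver.
  const ctx = await api("POST", "1.2/contexts/create", { clusterId, language: "python" });

  // 2) Submit a command against that context.
  const cmd = await api("POST", "1.2/commands/execute", {
    clusterId, contextId: ctx.id, language: "python", command,
  });

  // 3) Poll until the command is no longer queued/running.
  let status;
  do {
    await new Promise((resolve) => setTimeout(resolve, 1000));
    status = await api("GET",
      `1.2/commands/status?clusterId=${clusterId}&contextId=${ctx.id}&commandId=${cmd.id}`);
  } while (status.status === "Queued" || status.status === "Running");
  console.log(status.results);

  // 4) Destroying the context wipes every variable/import it held -
  //    this is the call that shows up unexpectedly in the log above.
  await api("POST", "1.2/contexts/destroy", { clusterId, contextId: ctx.id });
}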

edgarcutar commented 1 year ago

Which connection manager are you using?

I am not sure how to check this, or what the possible answers are.

As I mentioned, it's hard to believe this is related to the cluster, because it works without any issue in the Databricks UI. Is there anything else you think I should look at?

gbrueckl commented 1 year ago

You can ignore the GET 2.0/clusters/list --> null - this comes from the Clusters tab, which is refreshed every 30 (or 300) seconds and has nothing to do with your notebook.

The problem is that it somehow runs the POST 1.2/contexts/destroy, which should not happen. In fact, it can only happen if you either restart/change the kernel yourself, or the notebook controller is disposed internally.

If it were the first case, you would see additional log messages (which don't appear), so it must be the second one.

It might be that .dispose() is called internally for some reason, but I am not sure what could cause this. Do you have a lot of kernels - local Python kernels and/or Databricks kernels?
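(For illustration only - not the extension's actual code - here is a hypothetical sketch of how a VS Code notebook controller that owns a remote execution context could take that context down with it when it is disposed, which would explain the unexpected contexts/destroy:)

import * as vscode from "vscode";

// Hypothetical sketch: a notebook controller that owns a remote Databricks
// execution context and tears it down when the controller is disposed.
async function destroyContext(clusterId: string, contextId: string): Promise<void> {
  // stands in for the POST 1.2/contexts/destroy call seen in the log
  console.log(`destroying context ${contextId} on cluster ${clusterId}`);
}

function createDatabricksController(clusterId: string, contextId: string): vscode.Disposable {
  const controller = vscode.notebooks.createNotebookController(
    `databricks-${clusterId}`,   // controller id
    "jupyter-notebook",          // notebook type it serves
    `Databricks ${clusterId}`    // label shown in the kernel picker
  );

  controller.executeHandler = async (_cells, _notebook, _ctrl) => {
    // would POST each cell to 1.2/commands/execute and poll 1.2/commands/status
  };

  // Whatever disposes this controller also destroys the remote context -
  // and with it every variable and import defined in the notebook.
  return new vscode.Disposable(() => {
    void destroyContext(clusterId, contextId);
    controller.dispose();
  });
}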

edgarcutar commented 1 year ago

I have only a couple of local Python kernels available, but there are plenty of clusters from Databricks (over 30) that appear as options (even though I only have proper access to 3 or 4 of them).

mohamad-ali-nasser commented 1 year ago

I have the exact same problem when working in VSCode, and I get the error after around 30 seconds.

null
Output exceeds the size limit. Open the full output data in a text editor.
Error: com.databricks.backend.common.rpc.CommandCancelledException
    at com.databricks.spark.chauffeur.SequenceExecutionState.$anonfun$cancel$3(SequenceExecutionState.scala:103)
    at scala.Option.getOrElse(Option.scala:189)
    at com.databricks.spark.chauffeur.SequenceExecutionState.$anonfun$cancel$2(SequenceExecutionState.scala:103)
    at com.databricks.spark.chauffeur.SequenceExecutionState.$anonfun$cancel$2$adapted(SequenceExecutionState.scala:100)
    at scala.collection.immutable.Range.foreach(Range.scala:158)
    at com.databricks.spark.chauffeur.Sequen

Notice the "null". If I run the cell again, the variable is no longer defined and the context is restarted.

Full error:

"name": "Error",
    "message": "com.databricks.backend.common.rpc.CommandCancelledException\n\tat com.databricks.spark.chauffeur.ExecContextState.$anonfun$shutdown$3(ExecContextState.scala:419)\n\tat scala.collection.immutable.List.flatMap(List.scala:366)\n\tat com.databricks.spark.chauffeur.ExecContextState.shutdown(ExecContextState.scala:419)\n\tat com.databricks.spark.chauffeur.ChauffeurState.shutdownContext(ChauffeurState.scala:587)\n\tat com.databricks.spark.chauffeur.ChauffeurState.$anonfun$clearAllRunningState$4(ChauffeurState.scala:556)\n\tat scala.collection.immutable.List.foreach(List.scala:431)\n\tat com.databricks.spark.chauffeur.ChauffeurState.clearAllRunningState(ChauffeurState.scala:554)\n\tat com.databricks.spark.chauffeur.ChauffeurState.processDriverStateChange(ChauffeurState.scala:357)\n\tat com.databricks.spark.chauffeur.Chauffeur.onDriverStateChange(Chauffeur.scala:918)\n\tat com.databricks.spark.chauffeur.Chauffeur.$anonfun$driverStateOpt$1(Chauffeur.scala:168)\n\tat com.databricks.spark.chauffeur.Chauffeur.$anonfun$driverStateOpt$1$adapted(Chauffeur.scala:168)\n\tat com.databricks.spark.chauffeur.DriverDaemonMonitorImpl.$anonfun$goToStopped$1(DriverDaemonMonitorImpl.scala:186)\n\tat com.databricks.spark.chauffeur.DriverDaemonMonitorImpl.$anonfun$goToStopped$1$adapted(DriverDaemonMonitorImpl.scala:186)\n\tat scala.collection.immutable.List.foreach(List.scala:431)\n\tat com.databricks.spark.chauffeur.DriverDaemonMonitorImpl.goToStopped(DriverDaemonMonitorImpl.scala:186)\n\tat com.databricks.spark.chauffeur.DriverDaemonMonitorImpl.monitorDriver(DriverDaemonMonitorImpl.scala:329)\n\tat com.databricks.spark.chauffeur.DriverDaemonMonitorImpl.$anonfun$job$1(DriverDaemonMonitorImpl.scala:70)\n\tat scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)\n\tat com.databricks.logging.UsageLogging.$anonfun$recordOperation$1(UsageLogging.scala:544)\n\tat com.databricks.logging.UsageLogging.executeThunkAndCaptureResultTags$1(UsageLogging.scala:639)\n\tat com.databricks.logging.UsageLogging.$anonfun$recordOperationWithResultTags$4(UsageLogging.scala:660)\n\tat com.databricks.logging.UsageLogging.$anonfun$withAttributionContext$1(UsageLogging.scala:401)\n\tat scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)\n\tat com.databricks.logging.AttributionContext$.withValue(AttributionContext.scala:158)\n\tat com.databricks.logging.UsageLogging.withAttributionContext(UsageLogging.scala:399)\n\tat com.databricks.logging.UsageLogging.withAttributionContext$(UsageLogging.scala:396)\n\tat com.databricks.threading.SingletonJob$SingletonJobImpl.withAttributionContext(SingletonJob.scala:335)\n\tat com.databricks.logging.UsageLogging.withAttributionTags(UsageLogging.scala:444)\n\tat com.databricks.logging.UsageLogging.withAttributionTags$(UsageLogging.scala:429)\n\tat com.databricks.threading.SingletonJob$SingletonJobImpl.withAttributionTags(SingletonJob.scala:335)\n\tat com.databricks.logging.UsageLogging.recordOperationWithResultTags(UsageLogging.scala:634)\n\tat com.databricks.logging.UsageLogging.recordOperationWithResultTags$(UsageLogging.scala:553)\n\tat com.databricks.threading.SingletonJob$SingletonJobImpl.recordOperationWithResultTags(SingletonJob.scala:335)\n\tat com.databricks.logging.UsageLogging.recordOperation(UsageLogging.scala:544)\n\tat com.databricks.logging.UsageLogging.recordOperation$(UsageLogging.scala:514)\n\tat com.databricks.threading.SingletonJob$SingletonJobImpl.recordOperation(SingletonJob.scala:335)\n\tat 
com.databricks.threading.SingletonJob$SingletonJobImpl$SingletonRun.$anonfun$run$4(SingletonJob.scala:395)\n\tat com.databricks.logging.UsageLogging.$anonfun$withAttributionContext$1(UsageLogging.scala:401)\n\tat scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)\n\tat com.databricks.logging.AttributionContext$.withValue(AttributionContext.scala:158)\n\tat com.databricks.logging.UsageLogging.withAttributionContext(UsageLogging.scala:399)\n\tat com.databricks.logging.UsageLogging.withAttributionContext$(UsageLogging.scala:396)\n\tat com.databricks.threading.SingletonJob$SingletonJobImpl.withAttributionContext(SingletonJob.scala:335)\n\tat com.databricks.threading.SingletonJob$SingletonJobImpl$SingletonRun.$anonfun$run$3(SingletonJob.scala:394)\n\tat scala.util.Try$.apply(Try.scala:213)\n\tat com.databricks.threading.SingletonJob$SingletonJobImpl$SingletonRun.$anonfun$run$1(SingletonJob.scala:393)\n\tat com.databricks.util.UntrustedUtils$.tryLog(UntrustedUtils.scala:109)\n\tat com.databricks.threading.SingletonJob$SingletonJobImpl$SingletonRun.run(SingletonJob.scala:387)\n\tat java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)\n\tat java.util.concurrent.FutureTask.run(FutureTask.java:266)\n\tat java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)\n\tat java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\n\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\n\tat java.lang.Thread.run(Thread.java:750)\n",
    "stack": "Error: com.databricks.backend.common.rpc.CommandCancelledException\n\tat com.databricks.spark.chauffeur.ExecContextState.$anonfun$shutdown$3(ExecContextState.scala:419)\n\tat scala.collection.immutable.List.flatMap(List.scala:366)\n\tat com.databricks.spark.chauffeur.ExecContextState.shutdown(ExecContextState.scala:419)\n\tat com.databricks.spark.chauffeur.ChauffeurState.shutdownContext(ChauffeurState.scala:587)\n\tat com.databricks.spark.chauffeur.ChauffeurState.$anonfun$clearAllRunningState$4(ChauffeurState.scala:556)\n\tat scala.collection.immutable.List.foreach(List.scala:431)\n\tat com.databricks.spark.chauffeur.ChauffeurState.clearAllRunningState(ChauffeurState.scala:554)\n\tat com.databricks.spark.chauffeur.ChauffeurState.processDriverStateChange(ChauffeurState.scala:357)\n\tat com.databricks.spark.chauffeur.Chauffeur.onDriverStateChange(Chauffeur.scala:918)\n\tat com.databricks.spark.chauffeur.Chauffeur.$anonfun$driverStateOpt$1(Chauffeur.scala:168)\n\tat com.databricks.spark.chauffeur.Chauffeur.$anonfun$driverStateOpt$1$adapted(Chauffeur.scala:168)\n\tat com.databricks.spark.chauffeur.DriverDaemonMonitorImpl.$anonfun$goToStopped$1(DriverDaemonMonitorImpl.scala:186)\n\tat com.databricks.spark.chauffeur.DriverDaemonMonitorImpl.$anonfun$goToStopped$1$adapted(DriverDaemonMonitorImpl.scala:186)\n\tat scala.collection.immutable.List.foreach(List.scala:431)\n\tat com.databricks.spark.chauffeur.DriverDaemonMonitorImpl.goToStopped(DriverDaemonMonitorImpl.scala:186)\n\tat com.databricks.spark.chauffeur.DriverDaemonMonitorImpl.monitorDriver(DriverDaemonMonitorImpl.scala:329)\n\tat com.databricks.spark.chauffeur.DriverDaemonMonitorImpl.$anonfun$job$1(DriverDaemonMonitorImpl.scala:70)\n\tat scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)\n\tat com.databricks.logging.UsageLogging.$anonfun$recordOperation$1(UsageLogging.scala:544)\n\tat com.databricks.logging.UsageLogging.executeThunkAndCaptureResultTags$1(UsageLogging.scala:639)\n\tat com.databricks.logging.UsageLogging.$anonfun$recordOperationWithResultTags$4(UsageLogging.scala:660)\n\tat com.databricks.logging.UsageLogging.$anonfun$withAttributionContext$1(UsageLogging.scala:401)\n\tat scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)\n\tat com.databricks.logging.AttributionContext$.withValue(AttributionContext.scala:158)\n\tat com.databricks.logging.UsageLogging.withAttributionContext(UsageLogging.scala:399)\n\tat com.databricks.logging.UsageLogging.withAttributionContext$(UsageLogging.scala:396)\n\tat com.databricks.threading.SingletonJob$SingletonJobImpl.withAttributionContext(SingletonJob.scala:335)\n\tat com.databricks.logging.UsageLogging.withAttributionTags(UsageLogging.scala:444)\n\tat com.databricks.logging.UsageLogging.withAttributionTags$(UsageLogging.scala:429)\n\tat com.databricks.threading.SingletonJob$SingletonJobImpl.withAttributionTags(SingletonJob.scala:335)\n\tat com.databricks.logging.UsageLogging.recordOperationWithResultTags(UsageLogging.scala:634)\n\tat com.databricks.logging.UsageLogging.recordOperationWithResultTags$(UsageLogging.scala:553)\n\tat com.databricks.threading.SingletonJob$SingletonJobImpl.recordOperationWithResultTags(SingletonJob.scala:335)\n\tat com.databricks.logging.UsageLogging.recordOperation(UsageLogging.scala:544)\n\tat com.databricks.logging.UsageLogging.recordOperation$(UsageLogging.scala:514)\n\tat com.databricks.threading.SingletonJob$SingletonJobImpl.recordOperation(SingletonJob.scala:335)\n\tat 
com.databricks.threading.SingletonJob$SingletonJobImpl$SingletonRun.$anonfun$run$4(SingletonJob.scala:395)\n\tat com.databricks.logging.UsageLogging.$anonfun$withAttributionContext$1(UsageLogging.scala:401)\n\tat scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)\n\tat com.databricks.logging.AttributionContext$.withValue(AttributionContext.scala:158)\n\tat com.databricks.logging.UsageLogging.withAttributionContext(UsageLogging.scala:399)\n\tat com.databricks.logging.UsageLogging.withAttributionContext$(UsageLogging.scala:396)\n\tat com.databricks.threading.SingletonJob$SingletonJobImpl.withAttributionContext(SingletonJob.scala:335)\n\tat com.databricks.threading.SingletonJob$SingletonJobImpl$SingletonRun.$anonfun$run$3(SingletonJob.scala:394)\n\tat scala.util.Try$.apply(Try.scala:213)\n\tat com.databricks.threading.SingletonJob$SingletonJobImpl$SingletonRun.$anonfun$run$1(SingletonJob.scala:393)\n\tat com.databricks.util.UntrustedUtils$.tryLog(UntrustedUtils.scala:109)\n\tat com.databricks.threading.SingletonJob$SingletonJobImpl$SingletonRun.run(SingletonJob.scala:387)\n\tat java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)\n\tat java.util.concurrent.FutureTask.run(FutureTask.java:266)\n\tat java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)\n\tat java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\n\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\n\tat java.lang.Thread.run(Thread.java:750)\n\n    at l._doExecution (c:\\Users\\mohamadalinasser\\.vscode\\extensions\\paiqo.databricks-vscode-1.4.1\\dist\\node\\extension.js:2:59031)\n    at processTicksAndRejections (node:internal/process/task_queues:96:5)\n    at l.executeHandler (c:\\Users\\mohamadalinasser\\.vscode\\extensions\\paiqo.databricks-vscode-1.4.1\\dist\\node\\extension.js:2:54338)\n    at C.$executeCells (c:\\Users\\mohamadalinasser\\AppData\\Local\\Programs\\Microsoft VS Code\\resources\\app\\out\\vs\\workbench\\api\\node\\extensionHostProcess.js:98:91037)"
}
gbrueckl commented 1 year ago

It seems like the root cause is some huge resultset/output that you are trying to display - can you give me a rough estimate of how big it is (num cols * num rows)? I may then be able to reproduce the issue on my side.

It could be that this leads to the disposal of the notebook controller and the removal of the execution context.

mohamad-ali-nasser commented 1 year ago

In this case I was running .head(), so 5 rows and maybe 5 columns, but I don't think the output size matters here: the cell can run fine and the next cell still gets the error, and the same .head() worked in the previous cell. I can even get the error when just defining a variable, where no output is expected at all. The "Output exceeds the size limit" message refers to the error itself, which was truncated, not to the expected output of the cell.

mohamad-ali-nasser commented 1 year ago

Looking at the standard error in the driver logs, the error is this:


 An error occurred while calling o2008.getResult.
: org.apache.spark.SparkException: Exception thrown in awaitResult: 
    at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:472)
    at org.apache.spark.security.SocketAuthServer.getResult(SocketAuthServer.scala:107)
    at org.apache.spark.security.SocketAuthServer.getResult(SocketAuthServer.scala:103)
    at sun.reflect.GeneratedMethodAccessor577.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
    at py4j.Gateway.invoke(Gateway.java:306)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:195)
    at py4j.ClientServerConnection.run(ClientServerConnection.java:115)
    at java.lang.Thread.run(Thread.java:750)
Caused by: org.apache.spark.SparkException: Job 75 cancelled part of cancelled job group 6483750312242770568_8616252881159588369_85583ccb-2097-4890-a081-4a619eb44532
    at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:3313)
    at org.apache.spark.scheduler.DAGScheduler.handleJobCancellation(DAGScheduler.scala:3173)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleJobGroupCancelled$4(DAGScheduler.scala:1385)
    at scala.runtime.java8.JFunction1$mcVI$sp.apply(JFunction1$mcVI$sp.java:23)
    at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
    at org.apache.spark.scheduler.DAGScheduler.handleJobGroupCancelled(DAGScheduler.scala:1384)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:3483)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:3462)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:3450)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:51)

  warn(msg)

KeyboardInterrupt

Wed Feb  8 14:39:47 2023 Connection to spark from PID  6031
Wed Feb  8 14:39:47 2023 Initialized gateway on port 41929
Wed Feb  8 14:39:47 2023 Connected to spark.
mohamad-ali-nasser commented 1 year ago

So after a couple of days of trying, it sometimes works like a charm and sometimes it restarts the context. One thing I did was make sure that the offline and online notebooks are consistent, but recently the error returned. Looking at the event log, it seems that the context gets restarted every time the cluster tries to autoscale up. I noticed this after I added some init_scripts, which you can see running every time it autoscales, which means the cluster was restarted. When using the same notebook on Azure Databricks I do not face this problem. @edgarcutar, can you check if you have the same pattern?

[screenshot]

gbrueckl commented 1 year ago

This can very well be the reason - good catch! I am just afraid that there is nothing we can do about the issue from the perspective of this VSCode extension. We are basically just using the Databricks execution APIs, and if they cannot handle autoscaling, we cannot handle it properly either.

mohamad-ali-nasser commented 1 year ago

Okay. I don't have the technical ability to check whether there is a solution for this when looking at the different scripts where autoscaling is mentioned (such as DatabricksKernel.ts). But it could be that the extension needs to constantly fetch the cluster information (new target_workers?), or that the connection is being killed from the extension side while it's waiting for a response. I am sure the API can handle autoscaling, but there might be some kind of handling error when it happens. The same behavior (the init script running) also occurs when autoscaling while using an Azure Databricks notebook, yet it doesn't lose the notebook context, so another avenue to check is where the notebook execution context is being saved and loaded (again, not sure if this is a thing). It's frustrating not to be able to use this awesome extension, as autoscaling is at the core of Databricks. If you do find a solution, please let me know. Thanks!
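(One rough way to test the autoscaling theory from outside the extension is to watch the cluster state while the notebook sits idle - a sketch using the public GET 2.0/clusters/get endpoint and the api() helper sketched earlier, purely for diagnosis and not something the extension necessarily does:)

// Watch the cluster state while the notebook is idle, to see whether a
// RESIZING/RESTARTING event coincides with the execution context being lost.
async function watchCluster(clusterId: string, seconds = 300): Promise<void> {
  for (let elapsed = 0; elapsed < seconds; elapsed += 10) {
    const info = await api("GET", `2.0/clusters/get?cluster_id=${clusterId}`);
    console.log(new Date().toISOString(), info.state); // e.g. RUNNING, RESIZING, RESTARTING
    await new Promise((resolve) => setTimeout(resolve, 10_000));
  }
}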

joshdunnlime commented 5 months ago

Hi, I am also having an issue with a cluster that disconnects. This isn't an issue with any of my other clusters. As far as I can tell, the only difference between this cluster and the other clusters is Autoscaling. It seems to be worse the longer I leave the notebook sitting there without another execution. This makes me think something like (excuse the simplistic terminology) pinging the cluster would help solve this.

Keen to hear your thoughts or feedback on this. It is really quite frustrating and would love a solution (definitely not a criticism - Databricks Power Tools is AMAZING!!).
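(If anyone wants to experiment with the "pinging" idea, a minimal keep-alive could periodically run a no-op command in the existing execution context, again reusing the api() helper sketched earlier. This is just the suggestion above written out - it would not help if the driver itself restarts during an autoscale:)

// Keep-alive sketch: run a trivial command in the existing execution context
// at a fixed interval so it never sits completely idle.
function keepAlive(clusterId: string, contextId: string, intervalMs = 60_000) {
  return setInterval(async () => {
    try {
      await api("POST", "1.2/commands/execute", {
        clusterId, contextId, language: "python", command: "1", // no-op ping
      });
    } catch (err) {
      console.warn("keep-alive ping failed:", err);
    }
  }, intervalMs);
}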

gbrueckl commented 5 months ago

I will have another look - but no promises that I can fix this, or that it can be fixed at all, using the current approach/APIs.

joshdunnlime commented 5 months ago

Yes - I totally understand and appreciate that. You can only work with what Databricks exposes! Thanks, appreciate it!