microsoft / vscode-jupyter

VS Code Jupyter extension
https://marketplace.visualstudio.com/items?itemName=ms-toolsai.jupyter
MIT License
1.28k stars 286 forks

"Show variable in Dataframe" error (only when the variable is a spark dataframe) #2322

Closed LaurentEsingle closed 3 years ago

LaurentEsingle commented 4 years ago

Bug: Notebook Editor, Interactive Window, Editor cells

Steps to cause the bug to occur

  1. Set a virtual env (Python interpreter)
  2. Specify a remote Jupyter server
  3. Open an .ipynb file
  4. Execute some code that creates a Spark dataframe
  5. In the variables pane, select the dataframe variable name and click on "Show variable in dataviewer"

Actual behavior

Nothing is shown. We get the error below instead.

Error: Failure during variable extraction:
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
in
     50 # the column names and types from the json so we match what we'll fetch when
     51 # we ask for all of the rows
---> 52 if _VSCODE_targetVariable['rowCount']:
     53     try:
     54         _VSCODE_row = _VSCODE_df.iloc[0:1]
KeyError: 'rowCount'
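The traceback points at an unguarded dictionary lookup: `_VSCODE_targetVariable['rowCount']` raises as soon as the key is absent. A minimal sketch of that failure mode and a defensive alternative follows. The function name `has_rows` and the sample dictionaries are hypothetical, and this is not presented as the extension's actual fix.

```python
def has_rows(target_variable):
    """Return True when the variable info reports at least one row.

    `target_variable` stands in for the dict indexed in the traceback;
    its real shape is an assumption made for illustration.
    """
    # dict.get returns None instead of raising KeyError when the key is absent.
    row_count = target_variable.get("rowCount")
    return bool(row_count)


# Info as it might look for a pandas DataFrame (hypothetical values).
pandas_info = {"name": "df", "rowCount": 5}
# Info lacking the key, as the KeyError suggests can happen for a Spark DataFrame.
spark_info = {"name": "dfTest"}

print(has_rows(pandas_info))
print(has_rows(spark_info))
```

With a guard like this, a missing `rowCount` is treated the same as an empty frame rather than aborting the extraction.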

Expected behavior

Show a grid pane with the rows of the Dataframe. The error seems to happen with Spark Dataframes only (Pandas Dataframes are OK).
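One way a viewer could treat both kinds of dataframe uniformly is to convert a capped number of Spark rows to pandas before display. The sketch below is only an illustration of that idea: `preview_rows` and its `limit` default are hypothetical names, and the check relies on duck typing (`pyspark.sql.DataFrame` exposes `limit` and `toPandas`, while `pandas.DataFrame` exposes `iloc`).

```python
def preview_rows(df, limit=100):
    """Return up to `limit` rows of `df` in a pandas-style form for display.

    Hypothetical helper: assumes `df` is either a pyspark.sql.DataFrame
    or a pandas.DataFrame.
    """
    if hasattr(df, "toPandas"):
        # Spark: cap the row count first so only `limit` rows are collected
        # to the driver, then convert to a pandas DataFrame.
        return df.limit(limit).toPandas()
    # pandas: a positional slice already returns a DataFrame.
    return df.iloc[:limit]
```

Capping before conversion matters because `toPandas` collects the data into local memory.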

Your Jupyter and/or Python environment

Please provide as much info as you readily know

Developer Tools Console Output

     52 if _VSCODE_targetVariable['rowCount']:
     53     try:
     54         _VSCODE_row = _VSCODE_df.iloc[0:1]
KeyError: 'rowCount'
t.log @ console.ts:137
$logExtensionHostMessage @ mainThreadConsole.ts:39
_doInvokeHandler @ rpcProtocol.ts:398
_invokeHandler @ rpcProtocol.ts:383
_receiveRequest @ rpcProtocol.ts:299
_receiveOneMessage @ rpcProtocol.ts:226
(anonymous) @ rpcProtocol.ts:101
fire @ event.ts:581
fire @ ipc.net.ts:453
_receiveMessage @ ipc.net.ts:733
(anonymous) @ ipc.net.ts:592
fire @ event.ts:581
acceptChunk @ ipc.net.ts:239
(anonymous) @ ipc.net.ts:200
t @ ipc.net.ts:28
emit @ events.js:200
addChunk @ _stream_readable.js:294
readableAddChunk @ _stream_readable.js:275
Readable.push @ _stream_readable.js:210
onStreamRead @ internal/stream_base_commons.js:166

notificationsAlerts.ts:40 Error: Failure during variable extraction:
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
in
     50 # the column names and types from the json so we match what we'll fetch when
     51 # we ask for all of the rows
---> 52 if _VSCODE_targetVariable['rowCount']:
     53     try:
     54         _VSCODE_row = _VSCODE_df.iloc[0:1]
KeyError: 'rowCount'
onDidNotificationChange @ notificationsAlerts.ts:40
(anonymous) @ notificationsAlerts.ts:26
fire @ event.ts:581
addNotification @ notifications.ts:171
notify @ notificationService.ts:101
(anonymous) @ mainThreadMessageService.ts:83
_showMessage @ mainThreadMessageService.ts:44
$showMessage @ mainThreadMessageService.ts:38
_doInvokeHandler @ rpcProtocol.ts:398
_invokeHandler @ rpcProtocol.ts:383
_receiveRequest @ rpcProtocol.ts:299
_receiveOneMessage @ rpcProtocol.ts:226
(anonymous) @ rpcProtocol.ts:101
fire @ event.ts:581
fire @ ipc.net.ts:453
_receiveMessage @ ipc.net.ts:733
(anonymous) @ ipc.net.ts:592
fire @ event.ts:581
acceptChunk @ ipc.net.ts:239
(anonymous) @ ipc.net.ts:200
t @ ipc.net.ts:28
emit @ events.js:200
addChunk @ _stream_readable.js:294
readableAddChunk @ _stream_readable.js:275
Readable.push @ _stream_readable.js:210
onStreamRead @ internal/stream_base_commons.js:166

console.ts:137 [Extension Host] Info Python Extension: 2020-02-05 19:38:54: Cached data exists getEnvironmentVariables, tasks
console.ts:137 [Extension Host] Info Python Extension: 2020-02-05 19:38:58: Cached data exists getEnvironmentVariables, tasks

Microsoft Data Science for VS Code Engineering Team: @rchiodo, @IanMatthewHuff, @DavidKutu, @DonJayamanne, @greazer

rchiodo commented 4 years ago

This looks to be a bug in our analysis of the DF.

Could you possibly include some sample code that repros?

Thanks

LaurentEsingle commented 4 years ago

Example:

dfTest = spark.createDataFrame( [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)], ("id", "v"))

dfTest.show()

+---+----+
| id|   v|
+---+----+
|  1| 1.0|
|  1| 2.0|
|  2| 3.0|
|  2| 5.0|
|  2|10.0|
+---+----+

But when I open the "Variables" tab and click on "Show variable on data viewer" I get the error mentioned earlier. It happens only with Spark Dataframes.

rchiodo commented 4 years ago

Note, spark requires a JVM to run.

rchiodo commented 4 years ago

To run this, I used WSL:

  1. conda create jupyter py37 environment
  2. conda install pyspark
  3. sudo apt install openjdk-8-jdk
  4. sudo update-alternatives --config java
  5. copy the jdk8 directory
  6. set JAVA_HOME in /etc/environment to JDK 8
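Before trying the repro, it can help to confirm that a Java runtime is discoverable from the Python environment, since the setup above hinges on `JAVA_HOME`. The following is a rough sketch only: `find_java` is a hypothetical helper, and it assumes the conventional `bin/java` layout under `JAVA_HOME` with a fallback to `PATH`.

```python
import os
import shutil


def find_java(env=None):
    """Return a path to a java executable, or None if none is found.

    Hypothetical helper: checks $JAVA_HOME/bin/java first, then PATH.
    """
    env = os.environ if env is None else env
    java_home = env.get("JAVA_HOME")
    if java_home:
        candidate = os.path.join(java_home, "bin", "java")
        if os.path.isfile(candidate):
            return candidate
    # Fall back to searching the PATH from the same environment mapping.
    return shutil.which("java", path=env.get("PATH"))


if __name__ == "__main__":
    print(find_java() or "No Java runtime found; check JAVA_HOME")
```

Running this inside the same conda environment as the notebook kernel shows whether the `JAVA_HOME` setting was picked up.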

IanMatthewHuff commented 4 years ago

Validated.