microsoft / vscode-data-wrangler

Pyspark Dataframe Support #255

Closed 0xbadidea closed 2 months ago

0xbadidea commented 3 months ago

Environment data

Expected behaviour

I'm able to view a Pyspark Dataframe in the Data Viewer.

Actual behaviour

Error message "Could not retrieve variable df_config from the Jupyter extension. Please file an issue on the Data Wrangler GitHub repository." appears instead.

Steps to reproduce:

Debug a PySpark application in VS Code. In the Variables panel, right-click a DataFrame and select "View Value in Data Viewer".

Output for Jupyter in the Output panel (View → Output, then change the drop-down in the upper-right of the Output panel to Jupyter)

```
Visual Studio Code (1.92.1, dev-container, desktop)
Jupyter Extension Version: 2024.7.0.
Python Extension Version: 2024.12.2.
Pylance Extension Version: 2024.8.1.
Platform: linux (x64).
Workspace folder /workspaces/cubie, Home = /root
06:59:27.531 [info] Ending debug session 2a373920-e22a-4944-ade5-4db74f7bc070
06:59:27.533 [info] Ending debug session 2a373920-e22a-4944-ade5-4db74f7bc070
```

vaoville commented 2 months ago

Hi, I am having a similar issue.

pwang347 commented 2 months ago

Pasting in some more context from another issue: We don't currently support loading PySpark variables but the Jupyter launch button shows it as something that can be launched because the type name happens also to be "DataFrame". We plan to both make the error message more clear as well as investigate the feasibility of PySpark support here.
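To illustrate the name collision described above: both pandas and PySpark define a class called `DataFrame`, so a check on the type name alone cannot tell them apart, while the defining module can. The following is only a sketch of that idea; the helper `dataframe_flavor` and the stand-in classes are hypothetical and are not how Data Wrangler or the Jupyter extension actually detect types.

```python
def dataframe_flavor(obj):
    """Guess which library a DataFrame-like object comes from.

    Hypothetical helper for illustration only.
    """
    cls = type(obj)
    # Both pandas and PySpark use the class name "DataFrame".
    if cls.__name__ != "DataFrame":
        return "not a DataFrame"
    # The top-level package of the defining module disambiguates them.
    top_level = cls.__module__.split(".")[0]
    if top_level in ("pandas", "pyspark"):
        return top_level
    return "unknown"


# Stand-in classes so the sketch runs without pandas or PySpark installed.
FakePandasDF = type("DataFrame", (), {"__module__": "pandas.core.frame"})
FakeSparkDF = type("DataFrame", (), {"__module__": "pyspark.sql.dataframe"})

print(dataframe_flavor(FakePandasDF()))  # pandas
print(dataframe_flavor(FakeSparkDF()))   # pyspark
```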

Thanks!

0xbadidea commented 2 months ago

Looking forward to Pyspark dataframe support in the future! Thank you for your reply.

edfreeman commented 2 months ago

Hi @pwang347!

I've arrived here after learning of the deprecation of the built-in Data Viewer.

The old Data Viewer supported viewing PySpark DataFrames (I've used it for years). It's a bit frustrating that the only replacement that's now offered (Data Wrangler) doesn't at least have parity with the now-deprecated built-in viewer.

We'd rather not have to add lines like x = df.toPandas() around our code wherever we want to use the data viewer. So the only real alternative that I can think of at the moment is to use the debug console with df.show().
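For reference, the `toPandas()` workaround mentioned above could be wrapped in a small helper along these lines. This is a minimal sketch: the name `preview_as_pandas` and the `max_rows` default are illustrative assumptions, and it relies only on the PySpark `DataFrame.limit` and `DataFrame.toPandas` methods.

```python
def preview_as_pandas(spark_df, max_rows=1000):
    """Return a pandas copy of at most max_rows rows of a PySpark DataFrame.

    Hypothetical helper; max_rows=1000 is an arbitrary default.
    """
    # limit() bounds how many rows are collected to the driver;
    # toPandas() then materializes them as a pandas DataFrame.
    return spark_df.limit(max_rows).toPandas()
```

Capping the row count first keeps the conversion from pulling an entire large dataset onto the driver just to inspect it.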

In the meantime, it might be nice if the Data Viewer supported expressions in the "Watch" area of the VS Code Debug tab - then we could just have df.toPandas() defined as a Watch expression and right-click->View it from there. Still a bit painful, but maybe a happy middle-ground depending on the complexity?

Thanks! Ed

pwang347 commented 2 months ago

Hi all, thank you for chiming in and for upvoting this feature!

We've just added PySpark support in pre-release 1.9.1 and it currently requires enabling a feature flag as seen in the image below:

[image: screenshot of the feature flag setting]

It would be great if you could give this a try and let us know if you have any feedback. If you're also interested in seeing support for watch expressions, please feel free to check out / upvote https://github.com/microsoft/vscode-data-wrangler/issues/220 as well. Thanks!