microsoft / vscode-data-wrangler


Support for DataFrame type in Data Wrangler #309

Closed ReemAbdelazim-md closed 3 weeks ago

ReemAbdelazim-md commented 1 month ago

I am currently trying to view some tables that were changed in the local variables in the debugger. A very simple example: when this part of the code runs, "combined_df.join", and I right-click and choose "View Value in Data Viewer", I get the following error: (screenshot). Can I get some guidance on what my next steps should be? Most issues I have read indicate that you can view your tables from the debugger, but for some reason it is not working on my end. Note: I have both the Python and Jupyter extensions installed.

pwang347 commented 1 month ago

Hi @ReemAbdelazim-md, thank you for reporting this!

Could you please provide some more information:

ReemAbdelazim-md commented 1 month ago

I am trying to pull data using Databricks Connect, manipulate it, and then view it using PySpark. The DataFrame type that I am using is pyspark.sql.
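Roughly, my flow looks like this (a sketch; the table and column names here are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
orders_df = spark.read.table("orders")          # illustrative table name
customers_df = spark.read.table("customers")    # illustrative table name
combined_df = orders_df.join(customers_df, on="customer_id", how="inner")
combined_df  # <- right-click in the debugger and choose "View Value in Data Viewer"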

pwang347 commented 1 month ago

Hi @ReemAbdelazim-md, thanks for the response! PySpark currently requires enabling this setting (we plan to remove this soon): (screenshot of the setting)

Could you please enable this and try again?

ReemAbdelazim-md commented 1 month ago

I had it enabled, and still the same problem.

pwang347 commented 1 month ago

Hmm, that's strange that the setting has no effect for you.

Here are my recommendations for what to try next:

import pyspark
ver = pyspark.__version__
print(ver)
is_pyspark = isinstance(df, pyspark.sql.dataframe.DataFrame) # replace 'df' with your DataFrame variable
print(is_pyspark)

Also just to be sure, is this what you are trying?

(screenshot)

The above loads correctly for me when I have the setting enabled, but with it disabled I get the same error message as you:

(screenshot)

Here is the code below; does it also work for you?

from datetime import datetime, date

import pandas as pd
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([
    Row(a=1, b=2., c='string1', d=date(2000, 1, 1), e=datetime(2000, 1, 1, 12, 0)),
    Row(a=2, b=3., c='string2', d=date(2000, 2, 1), e=datetime(2000, 1, 2, 12, 0)),
    Row(a=4, b=5., c='string3', d=date(2000, 3, 1), e=datetime(2000, 1, 3, 12, 0))
])
df # <- breakpoint here
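
(To test it: set a breakpoint on the last line, start debugging, and when execution pauses, right-click df in the Variables view and choose "View Value in Data Viewer".)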
ReemAbdelazim-md commented 1 month ago

Running the code above didn't even pass in my debugger, as shown: (screenshot)

pwang347 commented 1 month ago

Could you please let me know what is the output of this code?

import pyspark
ver = pyspark.__version__
print(ver)
is_pyspark = isinstance(df, pyspark.sql.dataframe.DataFrame) # replace 'df' with your DataFrame variable
print(is_pyspark)
ReemAbdelazim-md commented 1 month ago

For the following code:

import pytest
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
from datetime import datetime, date
from pyspark.sql import Row
import pyspark

ver = pyspark.__version__
print(ver)

df = spark.createDataFrame([
    Row(a=1, b=2., c='string1', d=date(2000, 1, 1), e=datetime(2000, 1, 1, 12, 0)),
    Row(a=2, b=3., c='string2', d=date(2000, 2, 1), e=datetime(2000, 1, 2, 12, 0)),
    Row(a=4, b=5., c='string3', d=date(2000, 3, 1), e=datetime(2000, 1, 3, 12, 0))
])

is_pyspark = isinstance(df, pyspark.sql.dataframe.DataFrame) # replace 'df' with your DataFrame variable
print(is_pyspark)

if __name__ == "__main__":
    pytest.main([__file__])

This is the output:

(screenshot)

I don't know if this will make sense, but I have a spark engine module that I import in order to start my Spark sessions, as follows:

from databricks.connect import DatabricksSession
from pyspark.sql import SparkSession


def create_spark_engine() -> SparkSession:
    """
    Creates and returns a SparkSession instance using Databricks Connect.

    This function initializes a DatabricksSession, which is required when running
    Spark applications with Databricks Connect.

    :return: A DatabricksSession object to be used for Spark operations.
    :rtype: SparkSession
    """
    return DatabricksSession.builder.getOrCreate()


# Initialize Spark session
spark: SparkSession = create_spark_engine()
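
In my scripts I then import it with from src.spark_engine import spark.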

pwang347 commented 1 month ago

Hi @ReemAbdelazim-md, I meant: could you please run only this code (and not any of the other code)? You will need to replace "df" with the DataFrame you are trying to load:

import pyspark
ver = pyspark.__version__
print(ver)
is_pyspark = isinstance(df, pyspark.sql.dataframe.DataFrame) # replace 'df' with your DataFrame variable
print(is_pyspark)

But I see that you are using Databricks, so I will test to see if this is caused by the class having a different path there. I will update the thread when I get a chance to try it out. Thanks!

pwang347 commented 1 month ago

Hi @ReemAbdelazim-md, I was able to try running the code in Databricks and it outputs what I expected. However, I wasn't able to try it with Databricks Connect with my compute settings. Here is my output from running the above:

3.5.0
True

Could you please confirm the following two things:

  1. Can you try loading a regular variable such as l = [1,2,3]? If not, then it may be that Databricks Connect in general is not supported, in which case we may need to investigate further.
  2. Can you please share the output from running the code I shared in the previous comment? It would be great to know which version of PySpark you are on, and whether there are any issues with the type check.

Thanks!

ReemAbdelazim-md commented 1 month ago

I was able to run the following code:

import pytest
from datetime import datetime, date
from pyspark.sql import Row
from src.spark_engine import spark

df = spark.createDataFrame([
    Row(a=1, b=2., c='string1', d=date(2000, 1, 1), e=datetime(2000, 1, 1, 12, 0)),
    Row(a=2, b=3., c='string2', d=date(2000, 2, 1), e=datetime(2000, 1, 2, 12, 0)),
    Row(a=4, b=5., c='string3', d=date(2000, 3, 1), e=datetime(2000, 1, 3, 12, 0))
])

if __name__ == "__main__":
    pytest.main([__file__])

I can view the data if I convert it into pandas, but not when it is a Spark DataFrame. Would that be a Spark issue? I am noticing that most of the highlighted parts in the debug console are the Spark parts.
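
For now, my workaround looks roughly like this (the 1000-row cap is just an illustrative safeguard):

# Convert to pandas so the Data Viewer can display it; toPandas() collects
# all rows to the driver, so cap the size first on large tables.
pdf = df.limit(1000).toPandas()
pdf  # <- view this variable in the Data Viewer instead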

pwang347 commented 1 month ago

What version of pyspark are you using? We check the type using the isinstance code above, so if that check doesn't pass on your version, we might need to update our detection logic.
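
For reference, the check is conceptually along these lines (an illustrative sketch, not our exact implementation):

import pyspark.sql.dataframe

def looks_like_pyspark_df(obj) -> bool:
    # Sketch of the detection described above; the real extension logic may differ.
    return isinstance(obj, pyspark.sql.dataframe.DataFrame)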

ReemAbdelazim-md commented 1 month ago

For some reason, pyspark --version is showing nothing.

ReemAbdelazim-md commented 1 month ago

For reference, I was able to get the Databricks extension working, so now I am using the extension instead.

kycutler commented 1 month ago

Hi @ReemAbdelazim-md, thank you for the information and your cooperation.

Will you please run the following code using your dataframe variable and share the exact output:

print(type(df))

It should look something like <class 'pyspark._________________.DataFrame'>. Looking to confirm that your data is the type we expect, as PySpark actually has multiple classes of DataFrame internally.
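
For reference, the two classes most relevant here are (the annotations are illustrative):

print(type(df))
# <class 'pyspark.sql.dataframe.DataFrame'>          -> regular SparkSession
# <class 'pyspark.sql.connect.dataframe.DataFrame'>  -> Spark Connect / Databricks Connect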

Rooth123 commented 1 month ago

Hi @kycutler, I am running into the same issue. I am using the Databricks extension (which I believe uses Databricks Connect v2). Running your code on the DataFrame I am trying to view, I get this:

3.5.0
False
<class 'pyspark.sql.connect.dataframe.DataFrame'>

It seems the DataFrame is of the "connect" class?

ReemAbdelazim-md commented 1 month ago

Mine is: <class 'pyspark.sql.connect.dataframe.DataFrame'>

pwang347 commented 3 weeks ago

Hi @Rooth123, @ReemAbdelazim-md - we've added support for pyspark.sql.connect.dataframe.DataFrame in pre-release 1.13.1. Note that you will still need this setting enabled: (screenshot)

Please feel free to try it out, and let us know if you run into any issues. Thanks!