Hi @ReemAbdelazim-md, thank you for reporting this!
Could you please help provide some more information: what type of DataFrame are you using, and does the data viewer work for you with a simple variable like this?
l = [1,2,3]
I am trying to pull data using Databricks Connect, manipulate it, and then view it using PySpark. The DataFrame type that I am using is the one from pyspark.sql.
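Roughly, my workflow is something like this (a sketch; the table name is just an example, not my real data):

from databricks.connect import DatabricksSession

spark = DatabricksSession.builder.getOrCreate()

# Pull data from a catalog table via Databricks Connect (example table)
df = spark.read.table("samples.nyctaxi.trips")

# Manipulate it with the usual PySpark DataFrame API
df = df.filter(df.trip_distance > 1.0)

df  # <- this is the value I then try to open in the data viewer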
Hi @ReemAbdelazim-md, thanks for the response! PySpark currently requires this setting to be enabled (we plan to remove this requirement soon):
Could you please enable it and try again?
I had it enabled, and still the same problem.
Hmm, that's strange that the setting has no effect for you.
Here's what I'd recommend trying next:
import pyspark
ver = pyspark.__version__
print(ver)
is_pyspark = isinstance(df, pyspark.sql.dataframe.DataFrame) # replace 'df' with your DataFrame variable
print(is_pyspark)
Also, just to be sure, is this what you are trying?
The above loads correctly for me when I have the setting enabled; with it disabled, I get the same error message as you:
Here is the code below; does it also work for you?
from datetime import datetime, date

import pandas as pd
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([
Row(a=1, b=2., c='string1', d=date(2000, 1, 1), e=datetime(2000, 1, 1, 12, 0)),
Row(a=2, b=3., c='string2', d=date(2000, 2, 1), e=datetime(2000, 1, 2, 12, 0)),
Row(a=4, b=5., c='string3', d=date(2000, 3, 1), e=datetime(2000, 1, 3, 12, 0))
])
df # <- breakpoint here
Running the code above didn't even pass in my debugger, as shown below:
Could you please let me know what is the output of this code?
import pyspark
ver = pyspark.__version__
print(ver)
is_pyspark = isinstance(df, pyspark.sql.dataframe.DataFrame) # replace 'df' with your DataFrame variable
print(is_pyspark)
For the following code:

import pytest
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

from datetime import datetime, date
from pyspark.sql import Row
import pyspark

ver = pyspark.__version__
print(ver)

df = spark.createDataFrame([
    Row(a=1, b=2., c='string1', d=date(2000, 1, 1), e=datetime(2000, 1, 1, 12, 0)),
    Row(a=2, b=3., c='string2', d=date(2000, 2, 1), e=datetime(2000, 1, 2, 12, 0)),
    Row(a=4, b=5., c='string3', d=date(2000, 3, 1), e=datetime(2000, 1, 3, 12, 0))
])

is_pyspark = isinstance(df, pyspark.sql.dataframe.DataFrame)  # replace 'df' with your DataFrame variable
print(is_pyspark)

if __name__ == "__main__":
    pytest.main([__file__])
This is the output:
I don't know if this will make sense, but I have a spark_engine module that I import in order to start my Spark sessions, as follows:
from databricks.connect import DatabricksSession
from pyspark.sql import SparkSession


def create_spark_engine() -> SparkSession:
    """
    Creates and returns a SparkSession instance using Databricks Connect.

    This function initializes a DatabricksSession, which is required when running
    Spark applications with Databricks Connect.

    :return: A DatabricksSession object to be used for Spark operations.
    :rtype: SparkSession
    """
    return DatabricksSession.builder.getOrCreate()


spark: SparkSession = create_spark_engine()
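Everywhere else in the project I then import that shared session (the module lives at src/spark_engine.py in my repo), for example:

# Import the shared session created by create_spark_engine() above
from src.spark_engine import spark

# Any DataFrame built from this session goes through Databricks Connect
df = spark.range(5)
df.show()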
Hi @ReemAbdelazim-md, I meant: could you run only this code (and not any of the other code)? You will need to replace "df" with the DataFrame you are trying to load:
import pyspark
ver = pyspark.__version__
print(ver)
is_pyspark = isinstance(df, pyspark.sql.dataframe.DataFrame) # replace 'df' with your DataFrame variable
print(is_pyspark)
But I see that you are using Databricks, so I will test whether this is caused by the DataFrame class having a different module path there. Will update the thread when I get a chance to try it out. Thanks!
Hi @ReemAbdelazim-md, I was able to run the code in Databricks and it outputs what I expected. However, I wasn't able to try it with Databricks Connect under my compute settings. Here is my output from running the above:
3.5.0
True
Could you please confirm the following two things: that you have the setting enabled, and that the data viewer works for a simple variable like
l = [1,2,3]
? If not, then it may be that Databricks Connect in general is not supported, in which case we may need more investigation. Thanks!
I was able to run the following code:

import pytest
from datetime import datetime, date
from pyspark.sql import Row
from src.spark_engine import spark

df = spark.createDataFrame([
    Row(a=1, b=2., c='string1', d=date(2000, 1, 1), e=datetime(2000, 1, 1, 12, 0)),
    Row(a=2, b=3., c='string2', d=date(2000, 2, 1), e=datetime(2000, 1, 2, 12, 0)),
    Row(a=4, b=5., c='string3', d=date(2000, 3, 1), e=datetime(2000, 1, 3, 12, 0))
])

if __name__ == "__main__":
    pytest.main([__file__])
I can view the data if I convert it to pandas, but not when it is a Spark DataFrame. Could this be a Spark issue? I notice that most of the highlighted frames in the debug console are in the Spark code.
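Concretely, the workaround that does render for me looks like this (df being the Spark DataFrame I am trying to view):

# Viewing `df` directly in the data viewer fails for me,
# but converting it to pandas first works fine
pdf = df.toPandas()
pdf  # <- this opens in the data viewer without errors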
What version of PySpark are you using? We check the type using the isinstance code above, so if that check doesn't pass on your version, we might need to update our detection logic.
For some reason, pyspark --version shows nothing.
For reference, I was able to get the Databricks extension working, so now I am using the extension instead.
Hi @ReemAbdelazim-md, thank you for the information and your cooperation.
Will you please run the following code using your DataFrame variable and share the exact output:
print(type(df))
It should look something like <class 'pyspark._________________.DataFrame'>. I'm looking to confirm that your data is the type we expect, as PySpark actually has multiple DataFrame classes internally.
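For reference, a type check that covers both classes might look something like this (a sketch; the pyspark.sql.connect module path only exists on PySpark 3.4+):

def is_spark_dataframe(obj) -> bool:
    """Return True for both the classic and the Spark Connect DataFrame classes."""
    from pyspark.sql.dataframe import DataFrame as ClassicDataFrame

    candidates = [ClassicDataFrame]
    try:
        # The connect module is only present on PySpark 3.4+
        from pyspark.sql.connect.dataframe import DataFrame as ConnectDataFrame
        candidates.append(ConnectDataFrame)
    except ImportError:
        pass
    return isinstance(obj, tuple(candidates))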
Hi @kycutler, I am seeing the same issue. I'm using the Databricks extension (which I believe uses Databricks Connect v2). Running your code on the DataFrame I am trying to view, I get this:
3.5.0
False
<class 'pyspark.sql.connect.dataframe.DataFrame'>
It seems the DataFrame is of the "connect" class?
Mine is: <class 'pyspark.sql.connect.dataframe.DataFrame'>
Hi @Rooth123, @ReemAbdelazim-md - we've added support for pyspark.sql.connect.dataframe.DataFrame in pre-release 1.13.1. Note that you will still need this setting enabled.
Please feel free to try it out, and let us know if you run into any issues. Thanks!
I am currently trying to view some tables that were changed in the local variables in the debugger. A very simple example is when this part of the code runs, combined_df.join, and I right-click and choose "View Value in Data Viewer"; I get the following error:

Can I get some guidance on what my next steps should be? Most issues I've read show that you can view your tables from the debugger, but for some reason it is not working on my end.

Note: I have both the Python and Jupyter extensions installed.
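For context, a minimal sketch of the kind of statement I mean (the data and column names here are made up):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical inputs; the real code joins two much larger DataFrames
orders_df = spark.createDataFrame([(1, "book"), (2, "pen")], ["customer_id", "item"])
customers_df = spark.createDataFrame([(1, "Ann"), (2, "Bo")], ["customer_id", "name"])

combined_df = orders_df.join(customers_df, on="customer_id", how="left")
combined_df  # <- right-clicking this in the debugger and choosing "View Value in Data Viewer" fails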