benhyy opened this issue 4 years ago
Yeah, the issue is that Spark's toJSON() omits null values.
One solution is to have the first line of the result be the list of column names. This is annoying to do because sqlquery.py has three different implementations, for Python, Scala, and R, but it's definitely possible.
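A minimal sketch of that "column names as first line" idea, using only the standard library. The function names here are illustrative, not sparkmagic's actual API: the kernel side would emit the column list as the first line, and the client side would use it to restore keys that toJSON() dropped.

```python
import json

def emit_with_header(columns, json_rows):
    """Hypothetical kernel-side helper: prepend the column list so the
    client can restore keys that toJSON() omitted for null values."""
    return [json.dumps(columns)] + json_rows

def parse_with_header(lines):
    """Hypothetical client-side helper: rebuild full records, filling
    omitted keys with None."""
    columns = json.loads(lines[0])
    records = []
    for line in lines[1:]:
        row = json.loads(line)
        records.append({c: row.get(c) for c in columns})
    return records

# toJSON() dropped "b" because it was null; the header restores it:
payload = emit_with_header(["a", "b"], ['{"a":1}'])
print(parse_with_header(payload))  # [{'a': 1, 'b': None}]
```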
Example of the problem:
Input:
```
%%spark
d = spark.sql("select * from example")
d.show()
print(d.toJSON().collect())
d.columns
```
Output:
```
+---+----+
|  a|   b|
+---+----+
|  1|null|
+---+----+

[u'{"a":1}']
['a', 'b']
```
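To see why this loses a column without needing a Spark cluster, here is a pure-Python sketch of what the client effectively does with the collected JSON rows (assumed behavior: parse each row and take the union of keys as the columns):

```python
import json

# What the client effectively receives from toJSON().collect():
json_rows = ['{"a":1}']  # "b" was null, so toJSON() omitted the key

records = [json.loads(r) for r in json_rows]

# Union of keys across all records -- column "b" is unrecoverable:
seen_columns = sorted({k for rec in records for k in rec})
print(seen_columns)  # ['a'] -- but the real schema is ['a', 'b']
```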
The code that needs fixing is `_pyspark_command`, `_scala_command`, and `_r_command` in sparkmagic/sparkmagic/livyclientlib/sqlquery.py. @benhyy any interest in submitting a PR?
If we ever get #598 in, we can drop two of the languages.
As a third alternative, I seem to remember Livy has some SQL support too? But that would presumably not let us fix this bug (although perhaps Livy has fixed it).
Hello, is there any more news on this? I'm currently facing the same problem. Thanks.
If a column has a null value in every row/record, %%sql will drop that entire column.
To reproduce, create a table where a column has only null values, e.g. `%%sql insert into table values (1, null), (2, null), (3, null)`
I have attached screenshots using results from %%sql and spark.sql()
Screen Shot 2019-12-26 at 2.50.52 pm.pdf
Additional context: I believe the problem comes from the fact that, since toJSON() omits null-valued fields, when the data is converted into dicts and then into a dataframe, there is no way to know that a column is missing:
https://github.com/jupyter-incubator/sparkmagic/blob/master/sparkmagic/sparkmagic/utils/utils.py#L52 https://github.com/jupyter-incubator/sparkmagic/blob/master/sparkmagic/sparkmagic/livyclientlib/sqlquery.py#L58
We need a way to pick up the schema before populating all the data.
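One hedged sketch of "schema first": if the client fetched the column list in a separate statement (e.g. via `d.columns`, as shown in the repro output above) before parsing the collected rows, it could rebuild complete rows with explicit Nones. The helper below is hypothetical, not sparkmagic code:

```python
import json

def records_to_rows(columns, json_rows):
    """Given a schema fetched up front, rebuild rows with explicit None
    for fields that toJSON() silently dropped."""
    parsed = [json.loads(r) for r in json_rows]
    return [[rec.get(c) for c in columns] for rec in parsed]

# Schema fetched separately (e.g. from d.columns), then the row data:
print(records_to_rows(["a", "b"], ['{"a":1}']))  # [[1, None]]
```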