benhyy opened this issue 4 years ago
Yeah, the issue is that Spark's toJSON() omits null values.
One solution is to have the first line of the result be the list of column names. This is annoying to do because sqlquery.py has three different implementations, for Python, Scala, and R, but it's definitely possible.
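A minimal sketch of that "column names as first line" idea, using only the standard library. The function names here are illustrative, not sparkmagic's actual API: the kernel side would emit the column list as the first line, and the client side would use it to restore keys that toJSON() dropped.

```python
import json

def emit_with_header(columns, json_rows):
    """Hypothetical kernel-side helper: prepend the column list so the
    client can restore keys that toJSON() omitted for null values."""
    return [json.dumps(columns)] + json_rows

def parse_with_header(lines):
    """Hypothetical client-side helper: rebuild full records, filling
    omitted keys with None."""
    columns = json.loads(lines[0])
    records = []
    for line in lines[1:]:
        row = json.loads(line)
        records.append({c: row.get(c) for c in columns})
    return records

# toJSON() dropped "b" because it was null; the header restores it:
payload = emit_with_header(["a", "b"], ['{"a":1}'])
print(parse_with_header(payload))  # [{'a': 1, 'b': None}]
```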
Example of the problem:
Input:
```
%%spark
d = spark.sql("select * from example")
d.show()
print(d.toJSON().collect())
d.columns
```
Output:
```
+---+----+
|  a|   b|
+---+----+
|  1|null|
+---+----+

[u'{"a":1}']
['a', 'b']
```
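To see why this loses a column without needing a Spark cluster, here is a pure-Python sketch of what the client effectively does with the collected JSON rows (assumed behavior: parse each row and take the union of keys as the columns):

```python
import json

# What the client effectively receives from toJSON().collect():
json_rows = ['{"a":1}']  # "b" was null, so toJSON() omitted the key

records = [json.loads(r) for r in json_rows]

# Union of keys across all records -- column "b" is unrecoverable:
seen_columns = sorted({k for rec in records for k in rec})
print(seen_columns)  # ['a'] -- but the real schema is ['a', 'b']
```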
The code that needs fixing is `_pyspark_command`, `_scala_command`, and `_r_command` in sparkmagic/sparkmagic/livyclientlib/sqlquery.py. @benhyy any interest in submitting a PR?
If we ever get #598 in, we can drop two of the languages.
As a third alternative, I seem to remember Livy has some SQL support too? But that would presumably not let us fix this bug (although perhaps Livy has fixed it).
Hello, is there any more news on this? I'm currently facing the same problem. Thanks.
If a column has a null value in every row/record, %%sql will drop that entire column.
To reproduce, create a table where a column has only null values, e.g. `%%sql insert into table values (1, null), (2, null), (3, null)`
I have attached screenshots using results from %%sql and spark.sql()
Screen Shot 2019-12-26 at 2.50.52 pm.pdf
Additional context: I believe the problem comes from the fact that, since toJSON() omits null-valued fields, when the data is converted into dicts and then into a dataframe, there is no way to know that a column is missing:
https://github.com/jupyter-incubator/sparkmagic/blob/master/sparkmagic/sparkmagic/utils/utils.py#L52 https://github.com/jupyter-incubator/sparkmagic/blob/master/sparkmagic/sparkmagic/livyclientlib/sqlquery.py#L58
We need a way to pick up the schema before populating all the data.
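One hedged sketch of "schema first": if the client fetched the column list in a separate statement (e.g. via `d.columns`, as shown in the repro output above) before parsing the collected rows, it could rebuild complete rows with explicit Nones. The helper below is hypothetical, not sparkmagic code:

```python
import json

def records_to_rows(columns, json_rows):
    """Given a schema fetched up front, rebuild rows with explicit None
    for fields that toJSON() silently dropped."""
    parsed = [json.loads(r) for r in json_rows]
    return [[rec.get(c) for c in columns] for rec in parsed]

# Schema fetched separately (e.g. from d.columns), then the row data:
print(records_to_rows(["a", "b"], ['{"a":1}']))  # [[1, None]]
```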