Open mkrasmus opened 3 years ago
This appears to be by design. When you print the schema, whether it was inferred or explicitly defined, the nullable property reflects what Spark determined from the data source itself, not what you declared. References here and here.
Another workaround, according to this JIRA ticket, is to read into an RDD instead of a DataFrame in order to properly apply the nullability of the schema. Using RDDs instead of DataFrames has its own downsides, and this workaround also results in null values being replaced with defaults (e.g. 0 instead of null for integers).
The general consensus appears to be that you should not rely on enforcing nullability in the schema, but should allow all values/non-values and then use methods like df.na.drop() or fillna() to handle null values, or throw an exception during processing if a value is invalid.
All well and good, but consider the notebook in terms of a learning resource.
Fair point. I should just modify the schema to set nullable to True, due to this behavior.
Or just remove the content about that option/argument for setting nullable to True/False, given that the behavior is fixed and it's just a bit of a distraction.
Hi, I am working through the 3rd notebook to read a CSV and am up to this chunk, without any modification:
I get this result:
I was expecting this: