Open mkrasmus opened 3 years ago
This appears to be by design. When you print the schema, whether it was inferred or explicitly defined, the nullable property reflects what Spark determined from the data source itself, not what you declared. References here and here.
Another workaround, according to this JIRA ticket, is to read into an RDD instead of a DataFrame in order to properly apply the nullability of the schema. Using RDDs instead of DataFrames has its own downsides, and this workaround also results in null values being replaced with defaults (e.g. 0 instead of null for integers).
The general consensus appears to be that you should not rely on enforcing nullability in the schema, but should allow all values/non-values and then use methods like df.na.drop() or fillna() to handle null values, or throw an exception during processing if a value is invalid.
All well and good, but consider the notebook in terms of a learning resource.
Fair point. I should just modify the schema to set nullable to True, due to this behavior.
Or just remove the content about that option/argument for setting nullable to True/False, given that the behavior is fixed and it's just a bit of a distraction.
Hi, I am working through the 3rd notebook to read a CSV and am up to this chunk, without any modification:
I get this result:
I was expecting this: