rjurney / Agile_Data_Code_2

Code for Agile Data Science 2.0, O'Reilly 2017, Second Edition
http://bit.ly/agile_data_science
MIT License

SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-1: malformed \N character escape #30

Closed by kelvinksau 7 years ago

kelvinksau commented 7 years ago
airlines = spark.read.format('com.databricks.spark.csv')\
  .options(header='false', nullValue='\N')\
  .load('data/airlines.csv')
airlines.show()

should be changed to the following, with the backslash in nullValue escaped:

airlines = spark.read.format('com.databricks.spark.csv')\
  .options(header='false', nullValue='\\N')\
  .load('data/airlines.csv')
airlines.show()
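
For context, here is a minimal sketch of why the original literal fails, using only plain Python (no Spark required). In Python 3, `\N` in a normal string literal begins a named Unicode escape such as `\N{BULLET}`, so a bare `'\N'` cannot be compiled. The names `broken_source`, `compiled`, `escaped`, and `raw` are illustrative and do not come from the book's code.

```python
# A bare '\N' in a normal string literal is a malformed named Unicode escape,
# so the source text below fails to compile. It is held in a raw string here
# so that this example itself stays valid.
broken_source = r"nullValue = '\N'"

try:
    compile(broken_source, "<example>", "exec")
    compiled = True
except SyntaxError as err:
    compiled = False
    error_message = str(err)  # the "(unicode error) ..." text from the issue title

# Two equivalent ways to write a literal backslash followed by N:
escaped = '\\N'   # escape the backslash
raw = r'\N'       # or use a raw string

# Both produce the same two-character string.
same_value = (escaped == raw)
length = len(escaped)
```

Either spelling should pass the same two-character value to `nullValue`; the escaped form is the one proposed in this issue.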
rjurney commented 7 years ago

It looks like your version omits the schema that we need to load this data, so the fix ends up being:

airlines = spark.read.format('com.databricks.spark.csv')\
  .options(header='false', nullValue='\\N')\
  .schema(schema)\
  .load('data/airlines.csv')
airlines.show()

See https://github.com/rjurney/Agile_Data_Code_2/commit/014f07608e5beaaa1e8292dc3fb9460eb317d298 and https://github.com/rjurney/Agile_Data_Code_2/commit/a5af14613a0008da49795c9cc176258b3776282d