moj-analytical-services / etl_manager

A python package to create a database on the platform using our moj data warehousing framework

Floating point numbers should be represented using doubles in Athena #65

Closed RobinL closed 5 years ago

RobinL commented 5 years ago

Parquet's 'float' type is a 32-bit float and its 'double' type is a 64-bit float.
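To illustrate the size difference, here is a minimal sketch using only Python's standard `struct` module. It does not touch Parquet or Athena at all; it only shows what 32-bit and 64-bit IEEE 754 encodings of the same value look like, with the sample value `0.1` chosen arbitrarily.

```python
import struct

value = 0.1  # a Python float is a 64-bit double

# Encode the same value as a 32-bit float and as a 64-bit double.
as_float32 = struct.pack("<f", value)
as_float64 = struct.pack("<d", value)

print(len(as_float32), len(as_float64))  # 4 8

# Decode again: the 32-bit encoding cannot hold 0.1 to double precision.
round_tripped_32 = struct.unpack("<f", as_float32)[0]
round_tripped_64 = struct.unpack("<d", as_float64)[0]

print(round_tripped_32 == value)  # False
print(round_tripped_64 == value)  # True
```

The two encodings are different widths on disk, which is why a reader expecting one cannot simply be pointed at the other.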

If you write a Parquet file out from pandas, Athena will refuse to read it if you set the column's datatype to float in the table schema, because the column is stored as a double within the Parquet file.

Your query has the following error(s):

HIVE_BAD_DATA: Field X's type DOUBLE in parquet is incompatible with type float defined in table schema
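One way to avoid this mismatch would be to derive the Athena column type from the pandas dtype, rather than declaring every floating point column as float. The sketch below is only an illustration: `PANDAS_TO_ATHENA` and `athena_type_for` are hypothetical names, not part of etl_manager, and the mapping assumes that pandas `float64` columns (the pandas default) are written as Parquet DOUBLE and `float32` columns as Parquet FLOAT.

```python
# Hypothetical mapping from pandas/NumPy float dtype names to Athena column types.
# Assumption: float64 is stored as Parquet DOUBLE, float32 as Parquet FLOAT.
PANDAS_TO_ATHENA = {
    "float32": "float",
    "float64": "double",
}


def athena_type_for(dtype_name: str) -> str:
    """Return the Athena column type matching a pandas float dtype name."""
    try:
        return PANDAS_TO_ATHENA[dtype_name]
    except KeyError:
        raise ValueError(f"No Athena type known for dtype {dtype_name!r}")


# pandas uses float64 by default, so the table schema should say double.
print(athena_type_for("float64"))  # double
```

With a lookup like this, a column that pandas holds as `float64` would be declared as double in the table schema, which matches what is in the file.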

Parquet data types are enumerated here:

https://drill.apache.org/docs/parquet-format/

and Spark's are here:

http://spark.apache.org/docs/2.2.1/api/python/_modules/pyspark/sql/types.html#FloatType