samelamin / spark-bigquery

Google BigQuery support for Spark, Structured Streaming, SQL, and DataFrames with easy Databricks integration.
Apache License 2.0

Class cast exception occurs (Double cannot be cast to Float) #57

Closed smdmts closed 6 years ago

smdmts commented 6 years ago

Hi, I'm trying to analyze Firebase data using Spark with spark-bigquery, but a class cast exception occurs: Double cannot be cast to Float. The Double type exists in the Avro spec, yet the module appears to cast only to Float. (https://avro.apache.org/docs/1.8.1/spec.html)

Would you mind telling me whether this is a bug?

https://support.google.com/firebase/answer/7029846

Error Detail

samelamin commented 6 years ago

Hi @smdmts, correct, we should be casting float to float, not double to float; see the sketch below.

Good find!

Feel free to send a PR in.
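
Roughly, the Avro-to-Spark conversion needs to distinguish the two Avro types. A minimal sketch (the names here are illustrative, not the module's actual internals):

```scala
import org.apache.avro.Schema

// Illustrative sketch only: match on the Avro field type instead of
// assuming every floating-point value is a Float.
def convertValue(value: Any, fieldSchema: Schema): Any = fieldSchema.getType match {
  case Schema.Type.FLOAT  => value.asInstanceOf[Float]   // Avro float  -> Spark FloatType
  case Schema.Type.DOUBLE => value.asInstanceOf[Double]  // Avro double -> Spark DoubleType
  case _                  => value
}
```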

michTalebzadeh commented 5 years ago

Hi, I am using the following import in my Scala code:

```scala
import com.samelamin.spark.bigquery._
```

I have a Hive table imported into BigQuery through an Avro file, and the table is created in BQ as follows:

(screenshot of the BigQuery table schema)

It is pretty simple. The code first tries to load this table:

```scala
// read data from BigQuery table
println("\nreading data from " + fullyQualifiedInputTableId)

val df = spark.sqlContext
  .read
  .format("com.samelamin.spark.bigquery")
  .option("tableReferenceSource", fullyQualifiedInputTableId)
  .load()

df.printSchema

// create a temporary view on the DataFrame
df.createOrReplaceTempView("tmp")
```

OK, this is the output:

```
reading data from axial-glow-224522:accounts.ll_18201960
root
 |-- transactiondate: string (nullable = true)
 |-- transactiontype: string (nullable = true)
 |-- sortcode: string (nullable = true)
 |-- accountnumber: string (nullable = true)
 |-- transactiondescription: string (nullable = true)
 |-- debitamount: float (nullable = true)
 |-- creditamount: float (nullable = true)
 |-- balance: float (nullable = true)
```

The tmp view is created. However, when trying to read debitamount, which is defined as float, I get the following error:

```scala
spark.sql("select transactiondate, transactiontype, sortcode, accountnumber, transactiondescription, debitamount from tmp").collect.foreach(println)
```

```
18/12/27 19:41:59 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2, rhes77-cluster-w-1.europe-west2-a.c.axial-glow-224522.internal, executor 1): java.lang.ClassCastException: java.lang.Double cannot be cast to java.lang.Float
	at scala.runtime.BoxesRunTime.unboxToFloat(BoxesRunTime.java:109)
	at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getFloat(rows.scala:43)
	at org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getFloat(rows.scala:195)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:253)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:836)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:836)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:49)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
```

Is there any workaround for this, please?

Thanks,

Mich

michTalebzadeh commented 5 years ago

Hi,

I now have a workaround for this issue: using Spark DataFrame transformations to cast from String to Date and from String to Double where appropriate, and then saving the data into the BigQuery table.
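
Roughly along these lines (a sketch only; it assumes the Hive-sourced DataFrame holds these columns as strings, and the name dfFromHive and the date pattern are assumptions):

```scala
import org.apache.spark.sql.functions.{col, to_date}

// Sketch of the workaround described above. Assumes the Hive-sourced
// DataFrame (dfFromHive is a hypothetical name) holds the columns as
// strings; the date pattern "yyyy-MM-dd" is also an assumption.
val transformed = dfFromHive
  .withColumn("transactiondate", to_date(col("transactiondate"), "yyyy-MM-dd"))
  .withColumn("debitamount", col("debitamount").cast("double"))
  .withColumn("creditamount", col("creditamount").cast("double"))
  .withColumn("balance", col("balance").cast("double"))

// transformed can then be saved to the BigQuery table via the connector
```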

Let me know your thoughts.

Thanks