linkedin / spark-tfrecord

Read and write Tensorflow TFRecord data from Apache Spark.
BSD 2-Clause "Simplified" License

Error: java.lang.ClassCastException: com.linkedin.spark.shaded.org.tensorflow.example.FeatureList cannot be cast to com.linkedin.spark.shaded.org.tensorflow.example.Feature #44

Open nitinware opened 2 years ago

nitinware commented 2 years ago

I am trying to write a Spark DataFrame in TFRecord format:

    df.write.mode("overwrite").format("tfrecord").option("recordType", "tfrecords").save(outputPath + '/tf-records/')

I am running on a GCP Dataproc cluster, which ships with Spark 3.1.2, and I am using the spark-tfrecord jar spark-tfrecord_2.12-0.3.4.jar.

I see the following error on the write operation:

22/01/21 05:33:13 ERROR org.apache.spark.util.Utils: Aborting task
java.lang.IllegalArgumentException: Unsupported recordType tfrecords: recordType can be Example or SequenceExample
    at com.linkedin.spark.datasources.tfrecord.TFRecordOutputWriter.write(TFRecordOutputWriter.scala:33)
    at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.write(FileFormatDataWriter.scala:140)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$executeTask$1(FileFormatWriter.scala:278)
    at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1473)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:286)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWriter.scala:210)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)

I would appreciate your input on this issue. Thanks.

junshi15 commented 2 years ago

The error message is clear: recordType can be Example or SequenceExample.

Instead of .option("recordType", "tfrecords"), you should use .option("recordType", "Example") or .option("recordType", "SequenceExample").

Please take a look at the README file. https://github.com/linkedin/spark-tfrecord#features
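As a rough sketch of the fix above, the write call can be wrapped in a small helper that rejects an unsupported recordType on the driver, before any task starts. The helper name `write_tfrecord` and the constant `VALID_RECORD_TYPES` are made up for illustration and are not part of spark-tfrecord. The two accepted values are taken from the error message in this thread; other versions of the library may accept more.

```python
# Values taken from the error message in this thread (assumption: other
# spark-tfrecord versions may accept additional record types).
VALID_RECORD_TYPES = ("Example", "SequenceExample")


def write_tfrecord(df, path, record_type="Example"):
    """Write `df` with the tfrecord data source after checking `record_type`.

    `df` is expected to be a Spark DataFrame; this hypothetical helper only
    uses its `write` attribute.
    """
    if record_type not in VALID_RECORD_TYPES:
        raise ValueError(
            f"Unsupported recordType {record_type!r}: "
            f"expected one of {VALID_RECORD_TYPES}"
        )
    (
        df.write.mode("overwrite")
        .format("tfrecord")
        .option("recordType", record_type)
        .save(path)
    )
```

With the DataFrame from the original post, the call would look like write_tfrecord(df, outputPath + '/tf-records/', "Example").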

nitinware commented 2 years ago

Thanks for the quick response. I am now seeing the error below and would appreciate your input:


java.lang.ClassCastException: com.linkedin.spark.shaded.org.tensorflow.example.FeatureList cannot be cast to com.linkedin.spark.shaded.org.tensorflow.example.Feature
    at com.linkedin.spark.datasources.tfrecord.TFRecordSerializer.$anonfun$serializeExample$1(TFRecordSerializer.scala:22)
    at com.linkedin.spark.datasources.tfrecord.TFRecordSerializer.$anonfun$serializeExample$1$adapted(TFRecordSerializer.scala:19)
    at scala.collection.immutable.Range.foreach(Range.scala:158)
    at com.linkedin.spark.datasources.tfrecord.TFRecordSerializer.serializeExample(TFRecordSerializer.scala:19)
    at com.linkedin.spark.datasources.tfrecord.TFRecordOutputWriter.write(TFRecordOutputWriter.scala:29)
    at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.write(FileFormatDataWriter.scala:140)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$executeTask$1(FileFormatWriter.scala:278)
    at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1473)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:286)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWriter.scala:210)

junshi15 commented 2 years ago

I am guessing your data is "SequenceExample", but you are trying to write it as "Example".
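One way to check this guess is to look at the DataFrame schema before writing. The sketch below assumes that a column typed as an array of arrays is what gets serialized as a FeatureList, and therefore needs "SequenceExample"; the thread does not confirm this, so verify it against the README. Both function names are hypothetical. The input is a list of (column name, type string) pairs, the shape returned by PySpark's `DataFrame.dtypes`.

```python
def nested_array_columns(dtypes):
    """Return names of columns whose Spark type string is an array of arrays.

    `dtypes` is a list of (column_name, type_string) pairs, e.g.
    ("tokens", "array<array<bigint>>").
    """
    return [name for name, type_str in dtypes if type_str.startswith("array<array<")]


def suggest_record_type(dtypes):
    """Suggest "SequenceExample" if any column is a nested array, else "Example".

    Assumption (not confirmed in this thread): nested arrays are the columns
    that end up as FeatureList values.
    """
    return "SequenceExample" if nested_array_columns(dtypes) else "Example"


# Hypothetical schema for illustration; with a real DataFrame, pass df.dtypes.
sample_dtypes = [("id", "bigint"), ("tokens", "array<array<bigint>>")]
print(nested_array_columns(sample_dtypes), suggest_record_type(sample_dtypes))
```

If the suggestion is "SequenceExample", the same value would go into .option("recordType", ...) on the write call.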