linkedin / spark-tfrecord

Read and write Tensorflow TFRecord data from Apache Spark.
BSD 2-Clause "Simplified" License
290 stars 57 forks

How to save TFRecord from RDD[Example]? #41

Closed wuxianxingkong closed 2 years ago

wuxianxingkong commented 2 years ago

In the README, TFRecord files can only be saved using a predefined schema, which is inconvenient when the Features have dynamic columns (keys). With Spark Tensorflow Connector, the code can be very simple:

import org.apache.hadoop.io.{BytesWritable, NullWritable}
import org.tensorflow.example.{Example, Feature, Features, FloatList}
import org.tensorflow.spark.shaded.org.tensorflow.hadoop.io.{TFRecordFileInputFormat, TFRecordFileOutputFormat}

// textRawFeatures for example: feature1:value1,feature2:value2
val rdd = sc.textFile("/xxx/input_path").map(textRawFeatures => {
  val rowFeatures = Features.newBuilder()
  textRawFeatures.split(",").map(_.split(":")).foreach(pair => {
    // store each value as a single-element float list under its key
    rowFeatures.putFeature(pair(0), Feature.newBuilder.setFloatList(FloatList.newBuilder().addValue(pair(1).toFloat)).build)
  })
  (new BytesWritable(Example.newBuilder.setFeatures(rowFeatures.build).build.toByteArray), NullWritable.get)
})
rdd.saveAsNewAPIHadoopFile[TFRecordFileOutputFormat]("/xxx/output_path")

But spark-tfrecord does not include TFRecordFileOutputFormat, and the schema must be predefined as a List[StructType], which means I have to iterate over the RDD to extract all columns (keys) and rebuild the RDD with sorted columns (keys). How can I save TFRecord from RDD[Example] directly?

Additionally, Spark Tensorflow Connector is built with Scala 2.11, which is incompatible with Spark 3.x (built with Scala 2.12), so I can't add it to my pom.xml for backward compatibility.

junshi15 commented 2 years ago

The README example you pointed to is the same as the README here: https://github.com/tensorflow/ecosystem/tree/master/spark/spark-tensorflow-connector#scala-api, so there is no difference between Spark-TFRecord and Spark-Tensorflow-Connector in this regard.

Your example uses tensorflow.hadoop.io, which is a library that Spark-TFRecord is built upon. You can continue to use it if you find it more useful.

I don't know how to save from an RDD directly. Spark-TFRecord is built on DataFrames. Maybe you can convert the RDD to a DataFrame first.
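A minimal sketch of that conversion, assuming the feature values are floats and using one extra pass over the data to discover the full key set (the input/output paths and the `recordType` option follow the original example and the spark-tfrecord README; the default value 0.0f for missing keys is an assumption):

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{FloatType, StructField, StructType}

val spark = SparkSession.builder.appName("rdd-to-tfrecord").getOrCreate()
val sc = spark.sparkContext

// Parse "feature1:value1,feature2:value2" lines into key -> value maps.
val parsed = sc.textFile("/xxx/input_path").map { line =>
  line.split(",").map(_.split(":")).map(p => p(0) -> p(1).toFloat).toMap
}

// One extra pass to collect the full, sorted set of keys: this becomes the schema.
val keys = parsed.flatMap(_.keys).distinct.collect.sorted
val schema = StructType(keys.map(k => StructField(k, FloatType)))

// Rebuild each row in schema order, filling missing keys with a default (assumed 0.0f).
val rows = parsed.map(m => Row.fromSeq(keys.map(k => m.getOrElse(k, 0.0f))))

// Write TFRecord Examples via the spark-tfrecord DataFrame writer.
spark.createDataFrame(rows, schema)
  .write.format("tfrecord").option("recordType", "Example")
  .save("/xxx/output_path")
```

The extra `distinct.collect` pass is the cost of a dynamic schema; once the keys are fixed, the DataFrame writer handles serialization without any predefined StructType in application code.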

Spark-Tensorflow-Connector is not built by us. Please contact its authors if you want a Scala 2.12 version of it.

wuxianxingkong commented 2 years ago

@junshi15 Thanks for your reply. Using org.tensorflow:tensorflow-hadoop:1.15.0 solved my problem.
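For reference, those coordinates as a Maven dependency (a config fragment matching the artifact named above):

```xml
<dependency>
  <groupId>org.tensorflow</groupId>
  <artifactId>tensorflow-hadoop</artifactId>
  <version>1.15.0</version>
</dependency>
```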