linkedin / spark-tfrecord

Read and write Tensorflow TFRecord data from Apache Spark.
BSD 2-Clause "Simplified" License

How to save tfrecord data in SequenceExample format? #42

Closed. ZhouM1118 closed this issue 2 years ago

ZhouM1118 commented 2 years ago

I saved TFRecord data in SequenceExample format with writeDF.write.mode(writeMode).format("tfrecord").option("recordType", "SequenceExample"), then read the data back with

import tensorflow as tf
from tensorflow_serving.apis import input_pb2


def read_and_print_tf_record(target_filename, num_of_examples_to_read):
    filenames = [target_filename]
    tf_record_dataset = tf.data.TFRecordDataset(filenames)

    # Parse each raw record as a tf.train.Example and print it.
    for raw_record in tf_record_dataset.take(num_of_examples_to_read):
        print("=======")
        example = tf.train.Example()
        example.ParseFromString(raw_record.numpy())
        print(example)

The output is

features {
  feature {
    key: "f1"
    value {
      float_list {
        value: 0.0
        value: 0.0
      }
    }
  }
  feature {
    key: "f1"
    value {
      float_list {
        value: -0.20067068934440613
        value: -0.35234645432452613
      }
    }
  }
  feature {
    key: "f2"
    value {
      float_list {
        value: 1.2820905447006226
        value: 1.6754634565634563
      }
    }
  }
}

but the output I expected is

sequence_example {
  context {
    feature {
      key: "f1"
      value { float_list { value: 0.0 } }
    }
  }
  feature_lists {
    feature_list {
      key: "f2"
      value {
        feature { float_list { value: -0.20067068934440613 } }
        feature { float_list { value: -0.35234645432452613 } }
      }
    }
    feature_list {
      key: "f3"
      value {
        feature { float_list { value: 1.2820905447006226 } }
        feature { float_list { value: 1.6754634565634563 } }
      }
    }
  }
}

like the SequenceExample format in https://github.com/tensorflow/ranking/blob/master/tensorflow_ranking/python/data.py. How can I save TFRecord data in the SequenceExample format shown above? Thanks for your reply and help!

junshi15 commented 2 years ago

If your data is in SequenceExample format, shouldn't you use tf.train.SequenceExample() in the line below when you read the records back?

example = tf.train.Example() ==> example = tf.train.SequenceExample()
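
The one-line change above could be sketched in full as follows. This is a minimal sketch, not code from the repository: the function name read_sequence_examples is hypothetical, and it assumes TensorFlow 2.x with eager execution so that raw_record.numpy() is available.

```python
import tensorflow as tf


def read_sequence_examples(target_filename, num_of_examples_to_read):
    """Parse the first few records of a TFRecord file as SequenceExample protos.

    Hypothetical helper for illustration; it returns the parsed protos
    instead of printing them so the caller can inspect individual fields.
    """
    dataset = tf.data.TFRecordDataset([target_filename])
    parsed = []
    for raw_record in dataset.take(num_of_examples_to_read):
        # A SequenceExample keeps per-record features in `context` and
        # per-step features in `feature_lists`.
        sequence_example = tf.train.SequenceExample()
        sequence_example.ParseFromString(raw_record.numpy())
        parsed.append(sequence_example)
    return parsed
```

Printing each returned proto should then show the context and feature_lists blocks rather than a flat features block, provided the file really was written with recordType set to SequenceExample.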

ZhouM1118 commented 2 years ago

I solved the problem, thanks a lot. I have another question: can I generate TFRecord files in ELWC (ExampleListWithContext) format using spark-tfrecord? For example:

serialized = [
    example_list_with_context = {
      context {
        features {
          feature {
            key: "query_length"
            value { int64_list { value: 3 } }
          }
        }
      }
      examples {
        features {
          feature {
            key: "unigrams"
            value { bytes_list { value: "tensorflow" } }
          }
          feature {
            key: "utility"
            value { float_list { value: 0.0 } }
          }
        }
      }
      examples {
        features {
          feature {
            key: "unigrams"
            value { bytes_list { value: ["learning", "to", "rank"] } }
          }
          feature {
            key: "utility"
            value { float_list { value: 1.0 } }
          }
        }
      }
    }
    example_list_with_context = {
      context {
        features {
          feature {
            key: "query_length"
            value { int64_list { value: 2 } }
          }
        }
      }
      examples {
        features {
          feature {
            key: "unigrams"
            value { bytes_list { value: ["gbdt"] } }
          }
          feature {
            key: "utility"
            value { float_list { value: 0.0 } }
          }
        }
      }
      examples {
        features {
          feature {
            key: "unigrams"
            value { bytes_list { value: ["neural", "networks"] } }
          }
          feature {
            key: "utility"
            value { float_list { value: 1.0 } }
          }
        }
      }
    }
  ]

junshi15 commented 2 years ago

I don't have experience with ELWC format. If it can be re-cast to SequenceExample or Example, then it is easy to generate. But if not, then it may be hard.
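
One possible re-cast, sketched under assumptions that the thread does not confirm: the ELWC context becomes the SequenceExample context, and each per-document feature becomes a feature_list with one entry per document. The helper name elwc_to_sequence_example_row and the dict-based input shape are hypothetical, and a document that lacks a feature is padded with an empty list. The result is a SequenceExample-shaped record, not a true ELWC proto, so it only helps if the downstream reader accepts SequenceExample input.

```python
def elwc_to_sequence_example_row(context, examples):
    """Re-cast one ELWC-shaped record into a SequenceExample-shaped row.

    context:  dict of feature name -> list of values (query-level features)
    examples: list of dicts, one per document, feature name -> list of values
    Returns (context, feature_lists), where feature_lists maps each
    per-document feature name to a list with one entry per document.
    """
    keys = sorted({key for example in examples for key in example})
    feature_lists = {
        # Assumption: a document that lacks a feature contributes an empty list.
        key: [list(example.get(key, [])) for example in examples]
        for key in keys
    }
    return dict(context), feature_lists


# The first ELWC record from the question, written as plain Python values.
context, feature_lists = elwc_to_sequence_example_row(
    {"query_length": [3]},
    [
        {"unigrams": ["tensorflow"], "utility": [0.0]},
        {"unigrams": ["learning", "to", "rank"], "utility": [1.0]},
    ],
)
print(feature_lists["unigrams"])  # [['tensorflow'], ['learning', 'to', 'rank']]
print(feature_lists["utility"])   # [[0.0], [1.0]]
```

Rows shaped like this could then be loaded into a DataFrame with array-of-array columns for the feature lists and written with the recordType option set to SequenceExample, as in the first question.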

ZhouM1118 commented 2 years ago

OK, I got it. Thanks again for your help and reply!