linkedin / spark-tfrecord

Read and write TensorFlow TFRecord data from Apache Spark.
BSD 2-Clause "Simplified" License

Serializing ByteArray without TFExample feature wrapping #59

Closed. yozenliu closed this issue 1 year ago.

yozenliu commented 1 year ago

Hi,

We are using this repo to write protos to TFRecord files. The protos can be serialized into byte arrays for writing. However, with .write.format("tfrecord"), the only supported recordType values are Example and SequenceExample. The Example option stores our proto byte array as a feature inside a TFExample, whereas what we want is for each proto byte array to be written as its own record. This causes complications and a slowdown when the TFRecord files are read in the next step.

For this specific problem, I think we've come up with an easy workaround: add a byte array option with customized serialization for it. We can submit a PR, but we'd first like to check with the repo owners whether you have other ideas or solutions to this problem. Thanks!

junshi15 commented 1 year ago

You already serialize the proto to a byte array, and then you want to use this library to write it to files? If my understanding is correct, this is not the intended use of this library. We assume the data is defined in a DataFrame and is then serialized to either Example or SequenceExample. These are the only formats supported by TFRecord. As far as I know, TFRecord does not support arbitrary proto schemas.

yozenliu commented 1 year ago

Hi @junshi15, thanks for the quick reply! TFRecord is a convenient format for our framework to store and read features, data, and protos. IIUC, in the TFRecordOutputWriter.scala code, the TFExamples are eventually serialized to byte arrays too before being written to tfrecord files: writer.write(record.toByteArray). Given that, it could also be reasonable for users to provide byte arrays directly to the writer.

I'm not sure if you'd want to add this support, but I drafted and tested a simple solution, https://github.com/linkedin/spark-tfrecord/pull/60, if you'd like to take a look. I think this could help extend the usage of this framework beyond TF Examples.
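
To illustrate the point about the writer: the TFRecord container frames each record as a length-prefixed, checksummed byte string, so the payload itself can be any bytes. The following is a minimal pure-Python sketch of that framing, not the code in TFRecordOutputWriter.scala. The helper names (crc32c, masked_crc, write_record, read_records) are hypothetical and exist only for this illustration.

```python
import io
import struct

# Reflected Castagnoli polynomial used by CRC-32C.
_CRC32C_POLY = 0x82F63B78


def _make_table():
    """Precompute the 256-entry lookup table for byte-wise CRC-32C."""
    table = []
    for i in range(256):
        crc = i
        for _ in range(8):
            crc = (crc >> 1) ^ _CRC32C_POLY if crc & 1 else crc >> 1
        table.append(crc)
    return table


_TABLE = _make_table()


def crc32c(data: bytes) -> int:
    """Plain (unmasked) CRC-32C of data."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc = _TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def masked_crc(data: bytes) -> int:
    """TFRecord masks each CRC: rotate right by 15 bits, then add a constant."""
    crc = crc32c(data)
    rotated = ((crc >> 15) | (crc << 17)) & 0xFFFFFFFF
    return (rotated + 0xA282EAD8) & 0xFFFFFFFF


def write_record(out, payload: bytes) -> None:
    """Append one record: length, CRC of length, payload, CRC of payload."""
    header = struct.pack("<Q", len(payload))  # little-endian uint64 length
    out.write(header)
    out.write(struct.pack("<I", masked_crc(header)))
    out.write(payload)
    out.write(struct.pack("<I", masked_crc(payload)))


def read_records(inp):
    """Yield the raw payload of every record, verifying both checksums."""
    while True:
        header = inp.read(8)
        if not header:
            return  # clean end of stream
        if len(header) != 8:
            raise ValueError("truncated length field")
        (length,) = struct.unpack("<Q", header)
        (length_crc,) = struct.unpack("<I", inp.read(4))
        if length_crc != masked_crc(header):
            raise ValueError("corrupt length checksum")
        payload = inp.read(length)
        (data_crc,) = struct.unpack("<I", inp.read(4))
        if data_crc != masked_crc(payload):
            raise ValueError("corrupt payload checksum")
        yield payload


# Any byte strings can be framed, e.g. protos serialized elsewhere.
demo = io.BytesIO()
for blob in (b"\x08\x01", b"arbitrary proto bytes"):
    write_record(demo, blob)
demo.seek(0)
recovered = list(read_records(demo))
```

In this sketch the framing adds a fixed 16 bytes per record (8 for the length plus two 4-byte checksums) and never inspects the payload. Wrapping the same bytes in a TFExample would add a further layer that the reader must parse.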

junshi15 commented 1 year ago

Thanks for your PR. It has been merged. v0.5.1 has the new feature.
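
For readers landing on this thread, here is a hedged PySpark sketch of how the merged feature might be used. It assumes that the new record type is selected with the option value "ByteArray" and that the DataFrame holds a single binary column named "byteArray". Both names are assumptions that are not confirmed in this thread, so check them against the README for your version. The function names are hypothetical.

```python
# Assumed option value and column name; verify against the library README.
BYTE_ARRAY_RECORD_TYPE = "ByteArray"
BYTE_ARRAY_COLUMN = "byteArray"


def write_byte_array_records(df, path):
    """Write each binary value in df as one raw TFRecord record.

    df is expected to be a Spark DataFrame with a single BinaryType column.
    """
    (
        df.write.format("tfrecord")
        .option("recordType", BYTE_ARRAY_RECORD_TYPE)
        .mode("overwrite")
        .save(path)
    )


def read_byte_array_records(spark, path):
    """Load raw TFRecord records back as a DataFrame of byte arrays."""
    return (
        spark.read.format("tfrecord")
        .option("recordType", BYTE_ARRAY_RECORD_TYPE)
        .load(path)
    )
```

With an active SparkSession named spark and spark-tfrecord v0.5.1 or later on the classpath, a call might look like write_byte_array_records(spark.createDataFrame([(bytearray(proto_bytes),)], [BYTE_ARRAY_COLUMN]), "/tmp/protos.tfrecord"), where proto_bytes is a proto serialized by your own code.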