linkedin / spark-tfrecord

Read and write Tensorflow TFRecord data from Apache Spark.
BSD 2-Clause "Simplified" License

Decoding a BytesList #67

Closed albertoandreottiATgmail closed 1 year ago

albertoandreottiATgmail commented 1 year ago

Hello!

I have the following situation: I'm reading a TF Example in which one of the columns is a BytesList, and I can read it as a Java String. Now I would like to decode the original binary data, which is a protobuf. So I go with

myString.getBytes()

and pass that to the parseFrom() method of my Java class (as generated by the proto compiler). This is not working; I'm getting

CodedInputStream encountered a malformed varint.

My question is: is this the right way to recover the binary buffer? Or is it possible I'm breaking it somewhere along the way?

Thanks!

junshi15 commented 1 year ago

Are you using Spark-TFRecord to read the protobuf? If so, you should just do

spark.read.format("tfrecord").option("recordType", "Example").load(path)

If you are parsing the bytes yourself, then your question is not related to Spark-TFRecord.

albertoandreottiATgmail commented 1 year ago

Yep, I'm using spark-tfrecord. The thing is, I believe encoding the binary buffer to a String and then converting it back to a byte array may be corrupting the data because of the conversions between different encodings... just that.
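
For illustration, a minimal sketch in Scala (the byte values are made up): round-tripping arbitrary bytes through a String is lossy, because byte sequences that are invalid in the chosen charset get replaced during decoding.

import java.nio.charset.StandardCharsets
import java.util.Arrays

// Hypothetical bytes standing in for a serialized protobuf payload.
val original: Array[Byte] = Array(0x08, 0x96, 0x01, 0xff).map(_.toByte)

// Decoding to a String and re-encoding is lossy: invalid UTF-8 sequences
// (the lone 0x96 and 0xff bytes here) are replaced with U+FFFD.
val asString = new String(original, StandardCharsets.UTF_8)
val roundTripped = asString.getBytes(StandardCharsets.UTF_8)

Arrays.equals(original, roundTripped) // false -- corrupted, hence the malformed varint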

albertoandreottiATgmail commented 1 year ago

In case it helps others, I just forced the schema for the column to be

ArrayType(BinaryType)

and then the binary data became usable.
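
A minimal sketch of that workaround in Scala, in case it helps (the column name "payload", the path, and the generated MyProto class are placeholders, not from this issue):

import org.apache.spark.sql.types.{ArrayType, BinaryType, StructField, StructType}

// Force the BytesList column to binary instead of letting it be inferred as string.
val schema = StructType(Seq(
  StructField("payload", ArrayType(BinaryType))
))

val df = spark.read
  .format("tfrecord")
  .option("recordType", "Example")
  .schema(schema)
  .load(path)

// Each row now holds the raw bytes, which can go straight to the protobuf parser
// without any charset round-trip.
val bytes = df.select("payload").head.getSeq[Array[Byte]](0).head
val message = MyProto.parseFrom(bytes) // MyProto: class generated by protoc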

Thanks!