tensorflow / ecosystem

Integration of TensorFlow with other open-source frameworks
Apache License 2.0

BytesList with length 0 or 1 is inferred to have StringType instead of ArrayType #159

Open jukujala opened 4 years ago

jukujala commented 4 years ago

If a BytesList feature in the TFRecords always has a length of 0 or 1, the feature is inferred as StringType instead of ArrayType. Is there a reason for this behavior? Because of it, you can write a DataFrame as TFRecords, but you cannot read those TFRecords back into a DataFrame. A zero-length BytesList is valid in TensorFlow.

Below is the implementation of parseBytesList from https://github.com/tensorflow/ecosystem/blob/master/spark/spark-tensorflow-connector/src/main/scala/org/tensorflow/spark/datasources/tfrecords/TensorFlowInferSchema.scala#L144:

  private def parseBytesList(feature: Feature): DataType = {
    val length = feature.getBytesList.getValueCount

    if (length == 0) {
      null
    }
    else if (length > 1) {
      ArrayType(StringType)
    }
    else {
      StringType
    }
  }
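To make the consequence concrete, here is a minimal sketch that runs without Spark or TensorFlow on the classpath. The `DataType`, `StringType`, and `ArrayType` definitions are stand-ins for Spark's types. `inferFromLength` mirrors the branching of parseBytesList above, with the BytesList reduced to its value count and `null` replaced by `None`. `inferColumn` is a hypothetical merge rule (skip records that carry no information, let any array observation win) and is an assumption for illustration, not the connector's actual merging code.

```scala
// Stand-in types mimicking the shape of Spark's DataType hierarchy (assumption).
sealed trait DataType
case object StringType extends DataType
final case class ArrayType(elementType: DataType) extends DataType

object BytesListInferenceSketch {
  // Same branching as parseBytesList, keyed on the BytesList value count.
  def inferFromLength(length: Int): Option[DataType] =
    if (length == 0) None
    else if (length > 1) Some(ArrayType(StringType))
    else Some(StringType)

  // Hypothetical merge of per-record results into a single column type:
  // empty records are ignored, and one array observation makes the column an array.
  def inferColumn(lengths: Seq[Int]): Option[DataType] = {
    val observed = lengths.flatMap(inferFromLength)
    if (observed.isEmpty) None
    else if (observed.exists(_.isInstanceOf[ArrayType])) Some(ArrayType(StringType))
    else Some(StringType)
  }

  def main(args: Array[String]): Unit = {
    // An array column whose values never exceed one element per record.
    println(inferColumn(Seq(0, 1, 1))) // prints Some(StringType)
    // The same column once a single record holds several values.
    println(inferColumn(Seq(0, 1, 3))) // prints Some(ArrayType(StringType))
  }
}
```

Under these assumptions, an array column that happens to hold at most one value per record comes back as a plain string column, which is the round-trip mismatch described above.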

liusulizzu commented 2 years ago

I also hit this problem. Do you have any solutions?