Closed schCRABicus closed 3 years ago
Thanks for your fix.
For the example above, can you show the result after your fix? What would `loadedWithSchemaDf.show` look like?
@junshi15 , yes, with the fix applied, the output looks as follows (it corresponds to the initial dataset written to file):
```
+---+----+----+----+----+----+
| id| foo| bar| baz| bcs| abc|
+---+----+----+----+----+----+
| 11|   1|   2|null|null|null|
| 21|   4|null|   3|null|null|
| 31|null|null|null|   8|   7|
| 41|   6|null|   5|null|null|
+---+----+----+----+----+----+
```
Also, I've provided a dedicated test case as part of the PR, and it shows the expected behaviour. Without the changes in `TFRecordDeserializer`, the test case fails because the actual value of `expectedInternalRow2` is

```scala
InternalRow.fromSeq(
  Array[Any](10.0F, 1, null)
)
```

instead of the expected

```scala
val expectedInternalRow2 = InternalRow.fromSeq(
  Array[Any](null, 1, null)
)
```

i.e. the second row inherits `FloatLabel` from the first deserialised record.
Thanks for your contribution!
@junshi15 , may I ask you to please publish the new artifact so that we can start using it in production code? Thank you!
It's already published here: https://search.maven.org/search?q=a:spark-tfrecord_2.12
Or are you using Scala 2.11? Spark 3.x uses Scala 2.12. For Scala 2.11, I need to build with Spark 2.4 or Spark 2.3.
I was not able to publish to Bintray since it is deprecated https://jfrog.com/center-sunset/.
Got it, thank you! I'm using 2.12, so everything is fine, I just didn't see it before, sorry. Thank you!
Problem
We have an issue when deserialising tfrecords. It occurs when sequential records contain different sets of features: each subsequent row inherits the missing values from the preceding row, which leads to incorrect deserialisation.
Here is the example which highlights the issue:
The outcome is that the initial `explodedDf.show()` prints the correct dataset, while the dataset read back from the tfrecord file and printed by `loadedWithSchemaDf.show()` looks as follows. Note that every row from the second one onwards has inherited the missing column data from the preceding rows, resulting in an incorrect dataset.
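The symptom can be reproduced without Spark at all. The following is a minimal sketch (class and method names are illustrative, not taken from the actual library) of a deserialiser that reuses one mutable result buffer across records, which is exactly the kind of state sharing described above:

```scala
// A deserialiser that keeps a single mutable result buffer and reuses it
// for every record. Records are modelled as a map of column index -> value;
// a missing key stands for a missing feature.
class ReusingDeserializer(numCols: Int) {
  private val resultRow = new Array[Any](numCols) // shared mutable state

  def deserialize(record: Map[Int, Any]): Array[Any] = {
    // Only the features present in this record are written; columns that
    // are absent keep whatever value the previous record left behind.
    record.foreach { case (i, v) => resultRow(i) = v }
    resultRow.clone()
  }
}

val d = new ReusingDeserializer(3)
val first  = d.deserialize(Map(0 -> 11, 1 -> 1, 2 -> 2)) // all features present
val second = d.deserialize(Map(0 -> 21, 1 -> 4))         // feature 2 missing
// second(2) is 2, inherited from the first record instead of being null
```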
Root Cause
The root cause of the issue is the use of a private variable (i.e. state shared across multiple deserialisations) for the result row in `TFRecordDeserializer`:

```scala
private val resultRow = new SpecificInternalRow(dataSchema.map(_.dataType))
```

Because of this, each subsequent record starts with all columns pre-filled, so any column missing from the current record keeps the value left over from the previous record's deserialisation.

Solution
I suggest we initialise the result row for each record being deserialised. This solves the issue for us.
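The fix can be sketched in the same Spark-free model as above (names are illustrative, this is not the actual PR diff): allocating a fresh result buffer per record means absent features stay null instead of inheriting stale values.

```scala
// Per-record allocation: every call starts from an all-null buffer, so a
// missing feature deserialises to null regardless of the previous record.
class PerRecordDeserializer(numCols: Int) {
  def deserialize(record: Map[Int, Any]): Array[Any] = {
    val resultRow = new Array[Any](numCols) // fresh buffer, all columns null
    record.foreach { case (i, v) => resultRow(i) = v }
    resultRow
  }
}

val d = new PerRecordDeserializer(3)
val first  = d.deserialize(Map(0 -> 11, 1 -> 1, 2 -> 2)) // all features present
val second = d.deserialize(Map(0 -> 21, 1 -> 4))         // feature 2 missing
// second(2) is now null, as expected
```

The allocation cost per record is the trade-off; resetting the shared row to nulls before each record would achieve the same correctness if allocation pressure were a concern.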