holdenk / spark-testing-base

Base classes to use when writing tests with Spark
Apache License 2.0

`assertDataFrameNoOrderEquals` fails to catch some inequalities #318

Closed IvanVergiliev closed 3 years ago

IvanVergiliev commented 4 years ago

The recently introduced `assertDataFrameNoOrderEquals` method fails to catch inequality for some DataFrames. Specifically, if a row is present once in one DataFrame and absent from the other, no error is reported.

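For example, the following two DataFrames are considered equal by the current implementation:

    val input = spark.createDataFrame(Seq(
      (1, "one")
    ))
    val input2 = spark.createDataFrame(Seq(
      (1, "oneone")
    ))
    // Passes, although the single row differs in its second column.
    assertDataFrameNoOrderEquals(input, input2)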

As suggested in the contributing guidelines, I'm opening the issue to match an incoming PR. I've also added a more detailed explanation of the exact failure reason in the commit message.
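
To illustrate the kind of check that would catch this case, here is a minimal sketch in plain Scala collections (no Spark) of an order-insensitive comparison that counts each row on both sides and reports any row whose counts differ, including rows that appear on only one side. The names `NoOrderCompareSketch`, `rowCounts`, and `unequalRows` are hypothetical and not part of spark-testing-base; this is not the library's implementation or the fix in the PR.

```scala
object NoOrderCompareSketch {
  // Hypothetical helper: counts how often each row occurs in a collection.
  def rowCounts[A](rows: Seq[A]): Map[A, Int] =
    rows.groupBy(identity).map { case (row, occurrences) => row -> occurrences.size }

  // Returns (row, countInExpected, countInResult) for every row whose counts differ.
  // Iterating over the union of keys ensures a row that is missing from one side
  // is reported with a count of 0 instead of being silently dropped.
  def unequalRows[A](expected: Seq[A], result: Seq[A]): Seq[(A, Int, Int)] = {
    val expectedCounts = rowCounts(expected)
    val resultCounts = rowCounts(result)
    (expectedCounts.keySet ++ resultCounts.keySet).toSeq.flatMap { row =>
      val e = expectedCounts.getOrElse(row, 0)
      val r = resultCounts.getOrElse(row, 0)
      if (e != r) Some((row, e, r)) else None
    }
  }
}
```

With the rows from the example above, `NoOrderCompareSketch.unequalRows(Seq((1, "one")), Seq((1, "oneone")))` reports both rows as unequal, while two collections holding the same rows in a different order produce an empty result.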

larrykooper commented 4 years ago

I have the same issue. The following assertion passes:

    val dataFrame1 = Seq(
      ("scotch_tape", "Scotch Tape")
    ).toDF("brand_label", "brand_display_name")

    val dataFrame2 = Seq(
      ("scotch_tape", "   ")
    ).toDF("brand_label", "brand_display_name")

    assertDataFrameNoOrderEquals(dataFrame1, dataFrame2)

so1clstl commented 3 years ago

I have the same issue and am currently working around it with `assertDataFrameEquals`. `assertDataFrameDataEquals` seems to be affected as well.