assertDataFrameApproximateEquals params show when failing

eruizalo commented 9 months ago

Hello @holdenk , I've noticed that since commit 6f812cc9, the way of displaying DataFrames that are not similar has been changed using the assertDataFrameApproximateEquals method. The new approach involves a show, using the default parameters (truncating the text and limiting the number of rows to 20).

I would like to be able to parameterize this "show" in some way so that at least the truncate and numRows parameters may be provided to the assertDataFrameApproximateEquals method. Another option could be to use a lambda DataFrame => Unit param (default lambda would keep current functionality) so that we can customize it per project. This way, we could even convert the DataFrame into JSON if we have complex structures and properly see which fields are not equal.

I've been working on this and other contributions (but I'm not sure if do you want an issue for each feature):

Timestamp tolerance (using java.timeDuration) so a different tolerance can be used for decimals and timestamps in the same dataframe
Struct approx equality (recursive calling): I think if you use tolerance in a struct it currently fails within a valid tolerance
BigDecimal tolerance: #213 didn't merge and now the PR might be conflictive, so I took the liberty of adding it

What do you think?

holdenk commented 9 months ago

Using show sounds good, and custom lambda for comparison sounds reasonable provided we keep the current default comparison.

I'm not so sure about JSON based comparison that sounds slow/brittle to me but could be wrong.

eruizalo commented 9 months ago

Hi @holdenk,

I'm not sure if I explained myself properly and I had the features almost finished, so I think it is better to show them to you.

I tried to split this issue into 4 commits, each commit should contain only a possible feature, so if you don't like any of them it can be removed or updated individually. I may also create different issues/PRs if you prefer that:

Timestamp tolerance - cb0b0ac: Here I try to differ timestamp tolerance from decimal tolerance, this way we can test both tolerances in the same row (decimal tolerance 0 > tol > 1, but timestamp tolerance usually > 1000)
Big Decimal tolerance - 88b173d copy/paste from #213
Enable struct approx equality - 3ebf545: I think it fails when a decimal or a timestamp is within a structure but it should pass the validation due to tolerance
Custom show when failing - 5e77125: Actually if a DataFrame differ from another it simply calls a show() without any param, so big columns are truncated. With a custom show function we may do: df.show(50, false) or df.toJSON.show(false) (this is what I meant when talking about json/big structures). Now is a dev decision how to show the differences and how many (default behaviour remains)

Please, tell me what you think about these features and whether I should separate them into new issues/PRs or if they require any changes.

holdenk commented 9 months ago

Ah that makes more sense sorry for misunderstanding part of the first issue. This sounds reasonable do you have a PR up I can review?

eruizalo commented 9 months ago

Hi @holdenk, I just created #405 let me know your insights

holdenk commented 9 months ago

Fantastic I'll try and take a look either tomorrow or early next week :)

holdenk / spark-testing-base

assertDataFrameApproximateEquals params show when failing #404