gowerc / diffdf

DataFrame Comparison Tool
https://gowerc.github.io/diffdf/

Add tests for niche datetime scenarios #121

Closed gowerc closed 2 weeks ago

gowerc commented 1 month ago

I don't think we have any explicit handling of, or tests for, datetime variables, in particular for when they have different time zones.

Whilst I suspect diffdf should behave as expected, it would be good to add some unit tests to confirm that.

gowerc commented 1 month ago

That's annoying, it looks like it does result in some confusing output:

    library(tibble)
    library(diffdf)

    d1 <- tibble(
        id = c(1, 2),
        dt1 = lubridate::ymd_hms(
            "2024-05-01 14-01-40",
            "2024-05-01 14-01-40",
            tz = "EST"
        )
    )

    d2 <- tibble(
        id = c(1, 2),
        dt1 = lubridate::ymd_hms(
            "2024-05-01 14-01-40",
            "2024-05-01 14-01-40",
            tz = "CET"
        )
    )

    diffdf(d1, d2, "id")

There are columns in BASE and COMPARE with differing attributes !!
  ===============================================
   VARIABLE  ATTR_NAME  VALUES.BASE  VALUES.COMP 
  -----------------------------------------------
     dt1       tzone        EST          CET     
  -----------------------------------------------

Not all Values Compared Equal
  =============================
   Variable  No of Differences 
  -----------------------------
     dt1             2         
  -----------------------------

  ========================================================
   VARIABLE  id         BASE                COMPARE       
  --------------------------------------------------------
     dt1     1   2024-05-01 14:01:40  2024-05-01 14:01:40 
     dt1     2   2024-05-01 14:01:40  2024-05-01 14:01:40 
  --------------------------------------------------------

Whilst it does show in the attributes section that the time zones are different, it is confusing that the times appear the same in the data comparison (this is because, whilst the underlying numerics are different, the as.character display renders each value in its own time zone and drops the zone, so both print as the same clock time).

I'm thinking we need to add a datetime-specific format to show the TZ in the table print.
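As a rough illustration of what such a format could look like, here is a minimal base R sketch. `format_dttm_tz` is a hypothetical helper name invented for this example and is not part of diffdf. The sketch only shows the idea of appending the time zone abbreviation when rendering a POSIXct value.

```r
# Hypothetical helper (not part of diffdf): render a POSIXct value with its
# time zone abbreviation so that different instants no longer print alike.
format_dttm_tz <- function(x) {
    format(x, format = "%Y-%m-%d %H:%M:%S", usetz = TRUE)
}

base <- as.POSIXct("2024-05-01 14:01:40", tz = "EST")
comp <- as.POSIXct("2024-05-01 14:01:40", tz = "CET")

# The default character display is identical for both values ...
format(base) == format(comp)

# ... even though the underlying numerics (seconds since the epoch) differ.
as.numeric(base) == as.numeric(comp)

# With the zone appended, the printed values are distinguishable.
format_dttm_tz(base)
format_dttm_tz(comp)
```

Using a numeric offset (`%z`) instead of the abbreviation would be another option if abbreviations turn out to be ambiguous.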

gowerc commented 1 month ago

Hmmm, this one is even more confusing...

    d1 <- tibble(
        id = c(1, 2),
        dt1 = lubridate::ymd_hms(
            "2024-05-01 14-01-40",
            "2024-05-01 14-01-40",
            tz = "EST"
        )
    )
    d2 <- d1
    d2$dt1 <- lubridate::with_tz(d2$dt1, tzone = "CET")
    diffdf(d1, d2, "id")

Differences found between the objects!

Summary of BASE and COMPARE
  ==================================================================
    PROPERTY             BASE                       COMP            
  ------------------------------------------------------------------
      Name                d1                         d2             
     Class     "tbl_df, tbl, data.frame"  "tbl_df, tbl, data.frame" 
    Rows(#)                2                          2             
   Columns(#)              2                          2             
  ------------------------------------------------------------------

There are columns in BASE and COMPARE with differing attributes !!
  ===============================================
   VARIABLE  ATTR_NAME  VALUES.BASE  VALUES.COMP 
  -----------------------------------------------
     dt1       tzone        EST          CET     
  -----------------------------------------------
> d1
# A tibble: 2 × 2
     id dt1                
  <dbl> <dttm>             
1     1 2024-05-01 14:01:40
2     2 2024-05-01 14:01:40
> d2
# A tibble: 2 × 2
     id dt1                
  <dbl> <dttm>             
1     1 2024-05-01 21:01:40
2     2 2024-05-01 21:01:40

That is, when browsing the data set the values look completely different, but in diffdf the values compare as equal (because the underlying numeric is the same; it's only the TZ that is different). I guess the current handling, albeit confusing, is fine because diffdf does at least catch the fact that the TZ is different.
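The same effect can be reproduced in base R without lubridate, as a minimal sketch. It assumes that changing only the `tzone` attribute of a POSIXct value is a fair stand-in for what `lubridate::with_tz()` does here; the variable names are made up for the example.

```r
d1_dt <- as.POSIXct("2024-05-01 14:01:40", tz = "EST")

# Change only the time zone attribute; the stored number of seconds since
# the epoch is left untouched.
d2_dt <- d1_dt
attr(d2_dt, "tzone") <- "CET"

# The printed clock times differ ...
format(d1_dt)
format(d2_dt)

# ... but both values are the same instant, so a value comparison is equal.
as.numeric(d1_dt) == as.numeric(d2_dt)
d1_dt == d2_dt

# Only the attribute differs, which is what the attributes table reports.
attr(d1_dt, "tzone")
attr(d2_dt, "tzone")
```

This is consistent with the output above: no value differences are listed, and only the `tzone` attribute mismatch is reported.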