insightsengineering / teal.data

Data model for teal applications
https://insightsengineering.github.io/teal.data/

two print outputs when printing `TealDataset` object #68

Closed pawelru closed 2 years ago

pawelru commented 2 years ago

https://github.com/insightsengineering/teal.data/blob/4588016464044cfca5cbd8df75cd525c035cdc28/R/TealDataset.R#L126-L130

The code above prints twice in most cases. For example, open the pkgdown documentation for cdisc_dataset: the examples section shows two print outputs.

library(scda)

ADSL <- synthetic_cdisc_data("latest")$adsl

cdisc_dataset("ADSL", ADSL, metadata = list(type = "scda", date = "latest"))
#> A CDISCTealDataset object containing the following data.frame (400 rows and 44 columns):
#>   STUDYID               USUBJID SUBJID SITEID AGE  AGEU SEX
#> 1 AB12345  AB12345-CHN-3-id-128 id-128  CHN-3  32 YEARS   M
#> 2 AB12345 AB12345-CHN-15-id-262 id-262 CHN-15  35 YEARS   M
#> 3 AB12345  AB12345-RUS-3-id-378 id-378  RUS-3  30 YEARS   F
#> 4 AB12345 AB12345-CHN-11-id-220 id-220 CHN-11  26 YEARS   F
#> 5 AB12345  AB12345-CHN-7-id-267 id-267  CHN-7  40 YEARS   M
#> 6 AB12345 AB12345-CHN-15-id-201 id-201 CHN-15  49 YEARS   M
#>                        RACE                 ETHNIC COUNTRY DTHFL         INVID
#> 1                     ASIAN NOT HISPANIC OR LATINO     CHN     N  INV ID CHN-3
#> 2 BLACK OR AFRICAN AMERICAN NOT HISPANIC OR LATINO     CHN     N INV ID CHN-15
#> 3                     ASIAN NOT HISPANIC OR LATINO     RUS     N  INV ID RUS-3
#> 4                     ASIAN NOT HISPANIC OR LATINO     CHN     N INV ID CHN-11
#> 5                     ASIAN                UNKNOWN     CHN     N  INV ID CHN-7
#> 6                     ASIAN NOT HISPANIC OR LATINO     CHN     N INV ID CHN-15
#>           INVNAM            ARM ARMCD         ACTARM ACTARMCD         TRT01P
#> 1  Dr. CHN-3 Doe      A: Drug X ARM A      A: Drug X    ARM A      A: Drug X
#> 2 Dr. CHN-15 Doe C: Combination ARM C C: Combination    ARM C C: Combination
#> 3  Dr. RUS-3 Doe C: Combination ARM C C: Combination    ARM C C: Combination
#> 4 Dr. CHN-11 Doe     B: Placebo ARM B     B: Placebo    ARM B     B: Placebo
#> 5  Dr. CHN-7 Doe     B: Placebo ARM B     B: Placebo    ARM B     B: Placebo
#> 6 Dr. CHN-15 Doe C: Combination ARM C C: Combination    ARM C C: Combination
#>           TRT01A REGION1 STRATA1 STRATA2    BMRKR1 BMRKR2 ITTFL SAFFL BMEASIFL
#> 1      A: Drug X    Asia       C      S2 14.424934 MEDIUM     Y     Y        Y
#> 2 C: Combination    Asia       C      S1  4.055463    LOW     Y     Y        N
#> 3 C: Combination Eurasia       A      S1  2.803240   HIGH     Y     Y        Y
#> 4     B: Placebo    Asia       B      S2 10.262734 MEDIUM     Y     Y        Y
#> 5     B: Placebo    Asia       C      S1  6.206763    LOW     Y     Y        N
#> 6 C: Combination    Asia       C      S2  6.906799 MEDIUM     Y     Y        Y
#>   BEP01FL     RANDDT             TRTSDTM             TRTEDTM       EOSSTT
#> 1       Y 2019-02-22 2019-02-24 11:09:18 2021-02-23 22:47:42    COMPLETED
#> 2       N 2019-02-26 2019-02-26 09:05:00 2021-02-25 20:43:24    COMPLETED
#> 3       N 2019-02-24 2019-02-28 03:19:08 2021-02-27 14:57:32    COMPLETED
#> 4       Y 2019-02-27 2019-03-01 13:33:03 2021-03-01 01:11:27    COMPLETED
#> 5       N 2019-03-01 2019-03-02 00:09:16 2021-03-01 11:47:40    COMPLETED
#> 6       N 2019-03-05 2019-03-05 15:23:44 2021-02-17 20:23:53 DISCONTINUED
#>         EOTSTT      EOSDT EOSDY          DCSREAS DTHDT DTHCAUS DTHCAT LDDTHELD
#> 1    COMPLETED 2021-02-23   731             <NA>  <NA>    <NA>   <NA>       NA
#> 2    COMPLETED 2021-02-25   731             <NA>  <NA>    <NA>   <NA>       NA
#> 3    COMPLETED 2021-02-27   731             <NA>  <NA>    <NA>   <NA>       NA
#> 4    COMPLETED 2021-03-01   731             <NA>  <NA>    <NA>   <NA>       NA
#> 5    COMPLETED 2021-03-01   731             <NA>  <NA>    <NA>   <NA>       NA
#> 6 DISCONTINUED 2021-02-17   716 LACK OF EFFICACY  <NA>    <NA>   <NA>       NA
#>   LDDTHGR1   LSTALVDT DTHADY study_duration_secs
#> 1     <NA> 2021-03-05     NA            63113904
#> 2     <NA> 2021-03-15     NA            63113904
#> 3     <NA> 2021-03-15     NA            63113904
#> 4     <NA> 2021-03-17     NA            63113904
#> 5     <NA> 2021-03-25     NA            63113904
#> 6     <NA> 2021-03-01     NA            63113904
#> 
#> ...
#> # A tibble: 6 × 44
#>   STUDYID USUBJID     SUBJID SITEID   AGE AGEU  SEX   RACE  ETHNIC COUNTRY DTHFL
#>   <chr>   <chr>       <chr>  <chr>  <int> <fct> <fct> <fct> <fct>  <fct>   <fct>
#> 1 AB12345 AB12345-CH… id-11  CHN-9     28 YEARS F     NATI… HISPA… CHN     N    
#> 2 AB12345 AB12345-CH… id-352 CHN-16    28 YEARS M     ASIAN HISPA… CHN     N    
#> 3 AB12345 AB12345-CH… id-186 CHN-1     27 YEARS M     ASIAN NOT H… CHN     N    
#> 4 AB12345 AB12345-CH… id-371 CHN-1     28 YEARS F     ASIAN NOT H… CHN     Y    
#> 5 AB12345 AB12345-CH… id-233 CHN-1     36 YEARS F     BLAC… NOT H… CHN     N    
#> 6 AB12345 AB12345-US… id-131 USA-12    44 YEARS F     AMER… NOT H… USA     N    
#> # … with 33 more variables: INVID <chr>, INVNAM <chr>, ARM <fct>, ARMCD <fct>,
#> #   ACTARM <fct>, ACTARMCD <fct>, TRT01P <fct>, TRT01A <fct>, REGION1 <fct>,
#> #   STRATA1 <fct>, STRATA2 <fct>, BMRKR1 <dbl>, BMRKR2 <fct>, ITTFL <fct>,
#> #   SAFFL <fct>, BMEASIFL <fct>, BEP01FL <fct>, RANDDT <date>, TRTSDTM <dttm>,
#> #   TRTEDTM <dttm>, EOSSTT <fct>, EOTSTT <fct>, EOSDT <date>, EOSDY <int>,
#> #   DCSREAS <fct>, DTHDT <date>, DTHCAUS <fct>, DTHCAT <fct>, LDDTHELD <int>,
#> #   LDDTHGR1 <fct>, LSTALVDT <date>, DTHADY <int>, study_duration_secs <dbl>

This makes some of the documentation (such as vignettes on data level with multiple datasets) super-super long.

Please also check the other print methods in the child classes - we might have made this mistake there as well.

gogonzo commented 2 years ago

Vignettes are fixed https://github.com/insightsengineering/teal.data/issues/65

I'm moving the issue to the backlog, as the request refers to a change in the code

nikolas-burkoff commented 2 years ago

Do we really care about both the head and tail when printing? The row numbers in the tail part are just confusing...

Could we change:

print(head(as.data.frame(self$get_raw_data())))
if (self$get_nrow() > 6) {
  cat("\n...\n")
  print(tail(self$get_raw_data()))
}

to something like this?

print(head(as.data.frame(self$get_raw_data())))
if (self$get_nrow() > 6) {
  cat("\n...\n")
}

donyunardi commented 2 years ago

I agree to make this simpler and just print the head.

chlebowa commented 2 years ago

Is this issue dead?

Two points:

  1. There is an inconsistency here: the raw data (which is a tibble) is converted to data.frame for the head call, while tail is called directly on the tibble. Printing a tibble drops columns to fit neatly in the console; it's the endless wrapping of a data.frame that causes the output to be (potentially) very long.

  2. The confusing row numbers result from the fact that the raw data is a tibble and the print method for tibble ignores row names and just shows integer indices for the displayed subset. This is not the case for data.frame:

    
    > iris[50:51, ]
       Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
    50            5         3.3          1.4         0.2     setosa
    51            7         3.2          4.7         1.4 versicolor

    > tibble::tibble(iris)[50:51, ]
    # A tibble: 2 x 5
      Sepal.Length Sepal.Width Petal.Length Petal.Width Species
             <dbl>       <dbl>        <dbl>       <dbl> <fct>
    1            5         3.3          1.4         0.2 setosa
    2            7         3.2          4.7         1.4 versicolor
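To illustrate both points, here is a minimal base R sketch. It assumes that converting the raw data to a data.frame once, before calling both head and tail, is acceptable; `raw` is only a stand-in for `self$get_raw_data()` and is not part of the package.

```r
# `raw` is a hypothetical stand-in for self$get_raw_data()
# Convert once so that head() and tail() both operate on a data.frame
raw <- as.data.frame(iris)

# data.frame subsetting keeps the original row names,
# so the tail part shows the positions in the full data set
rownames(head(raw))
#> [1] "1" "2" "3" "4" "5" "6"
rownames(tail(raw))
#> [1] "145" "146" "147" "148" "149" "150"
```

With a consistent class the row labels are no longer confusing, although the data.frame print still wraps wide data over many lines.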

I presume the intention was to obtain something like what data.table does:

> data.table::as.data.table(iris)
     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
  1:          5.1         3.5          1.4         0.2    setosa
  2:          4.9         3.0          1.4         0.2    setosa
  3:          4.7         3.2          1.3         0.2    setosa
  4:          4.6         3.1          1.5         0.2    setosa
  5:          5.0         3.6          1.4         0.2    setosa
 ---                                                            
146:          6.7         3.0          5.2         2.3 virginica
147:          6.3         2.5          5.0         1.9 virginica
148:          6.5         3.0          5.2         2.0 virginica
149:          6.2         3.4          5.4         2.3 virginica
150:          5.9         3.0          5.1         1.8 virginica

Here is how this happens:

toprint = rbind(head(toprint, topn + isTRUE(class)),
                `---` = "", tail(toprint, topn))
rownames(toprint) = format(rownames(toprint), justify = "right")

Note the actual row names are dropped in data.table. The numbers above are also ad-hoc indices, but they refer to observations in the data set rather than in the print output, as they do for tibbles.
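The same idea can be sketched in base R without data.table. This is not data.table's implementation, only an approximation of the `rbind` trick quoted above; `topn`, `idx` and `formatted` are names chosen here for illustration.

```r
# Sketch: head/tail print with a "---" separator, using base R only
x <- as.data.frame(iris)
topn <- 5L

# Positions of the first and last `topn` observations in the full data set
idx <- c(seq_len(topn), seq.int(to = nrow(x), length.out = topn))

# Format as character so that an empty separator row can be added;
# rownames.force = TRUE keeps the observation indices as row names
formatted <- as.matrix(format(x[idx, , drop = FALSE]), rownames.force = TRUE)

toprint <- rbind(
  formatted[seq_len(topn), , drop = FALSE],
  `---` = "",
  formatted[topn + seq_len(topn), , drop = FALSE]
)

print(toprint, quote = FALSE, right = TRUE)
```

The sketch assumes the data set has more than `2 * topn` rows; a real print method would need to handle shorter data separately.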

Also,

print(head(as.data.frame(self$get_raw_data())))
if (self$get_nrow() > 6) {
  cat("\n...\n")
  print(tail(self$get_raw_data()))
}

may result in rows being printed twice if there are more than 6 but fewer than 12 rows.
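A small sketch of the overlap, and of one possible way to avoid it if both head and tail were kept. `rows_to_show` is a hypothetical helper, not something that exists in teal.data.

```r
# With 8 rows, head() covers rows 1-6 and tail() covers rows 3-8,
# so rows 3 to 6 would be printed twice
intersect(head(1:8), tail(1:8))
#> [1] 3 4 5 6

# Hypothetical helper: row indices to display, with duplicates removed
rows_to_show <- function(nrows, n = 6L) {
  all_rows <- seq_len(nrows)
  unique(c(head(all_rows, n), tail(all_rows, n)))
}

rows_to_show(8)
#> [1] 1 2 3 4 5 6 7 8
rows_to_show(20)
#> [1]  1  2  3  4  5  6 15 16 17 18 19 20
```

Printing only the head, as proposed earlier in the thread, sidesteps the problem entirely.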

nikolas-burkoff commented 2 years ago

So I think we should do https://github.com/insightsengineering/teal.data/issues/68#issuecomment-1227456012