[Feature Request]: Standardized results datasets

Feature description

Use case: it’s useful to be able to output tables or plots to a flat / rectangular dataset structure to enable QC via double programming. For very large tables or plots, manual comparison of numbers is not feasible.

Some wishlist ideas to be refined further…

Add some documentation within a vignette or cheatsheet for best practices
rtables::as_result_df() creates a data.frame which includes list columns.
- What is the best way to flatten them (e.g. tidyr::unnest_longer())?
- Should there be an additional utility to do this?
- What type of dataset should a NULL report generate?
- Be careful of column name format: SAS programmers reading in the results dataset will get an error if the column names are not SAS-friendly (at most 32 characters; the first character must be a letter or an underscore; subsequent characters can be letters, digits, or underscores).
Statistics functions in tern / tern.s should return labelled vectors. That way when cells with multiple values are un-nested, the labels can be used to distinguish what each statistic represents.
Results datasets should also be available for all plots. Extracting data from grobs / ggplot2 objects can be tricky and time-consuming.
Configure results dataset generation with chevron (should a dataset be generated, where should it be saved, etc.).

Code of Conduct

[X] I agree to follow this project's Code of Conduct.

Contribution Guidelines

[X] I agree to follow this project's Contribution Guidelines.

Security Policy

[X] I agree to follow this project's Security Policy.

Add some documentation within a vignette or cheatsheet for best practices

Do you have a general document on QC best practices that we could include? I don't know what the best practices are here either, and writing the vignette w/o a clear reference is not advisable imo.

rtables::as_result_df() creates a data.frame which includes list columns.

What is the best way to flatten them (e.g. tidyr::unnest_longer())?

Should there be an additional utility to do this?

I think as_result_df produces a data.frame in which all the table columns are list columns, plus extra information about the rows. Should we add an option to drop that row information and keep only the columns?
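On the flattening question, here is one possible shape for such a utility, as a base R sketch. `flatten_list_col()` is a hypothetical helper, and the toy data frame only mimics a result with a list column; it is not real rtables output and the values are made up.

```r
# Toy stand-in for a results data frame with one list column (made-up values).
res <- data.frame(row_name = c("n", "Mean (SD)"), stringsAsFactors = FALSE)
res$ARM_A <- list(c(n = 120), c(mean = 34.2, sd = 6.1))

# Hypothetical helper: emit one output row per value stored in each cell of `col`.
flatten_list_col <- function(df, col) {
  cells <- df[[col]]
  lens <- lengths(cells)
  # Use the names inside each cell as statistic labels; NA where a cell is unnamed.
  stat <- unlist(lapply(cells, function(x) {
    if (is.null(names(x))) rep(NA_character_, length(x)) else names(x)
  }))
  # Repeat the identifying columns once per value in the corresponding cell.
  out <- df[rep(seq_len(nrow(df)), lens), setdiff(names(df), col), drop = FALSE]
  out$column <- col
  out$stat_name <- stat
  out$value <- unlist(cells, use.names = FALSE)
  rownames(out) <- NULL
  out
}

flat <- flatten_list_col(res, "ARM_A")
print(flat)
```

This only works cleanly when the cell values carry names, which is why the labelling point further down matters for un-nesting.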

What type of dataset should a NULL report generate?

@BFalquet do you already have something in place for this?

Be careful of column name format: SAS programmers reading in the results dataset will get an error if the column names are not SAS-friendly (at most 32 characters; the first character must be a letter or an underscore; subsequent characters can be letters, digits, or underscores).

All of these are respected except the character limit, which I do not see the need for nowadays given current machines and UTF-8 standards.

Statistics functions in tern / tern.s should return labelled vectors. That way when cells with multiple values are un-nested, the labels can be used to distinguish what each statistic represents.

Maybe @edelarua knows best. I think {tern} already works like this, but once the results are tabulated the direct relation is lost, i.e. it is only maintained through the inserted labels and so on.

Results datasets should also be available for all plots. Extracting data from grobs / ggplot2 objects can be tricky and time-consuming.

This could be achieved by attaching metadata to the pdf, maybe? Extracting data from ggplot objects can be done with ggplot_build(*)$data[[1]]. Do we have something in the pipeline regarding this @BFalquet?

Configure results dataset generation with chevron (should a dataset be generated, where should it be saved, etc.).

I think this makes total sense, and it should be in place to some extent. Right, @BFalquet?

Regarding your previous point @Melkiades, for the NULL report, the current behavior is:

> null_report

#> ——————————————————————————————————————————————————————
#>   Null Report: No observations met the reporting criteria for inclusion in this output.

> as_result_df(null_report)
#>  avar_name row_name row_num is_group_summary node_class
#> 1                          1            FALSE    DataRow
#>                                                                               cellvals
#> 1 Null Report: No observations met the reporting criteria for inclusion in this output.

We haven't decided on a solution for the plot data yet. You could change the plotting function to attach the data to the resulting object (not to the pdf) and then create a getter function (conveniently called as_result_df, for instance). I think we should revisit this question once all plots are based on ggplot2. In that case, maybe the layer_data function would be enough. What do you think?
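For reference, a small sketch of the two ggplot2 extraction routes mentioned in this thread, using the built-in `mtcars` data as a stand-in for a real plot. It assumes a single-layer plot; plots with several layers would need one call per layer.

```r
library(ggplot2)

# Stand-in plot; any ggplot object would do.
p <- ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()

# Route 1: build the plot and take the computed data of the first layer.
built <- ggplot_build(p)$data[[1]]

# Route 2: layer_data() returns the computed data for layer i directly.
layer1 <- layer_data(p, i = 1)

# Columns are named after aesthetics (x, y, ...), not the original variables.
head(layer1[, c("x", "y")])
```

Because the columns are named after aesthetics and hold values after any stat or scale transformation, a getter built on this would probably still need to map them back to the source variable names before the dataset is useful for QC.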

Statistics functions in tern / tern.s should return labelled vectors. That way when cells with multiple values are un-nested, the labels can be used to distinguish what each statistic represents.

Maybe @edelarua knows best. I think {tern} already works like this, but once the results are tabulated the direct relation is lost, i.e. it is only maintained through the inserted labels and so on.

@Melkiades, most (probably all) tern statistic functions currently return named lists of statistics, which I think should be fine.

@Melkiades please see my reply here:

Do you have a general document on QC best practices that we could include? I don't know what the best practices are here either, and writing the vignette w/o a clear reference is not advisable imo.

What I had in mind here was something very simple. Just a vignette example showing what functions can be used to extract the results data from a plot or a table that users can quickly review and share.

For full background info I shared the Roche QC guidelines on the NEST SME gDrive since those are confidential. Feel free to message me directly with any questions.

I think as_result_df produces a data.frame in which all the table columns are list columns, plus extra information about the rows. Should we add an option to drop that row information and keep only the columns?

This is not as important as some of the other things on this feature request.

All of these are respected except the character limit, which I do not see the need for nowadays given current machines and UTF-8 standards.

I was thinking of something simple, like a warning similar to the names_repair option in tidyr::unnest: just give a warning or note if column names don’t conform to the criteria above, and offer an option to repair them.

Right now the column labels in rtables are very flexible. In the example below, column “A+C” is a valid name for rtables, but the exported result could not be read in by SAS. It would be nice to warn users in such cases before the results dataset is created.

library(rtables)

result <- basic_table() %>%
  add_overall_col("A+C") %>%
  analyze("AGE") %>%
  build_table(DM)

as_result_df(result)

> as_result_df(result)
  avar_name row_name row_num is_group_summary node_class      A+C
1       AGE     Mean       1            FALSE    DataRow 34.22191
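A minimal base R sketch of the kind of check described above, assuming the three naming rules quoted in this thread. `check_sas_names()` and its `repair` argument are hypothetical names, not part of rtables or tidyr.

```r
# Hypothetical helper: warn about (and optionally repair) column names that break
# the rules quoted in this thread: at most 32 characters, first character a letter
# or underscore, remaining characters letters, digits, or underscores.
check_sas_names <- function(nms, repair = FALSE) {
  ok <- grepl("^[A-Za-z_][A-Za-z0-9_]{0,31}$", nms)
  if (all(ok)) {
    return(nms)
  }
  warning(
    "Column names not SAS-friendly: ", paste(nms[!ok], collapse = ", "),
    call. = FALSE
  )
  if (!repair) {
    return(nms)
  }
  # Replace disallowed characters, guard a leading digit, then truncate.
  fixed <- gsub("[^A-Za-z0-9_]", "_", nms)
  fixed <- ifelse(grepl("^[0-9]", fixed), paste0("_", fixed), fixed)
  substr(fixed, 1, 32)
}

cols <- c("avar_name", "row_name", "A+C")
repaired <- suppressWarnings(check_sas_names(cols, repair = TRUE))
print(repaired)
```

The sketch does not handle empty names or duplicates created by the repair (for example "A+C" and "A-C" both becoming "A_C"), so a real implementation would need a de-duplication step that still respects the length limit.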

Statistics functions in tern / tern.s should return labelled vectors. That way when cells with multiple values are un-nested, the labels can be used to distinguish what each statistic represents.

There is definitely room for improvement. Here is a basic example showing that labels are only occasionally present. I think all tern functions return named lists, but here I'm asking for named vectors. Note how only s_summary.numeric has the vector labels. With the labels, it would be clear that in the categorical summaries for the ARM variable the first number is the count and the second is the percent.

> # Summary of numeric vector has name
> s_summary(DM$AGE)$n
  n 
356 
> 
> # Summary of categorical vector is not named
> s_summary(DM$ARM)$n
[1] 356
> 
> result <- basic_table() %>%
+   analyze_vars(vars = c("AGE", "ARM")) %>%
+   build_table(DM) %>%
+   as_result_df()

[Screenshot attached: Screen Shot 2024-02-07 at 1.30.16 PM]
