r-lib / pillar

Format columns with colour
https://pillar.r-lib.org/

Feature request: Option to print both head and tail of tables? #651

Open DarwinAwardWinner opened 1 year ago

DarwinAwardWinner commented 1 year ago

The S4Vectors package from Bioconductor implements an S4 class called DataFrame (which exists to allow S4 vectors as data frame columns, I believe). One of the nice features of this class is that when printing, it shows both the first and last few rows of the data frame, e.g.:

library(dplyr)
library(S4Vectors)
as(arrange(mtcars, cyl), "DataFrame")
#> DataFrame with 32 rows and 11 columns
#>                        mpg       cyl      disp        hp      drat        wt
#>                  <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
#> Datsun 710            22.8         4     108.0        93      3.85     2.320
#> Merc 240D             24.4         4     146.7        62      3.69     3.190
#> Merc 230              22.8         4     140.8        95      3.92     3.150
#> Fiat 128              32.4         4      78.7        66      4.08     2.200
#> Honda Civic           30.4         4      75.7        52      4.93     1.615
#> ...                    ...       ...       ...       ...       ...       ...
#> AMC Javelin           15.2         8       304       150      3.15     3.435
#> Camaro Z28            13.3         8       350       245      3.73     3.840
#> Pontiac Firebird      19.2         8       400       175      3.08     3.845
#> Ford Pantera L        15.8         8       351       264      4.22     3.170
#> Maserati Bora         15.0         8       301       335      3.54     3.570
#>                       qsec        vs        am      gear      carb
#>                  <numeric> <numeric> <numeric> <numeric> <numeric>
#> Datsun 710           18.61         1         1         4         1
#> Merc 240D            20.00         1         0         4         2
#> Merc 230             22.90         1         0         4         2
#> Fiat 128             19.47         1         1         4         1
#> Honda Civic          18.52         1         1         4         2
#> ...                    ...       ...       ...       ...       ...
#> AMC Javelin          17.30         0         0         3         2
#> Camaro Z28           15.41         0         0         3         4
#> Pontiac Firebird     17.05         0         0         3         2
#> Ford Pantera L       14.50         0         1         5         4
#> Maserati Bora        14.60         0         1         5         8

Created on 2023-09-28 with reprex v2.0.2

Would it be possible to implement this as an option in pillar, at least for tables whose tail is easily accessible (i.e. probably not tables representing database queries)? Overall I prefer the formatting of pillar, but often seeing both the head and tail of a table is useful, because if the table is sorted by a particular column, it may not be clear from just the head that this column varies, e.g.:

library(dplyr)
print(as_tibble(arrange(mtcars, cyl)))
#> # A tibble: 32 × 11
#>      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1  22.8     4 108      93  3.85  2.32  18.6     1     1     4     1
#>  2  24.4     4 147.     62  3.69  3.19  20       1     0     4     2
#>  3  22.8     4 141.     95  3.92  3.15  22.9     1     0     4     2
#>  4  32.4     4  78.7    66  4.08  2.2   19.5     1     1     4     1
#>  5  30.4     4  75.7    52  4.93  1.62  18.5     1     1     4     2
#>  6  33.9     4  71.1    65  4.22  1.84  19.9     1     1     4     1
#>  7  21.5     4 120.     97  3.7   2.46  20.0     1     0     3     1
#>  8  27.3     4  79      66  4.08  1.94  18.9     1     1     4     1
#>  9  26       4 120.     91  4.43  2.14  16.7     0     1     5     2
#> 10  30.4     4  95.1   113  3.77  1.51  16.9     1     1     5     2
#> # ℹ 22 more rows


As for implementation, I imagine either a logical option to include the tail, in which case the number of rows to be printed would be split equally between head and tail, or else a fraction between 0 and 1 indicating the desired split of rows between head and tail. But maybe you have better ideas.
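To make the fraction variant concrete, here is a rough sketch of how such a setting could be turned into head and tail row indices. The helper name `split_head_tail()` and its arguments are made up for illustration; nothing like this exists in pillar as far as this thread shows.

```r
# Hypothetical helper (not part of pillar): decide which rows to show when
# `n` printed rows are split between head and tail according to `tail_frac`.
split_head_tail <- function(nrow_x, n = 10L, tail_frac = 0.5) {
  stopifnot(tail_frac >= 0, tail_frac <= 1)
  rows <- seq_len(nrow_x)
  # The whole table fits: show everything, no tail section needed.
  if (nrow_x <= n) {
    return(list(head = rows, tail = integer(0)))
  }
  n_tail <- floor(n * tail_frac)
  n_head <- n - n_tail
  list(
    head = utils::head(rows, n_head),
    tail = utils::tail(rows, n_tail)
  )
}

idx <- split_head_tail(nrow(mtcars), n = 10, tail_frac = 0.5)
rbind(mtcars[idx$head, ], mtcars[idx$tail, ])
```

With this shape, the logical option is just the special case `tail_frac = 0.5` (tail on) versus `tail_frac = 0` (current behaviour).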

DarwinAwardWinner commented 1 year ago

I had a look through the code to see if I could implement this myself, but there were a few too many layers of indirection for me to follow. If you can point me to the appropriate place in the code, I can try implementing this when I have time.

krlmlr commented 1 year ago

Thanks. The prt package implements output in this way; see, for example, the snapshot tests at https://github.com/nbenn/prt/blob/main/tests/testthat/_snaps/format.md

CC @nbenn.

DarwinAwardWinner commented 1 year ago

Interesting. So it looks like I could potentially define my own print method for data frames and/or tibbles that calls prt::format_dt. Is there an easy way to determine if a given tibble's backend supports efficient random access so that I can avoid trying to e.g. get the tail of a database query result?

krlmlr commented 1 year ago

None that I'm aware of. Perhaps you could implement some heuristics? Happy to review if you'd be willing to share an implementation.
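One possible heuristic, as a sketch only: treat a table as having a cheap tail when it is an in-memory data frame and does not carry one of the lazy-table classes that dbplyr uses. The function name `has_cheap_tail()` is hypothetical, and other backends may need additional checks.

```r
# Hypothetical heuristic (not an API of pillar or tibble): assume the tail is
# cheap to access only for in-memory data frames.
has_cheap_tail <- function(x) {
  # dbplyr's lazy tables carry the classes "tbl_lazy"/"tbl_sql" and are not
  # data frames, so either check alone would already exclude them.
  is.data.frame(x) && !inherits(x, c("tbl_lazy", "tbl_sql"))
}

has_cheap_tail(mtcars)
#> [1] TRUE

# Stand-in object with a lazy-table class, to avoid needing a database here.
has_cheap_tail(structure(list(), class = c("tbl_lazy", "tbl")))
#> [1] FALSE
```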

DarwinAwardWinner commented 1 year ago

I will definitely share if I figure it out. Do you have any opinions on how the options should be set up?

DarwinAwardWinner commented 1 year ago

A minimal implementation for tibbles, meant to be put in ~/.Rprofile:

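print.tbl <- function(x, width = NULL, ..., n = NULL, max_extra_cols = NULL, max_footer_lines = NULL) {
    tryCatch({
        # Halve n so that head and tail together show roughly n rows
        n_half <- if (!is.null(n)) ceiling(n / 2)
        prt:::cat_line(prt:::format_dt(x = x, ..., n = n_half, width = width, max_extra_cols = max_extra_cols, max_footer_lines = max_footer_lines))
    # function(e) keeps `...` bound to print.tbl()'s own dots; with \(...) the
    # condition object would be forwarded to the fallback as an extra argument
    }, error = function(e) pillar:::print.tbl(x = x, width = width, ..., n = n, max_extra_cols = max_extra_cols, max_footer_lines = max_footer_lines))
    invisible(x)
}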

I also came up with something for base data frames, but I print them using the aforementioned S4Vectors code, since the dplyr/pillar stuff doesn't print row names, which can't be ignored for base data frames.

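print.data.frame <- function(x, ...) {
    tryCatch({
        # Temporarily set max.print for the S4Vectors display
        withr::with_options(
            list(max.print = ncol(x) * 15),
            S4Vectors:::.show_DataFrame(x)
        )
    # function(e) keeps `...` bound to this method's own arguments; \(...) would
    # forward the condition object to base::print.data.frame() instead
    }, error = function(e) base::print.data.frame(x = x, ...))
    invisible(x)
}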