Support specifying a subset of rows in `GT.data_color()`

jrycw commented 1 month ago

Related to #248.

This PR aims to support specifying a subset of rows in GT.data_color(). Based on the discussions, it appears that the most common use case involves excluding the last row, which is typically the total row. Consequently, rows that are not selected should not be included in the color "calculation." In this PR, I first select only the specified columns and rows, then "re-map" the "calculated color" back to the original table.

A simple example might illustrate the concept well:

import polars as pl
import polars.selectors as cs
from great_tables import GT

df = pl.DataFrame({"a": [1, 2, 3, 4, 5, 200], "b": [51, 52, 53, 54, 55, 200]})

GT(df).data_color(columns=cs.numeric(), rows=pl.col("a").lt(6), palette="PuBu")

GT(df).data_color(columns=cs.numeric(), rows=pl.col("a").lt(300), palette="PuBu")

Bonus

In addition, I have added support for using Polars-style column and row selection. This allows for more flexible operations, such as:

GT(df).data_color(columns=cs.starts_with("b"), rows=pl.col("a").lt(6), palette="PuBu")

codecov-commenter commented 1 month ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 86.08%. Comparing base (e3649d2) to head (a683850). Report is 20 commits behind head on main.

:exclamation: Current head a683850 differs from pull request most recent head 610542e

Please upload reports for the commit 610542e to get more accurate results.

Additional details and impacted files

```diff @@ Coverage Diff @@ ## main #364 +/- ## ========================================== + Coverage 85.90% 86.08% +0.17% ========================================== Files 41 41 Lines 4328 4326 -2 ========================================== + Hits 3718 3724 +6 + Misses 610 602 -8 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

jrycw commented 1 month ago

Please hold off on merging this PR for now. I've discovered some personal notes in my Jupyter playbook explaining why I wrote the code this way. It appears that the logic between these two implementations may differ. I'll conduct further investigation soon.

jrycw commented 1 month ago

I think both approaches should yield the same result. However, a more interesting question arises: what should users expect if we include the rows parameter? For instance, if rows = [1, 3, 4], should we first calculate the palette for the entire column and then retrieve the corresponding colors for those rows, or should we calculate the palette based solely on the selected rows? Both cases might not look aesthetically pleasing, and I doubt their practical application.

Perhaps what we really need is an option like include_last_row=True or include_last_row=False (the parameter name might not be optimal, but you get the idea). I've implemented alternative logic in the last commit, although this logic might need adjustments in the future if we introduce something like GT.summary_rows(), which could potentially affect more than one row.

import polars as pl
import polars.selectors as cs
from great_tables import GT

df = pl.DataFrame(
    {
        "a": [1, 2, 3, 4, 5, sum([1, 2, 3, 4, 5])],
        "b": [51, 52, 53, 54, 55, sum([51, 52, 53, 54, 55])],
    }
)

GT(df).data_color(columns=cs.numeric(), include_last_row=True, palette="PuBu")

GT(df).data_color(columns=cs.numeric(), include_last_row=False, palette="PuBu")

machow commented 1 month ago

should we first calculate the palette for the entire column and then retrieve the corresponding colors for those rows, or should we calculate the palette based solely on the selected rows?

@rich-iannone do you know what happens in the gt R package? AFAICT data_color there doesn't use excluded rows, and the rationale is that this let'd you exclude extremely large values etc..

I would lean toward not using excluded in the calculations. Then, if people want to use all rows, they can always set the domain= argument, to specify the exact range for the color values.

rich-iannone commented 1 month ago

should we first calculate the palette for the entire column and then retrieve the corresponding colors for those rows, or should we calculate the palette based solely on the selected rows?

@rich-iannone do you know what happens in the gt R package? AFAICT data_color there doesn't use excluded rows, and the rationale is that this let'd you exclude extremely large values etc..

I would lean toward not using excluded in the calculations. Then, if people want to use all rows, they can always set the domain= argument, to specify the exact range for the color values.

The R package doesn't use the excluded rows in the calculations or colorization. You're right that the domain argument approach is better for specifying the exact range. If we did include all the rows in the calculation, it might be surprising to users and I don't think it would be generally desired behavior through an extra argument.

jrycw commented 1 month ago

Thanks @machow and @rich-iannone . Would 43b0ce5 achieve our goal, or are there any other modifications I should make?

rich-iannone commented 1 month ago

I think so. Just reverting the last commit would make this right.

posit-dev / great-tables

Support specifying a subset of rows in `GT.data_color()` #364

Bonus

Codecov Report