hope-data-science / tidyfst

Tidy Verbs for Fast Data Manipulation
https://hope-data-science.github.io/tidyfst/
Other
96 stars 8 forks source link

col_max() that can be used with dplyr::mutate() #26

Closed michaelaoash closed 6 months ago

michaelaoash commented 6 months ago

col_max() is a useful function that, given a data frame with numeric columns, returns a data frame that includes a new character column holding the name of the numeric column with the largest value in each row. While it is not documented, col_max() appears to ignore and retain existing character columns from the old data frame in the returned new data frame.

For example, given a dataset of the shares of heating types by Census tract with a character identifier of the tract, it might be useful to know the most prevalent heating type:

          GEOID    heat_util_gas    heat_electricity
1   25013810901       44.9122807           32.631579
2   25013811400       28.6046512           63.604651
3   25013800300       61.5200479            7.839617
4   25013801402       50.5154639           18.041237
5   25013812201       47.8042086           34.903934
> census_data %>% select(GEOID, heat_util_gas, heat_electricity) %>% tidyfst::col_max()
           GEOID    heat_util_gas    heat_electricity          max_col
  1: 25013810901         44.91228           32.631579    heat_util_gas
  2: 25013811400         28.60465           63.604651 heat_electricity
  3: 25013800300         61.52005            7.839617    heat_util_gas
  4: 25013801402         50.51546           18.041237    heat_util_gas
  5: 25013812201         47.80421           34.903934    heat_util_gas

(Electric heat is the most prevalent in observation 2; gas heat is the most prevalent in the other 4 Census tracts.)

It would be helpful to have the col_max() configured (1) to take a specified list of columns rather than all the numeric columns in the supplied data frame (for example, there may be additional columns in this dataset describing features of the Census tract that have nothing to do with heat source); and (2) to return the column holding the name of the column holding the maximum value, rather than the entire data frame with the new column appended. In effect, it would be helpful to be able to use col_max() in mutate() and on only a specified set of numeric columns. I hope that I have expressed the issue clearly. Please follow up if I can clarify. Thanks.

hope-data-science commented 6 months ago

Thank you for the feedback, can you provide a data and the desired outputs? I think I get your points but I am not sure it would be 100% correct. I'll work on it right away.

hope-data-science commented 6 months ago

I don't know if I can give you the exact way to do it. But if you want the things you mentioned above, I think you can first select the columns you want using the super fuction select_dt, and then using the current col_max (I am not sure if I can make it work in mutate-like function).

michaelaoash commented 6 months ago

Thank you and these are very helpful. I was not aware of select_dt(), which is an excellent function. The challenge is that I would like to take a data frame that has many numeric columns and run col_max() on only a selection of the columns while preserving the rest of the data frame.

In the example below, the intent is that the column owner_occupied_rt should be ignored by col_max(), which should return only the larger of heat_util_gas_rt or heat_electricity_rt, but owner_occupied_rt should be included in the output dataset.

            GEOID owner_occupied_rt heat_util_gas_rt heat_electricity_rt   col_max
  1: 250138004002         58.536585        82.926829           17.073171    heat_util_gas_rt
  2: 250138001012         46.250000        55.416667           23.333333    heat_util_gas_rt
  3: 250138015013         96.551724        51.379310            6.551724   owner_occupied_rt
  4: 250138021016         59.557344        83.098592            0.000000    heat_util_gas_rt
  5: 250138132062         74.267101        63.843648           13.029316   owner_occupied_rt
 ---                                                                                        
332: 250138137011         85.955056         4.494382           20.224719   owner_occupied_rt
333: 250138002021         80.249110        33.985765           21.174377   owner_occupied_rt
334: 250138016031         66.666667        73.634652           13.182674    heat_util_gas_rt
335: 250138131013         83.613917        28.956229            3.815937   owner_occupied_rt
336: 250138116002          3.846154        23.076923           60.946746 heat_electricity_rt
hope-data-science commented 6 months ago

I think I get the point, so far, try to use the function below:

library(tidyfst)
col_max2 = function(.data,...,.name = "max_col"){
  dt = as.data.table(.data)
  dt[0] %>% select_dt(...) %>% names() -> dot_string
  if(setequal(dot_string,names(dt))){
    dt[, (.name) := colnames(.SD)[max.col(.SD, ties.method = "first")],
       .SDcols = is.numeric][]
  }else{
    dt[, (.name) := colnames(.SD)[max.col(.SD, ties.method = "first")],
       .SDcols = dot_string][]
  }
}

In the "..." part, put the columns you want to select in any way you want as long as it can be used in select_dt.

I'll update it in the next version of package, which might take some time. Thanks.

michaelaoash commented 6 months ago

Thank you. This is exactly what I had in mind.