Closed michaelaoash closed 6 months ago
Thank you for the feedback, can you provide a data and the desired outputs? I think I get your points but I am not sure it would be 100% correct. I'll work on it right away.
I don't know if I can give you the exact way to do it. But if you want the things you mentioned above, I think you can first select the columns you want using the super fuction select_dt
, and then using the current col_max
(I am not sure if I can make it work in mutate
-like function).
Thank you and these are very helpful. I was not aware of select_dt(), which is an excellent function. The challenge is that I would like to take a data frame that has many numeric columns and run col_max() on only a selection of the columns while preserving the rest of the data frame.
In the example below, the intent is that the column owner_occupied_rt should be ignored by col_max(), which should return only the larger of heat_util_gas_rt or heat_electricity_rt, but owner_occupied_rt should be included in the output dataset.
GEOID owner_occupied_rt heat_util_gas_rt heat_electricity_rt col_max
1: 250138004002 58.536585 82.926829 17.073171 heat_util_gas_rt
2: 250138001012 46.250000 55.416667 23.333333 heat_util_gas_rt
3: 250138015013 96.551724 51.379310 6.551724 owner_occupied_rt
4: 250138021016 59.557344 83.098592 0.000000 heat_util_gas_rt
5: 250138132062 74.267101 63.843648 13.029316 owner_occupied_rt
---
332: 250138137011 85.955056 4.494382 20.224719 owner_occupied_rt
333: 250138002021 80.249110 33.985765 21.174377 owner_occupied_rt
334: 250138016031 66.666667 73.634652 13.182674 heat_util_gas_rt
335: 250138131013 83.613917 28.956229 3.815937 owner_occupied_rt
336: 250138116002 3.846154 23.076923 60.946746 heat_electricity_rt
I think I get the point, so far, try to use the function below:
library(tidyfst)
col_max2 = function(.data,...,.name = "max_col"){
dt = as.data.table(.data)
dt[0] %>% select_dt(...) %>% names() -> dot_string
if(setequal(dot_string,names(dt))){
dt[, (.name) := colnames(.SD)[max.col(.SD, ties.method = "first")],
.SDcols = is.numeric][]
}else{
dt[, (.name) := colnames(.SD)[max.col(.SD, ties.method = "first")],
.SDcols = dot_string][]
}
}
In the "..." part, put the columns you want to select in any way you want as long as it can be used in select_dt
.
I'll update it in the next version of package, which might take some time. Thanks.
Thank you. This is exactly what I had in mind.
col_max() is a useful function that, given a data frame with numeric columns, returns a data frame that includes a new character column holding the name of the numeric column with the largest value in each row. While it is not documented, col_max() appears to ignore and retain existing character columns from the old data frame in the returned new data frame.
For example, given a dataset of the shares of heating types by Census tract with a character identifier of the tract, it might be useful to know the most prevalent heating type:
(Electric heat is the most prevalent in observation 2; gas heat is the most prevalent in the other 4 Census tracts.)
It would be helpful to have the col_max() configured (1) to take a specified list of columns rather than all the numeric columns in the supplied data frame (for example, there may be additional columns in this dataset describing features of the Census tract that have nothing to do with heat source); and (2) to return the column holding the name of the column holding the maximum value, rather than the entire data frame with the new column appended. In effect, it would be helpful to be able to use col_max() in mutate() and on only a specified set of numeric columns. I hope that I have expressed the issue clearly. Please follow up if I can clarify. Thanks.