tidyverse / forcats

🐈🐈🐈🐈: tools for working with categorical variables (factors)
https://forcats.tidyverse.org/
Other
554 stars 126 forks source link

Feature request: Multilabel/multilevel factors #282

Closed brendanjodowd closed 2 years ago

brendanjodowd commented 3 years ago

Hi! Oftentimes when I use summarise( ) over a category I also need a total, e.g. males, females and total. Sometimes I require subtotals for things like NACE industry sectors and NUTS regions. I have a hard time doing this in R, if I want a total line I have to run the summarise( ) twice, with and without the relevant category, and then bind the two sets of results. If I want several subtotals I have to run summarise( ) once for each subgroup, and bind each resulting dataset. In SAS, totals and subtotals are accomplished using multilabel formats, which, as well as allowing the user to create labels for ranges of values, allows you to assign values to more than one label. E.g. "M" belongs to "Male" and "F" belongs to "Female", but "M" and "F" both belong to "Both sexes". Is there any scope for multilabel/multilevel factors in R, which would then allow the creation of summary statistics for groups of categories? The syntax for creating multilabel formats in SAS is not unlike the syntax used in case_when( ).

brendanjodowd commented 3 years ago

Here is a sample workflow of what I would like to be able to do. Let's say I have a product inventory dataset called "supplies". Different products belong to different categories. I would like to calculate the number of each product, as well as the number in each category, and the total number across all categories.

category product count
Paint Red 20
Paint White 100
Paint Green 30
Tools Brush 100
Tools Roller 80
Tools Tray 70

I am going to use an imaginary function called factor_fxn( ) which allows me to apply a label like "Tools" to several products.

summary_statistics <- supplies %>% 
  mutate(my_factor= factor_fxn(product , labels = 
    "Red" ~ "Red paint",
    "White" ~ "White paint",
    "Green" ~ "Green paint",
    "Brush" ~ "Brush",
    "Roller" ~ "Roller",
    "Tray" ~ "Tray",
    "Red", "White", "Green" ~ "Paint",
    "Brush", "Roller", "Tray" ~ "Tools",
    "Red","White","Green","Brush","Roller","Tray" ~ "All products" )) %>%
  group_by(my_factor) %>%
  summarise(Total = sum(count))

And this gives me the output that I wanted. More neatly, perhaps, I might have defined only the labels "Paint", "Tools" and "All products", which would output just the last three rows in the following.

my_factor Total
Red paint 20
White paint 100
Green paint 30
Brush 100
Roller 80
Tray 70
Paint 150
Tools 250
All products 400

In my imagined function, the list of products belonging to each group could be written as above (comma separated, quotation marks for strings), or as a vector, so that a vector could be calculated prior to the factor definition and slotted in as one of the ranges which make up a group label. For example you might use product_list <- unique(supplies$product), and then slot this vector into the factor definition for "All products" instead of writing "Red","White","Green","Brush","Roller","Tray".


Here is another example where my imaginary function can apply labels to ranges of numeric values. I have a dataset of employee wages called "salaries". I want to calculate the average salary within three different salary bands, as well as the average for all employees.

employee salary
Mary 60000
Tim 15000
Joe 45000
Ellen 40000
Steve 18000
summary_statistics <- salaries %>% 
  mutate(income_level = factor_fxn(salary, labels =  
    0 -< 20000 ~ "Low",
    20000 -< 50000 ~ "Medium",
    50000 - high ~ "High",
    0 - high ~ "All income levels" )) %>%
  group_by(income_level) %>%
  summarise(avg_salary = mean(salary))

There is some extra imaginary syntax used in the above to distinguish between ranges that are "up to and including" and "up to and excluding". I also use the syntax high to indicate to R that there is no upper bound to this range. This is taken from SAS, which uses: - , <- , -< , <-< , low , high. This would be the output:

income_level avg_salary
Low 16500
Medium 42500
High 60000
All income levels 35600

SAS also has other which captures any values not already assigned a label, and missing which captures missing values. These are both really handy. In the example below I'm imagining another syntax option all.

students %>%
  mutate(nationality = factor_fxn(country_code, labels =
    "USA" ~ "United States",
    "CAN", "MEX" ~ "Other North America",
    missing ~ "Unknown",
    other ~ "All other countries",
    all ~ "All countries of origin" ))  
brendanjodowd commented 3 years ago

What value should be stored in a cell which is created by factor_fxn( )? My sense is that this is non-trivial. Consider a column subject which contains maths, english etc., and let's say that we convert this into a factor, using factor_fxn(), wherein maths is assigned as part of a group of subjects to the label STEM but also assigned to Science as part of another (overlapping) group of subjects. Maybe english is also assigned to two labels, Languages and Humanities. Both maths and english are furthermore assigned to All subjects. But neither is assigned to itself, i.e. there is no line in the factor definition that reads: "english" ~ "English" or "maths" ~ "Mathematics"

So what should be shown in the column subject if we view the dataframe? My guess is that the only option is the unformatted text, i.e. maths, english. Otherwise there would be a complicated representation of the different groups/labels of which these subjects are members. As such, it would not be possible to perform operations on the label representation of the subject, e.g. you could not use filter(subject == "STEM"), and it would still be possible to filter maths using the string pattern "mat", for example. This is somewhat disappointing, but I can't think of a convenient way around it.

However, when the factor is used with a summary function, the result must contain the label strings as values. E.g. if we count the number of students in each subject (where subject is converted to a factor as described above), then the line for STEM should be composed of the string "STEM" only, and constituent subjects such as maths, physics etc. may at that point be safely forgotten. I think!

hadley commented 2 years ago

It seems like you want this to create new rows? I think that unfortunately puts it out of scope for forcats, since there's no way that modifying a factor could affect all the other variables in the data frame.

huftis commented 2 years ago

I agree that it is out of scope for forcats, but just wanted to mention that I think partially overlapping levels is an interesting and potentially very useful data structure. It’s similar to (the IMHO underused) ‘shingles’ data structure in the ‘lattice’ package: https://www.rdocumentation.org/packages/lattice/versions/0.17-25/topics/shingles