Closed brendanjodowd closed 2 years ago
Here is a sample workflow of what I would like to be able to do. Let's say I have a product inventory dataset called "supplies". Different products belong to different categories. I would like to calculate the number of each product, as well as the number in each category, and the total number across all categories.
category | product | count |
---|---|---|
Paint | Red | 20 |
Paint | White | 100 |
Paint | Green | 30 |
Tools | Brush | 100 |
Tools | Roller | 80 |
Tools | Tray | 70 |
I am going to use an imaginary function called factor_fxn( )
which allows me to apply a label like "Tools" to several products.
summary_statistics <- supplies %>%
mutate(my_factor= factor_fxn(product , labels =
"Red" ~ "Red paint",
"White" ~ "White paint",
"Green" ~ "Green paint",
"Brush" ~ "Brush",
"Roller" ~ "Roller",
"Tray" ~ "Tray",
"Red", "White", "Green" ~ "Paint",
"Brush", "Roller", "Tray" ~ "Tools",
"Red","White","Green","Brush","Roller","Tray" ~ "All products" )) %>%
group_by(my_factor) %>%
summarise(Total = sum(count))
And this gives me the output that I wanted. More neatly, perhaps, I might have defined only the labels "Paint", "Tools" and "All products", which would output just the last three rows in the following.
my_factor | Total |
---|---|
Red paint | 20 |
White paint | 100 |
Green paint | 30 |
Brush | 100 |
Roller | 80 |
Tray | 70 |
Paint | 150 |
Tools | 250 |
All products | 400 |
In my imagined function, the list of products belonging to each group could be written as above (comma separated, quotation marks for strings), or as a vector, so that a vector could be calculated prior to the factor definition and slotted in as one of the ranges which make up a group label. For example you might use product_list <- unique(supplies$product)
, and then slot this vector into the factor definition for "All products" instead of writing "Red","White","Green","Brush","Roller","Tray"
.
Here is another example where my imaginary function can apply labels to ranges of numeric values. I have a dataset of employee wages called "salaries". I want to calculate the average salary within three different salary bands, as well as the average for all employees.
employee | salary |
---|---|
Mary | 60000 |
Tim | 15000 |
Joe | 45000 |
Ellen | 40000 |
Steve | 18000 |
summary_statistics <- salaries %>%
mutate(income_level = factor_fxn(salary, labels =
0 -< 20000 ~ "Low",
20000 -< 50000 ~ "Medium",
50000 - high ~ "High",
0 - high ~ "All income levels" )) %>%
group_by(income_level) %>%
summarise(avg_salary = mean(salary))
There is some extra imaginary syntax used in the above to distinguish between ranges that are "up to and including" and "up to and excluding". I also use the syntax high
to indicate to R that there is no upper bound to this range. This is taken from SAS, which uses: -
, <-
, -<
, <-<
, low
, high
.
This would be the output:
income_level | avg_salary |
---|---|
Low | 16500 |
Medium | 42500 |
High | 60000 |
All income levels | 35600 |
SAS also has other
which captures any values not already assigned a label, and missing
which captures missing values. These are both really handy. In the example below I'm imagining another syntax option all
.
students %>%
mutate(nationality = factor_fxn(country_code, labels =
"USA" ~ "United States",
"CAN", "MEX" ~ "Other North America",
missing ~ "Unknown",
other ~ "All other countries",
all ~ "All countries of origin" ))
What value should be stored in a cell which is created by factor_fxn( )
? My sense is that this is non-trivial. Consider a column subject
which contains maths
, english
etc., and let's say that we convert this into a factor, using factor_fxn()
, wherein maths
is assigned as part of a group of subjects to the label STEM
but also assigned to Science
as part of another (overlapping) group of subjects. Maybe english
is also assigned to two labels, Languages
and Humanities
. Both maths
and english
are furthermore assigned to All subjects
. But neither is assigned to itself, i.e. there is no line in the factor definition that reads: "english" ~ "English"
or "maths" ~ "Mathematics"
So what should be shown in the column subject
if we view the dataframe? My guess is that the only option is the unformatted text, i.e. maths
, english
. Otherwise there would be a complicated representation of the different groups/labels of which these subjects are members. As such, it would not be possible to perform operations on the label representation of the subject, e.g. you could not use filter(subject == "STEM")
, and it would still be possible to filter maths
using the string pattern "mat"
, for example. This is somewhat disappointing, but I can't think of a convenient way around it.
However, when the factor is used with a summary function, the result must contain the label strings as values. E.g. if we count the number of students in each subject (where subject is converted to a factor as described above), then the line for STEM
should be composed of the string "STEM"
only, and constituent subjects such as maths
, physics
etc. may at that point be safely forgotten. I think!
It seems like you want this to create new rows? I think that unfortunately puts it out of scope for forcats, since there's no way that modifying a factor could affect all the other variables in the data frame.
I agree that it is out of scope for forcats, but just wanted to mention that I think partially overlapping levels is an interesting and potentially very useful data structure. It’s similar to (the IMHO underused) ‘shingles’ data structure in the ‘lattice’ package: https://www.rdocumentation.org/packages/lattice/versions/0.17-25/topics/shingles
Hi! Oftentimes when I use
summarise( )
over a category I also need a total, e.g. males, females and total. Sometimes I require subtotals for things like NACE industry sectors and NUTS regions. I have a hard time doing this in R, if I want a total line I have to run thesummarise( )
twice, with and without the relevant category, and then bind the two sets of results. If I want several subtotals I have to runsummarise( )
once for each subgroup, and bind each resulting dataset. In SAS, totals and subtotals are accomplished using multilabel formats, which, as well as allowing the user to create labels for ranges of values, allows you to assign values to more than one label. E.g. "M" belongs to "Male" and "F" belongs to "Female", but "M" and "F" both belong to "Both sexes". Is there any scope for multilabel/multilevel factors in R, which would then allow the creation of summary statistics for groups of categories? The syntax for creating multilabel formats in SAS is not unlike the syntax used incase_when( )
.