Closed ekothe closed 3 years ago
As far as I can see it, the problem probably emerges from the core method in the genderRecode function.
As it presently operates, the function essentially:
We might resolve this problem by:
Obviously 2 is easier than 1, but 1 is ultimately "better" for efficiency of the function and predictability.
Thanks @Lingtax, I agree that 1 would be a better long-term fix.
@njtierney do you have any suggestions? I think the join could directly happen on the original table and that would make it more efficient. I'm unsure about the best way to deal with the name clashes. If the input and output column names match, delete the input column before the join?
@Lingtax Isn't the join using the input column? If so it can't delete it before the join.
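To make the name-clash question concrete, here is a minimal base R sketch of a lookup that behaves like a left join on a unique key but writes to a named output column, so running it again overwrites that column instead of appending a duplicate. The names `dictionary_df`, `recode_by_lookup` and `recoded_gender` are hypothetical and are not taken from the package source.

```r
# Hypothetical two-column lookup table standing in for the package dictionary
dictionary_df <- data.frame(
  typed   = c("mael", "femail", "enby", "male", "female"),
  recoded = c("male", "female", "non-binary", "male", "female"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: look up `input` in the dictionary and write the result
# to `output`, overwriting any existing column of that name
recode_by_lookup <- function(data, input, output = "recoded_gender") {
  # read the input values first, so it is safe for output to equal input
  keys <- data[[input]]
  matched <- dictionary_df$recoded[match(keys, dictionary_df$typed)]
  # fall back to the original value where there is no dictionary entry
  data[[output]] <- ifelse(is.na(matched), keys, matched)
  data
}

df <- data.frame(gender = c("male", "femail", "mle"), stringsAsFactors = FALSE)
once  <- recode_by_lookup(df, "gender")
twice <- recode_by_lookup(once, "gender")
names(twice)
```

Because the input values are read before the output column is assigned, nothing has to be deleted ahead of the lookup, even when the input and output names match.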
OK so I haven't looked through the source code, this may (or may not) be helpful:
spelling_list <- list(
mael = "male",
mail = "male",
enby = "non-binary",
femail = "female",
female = "female",
femael = "female",
male = "male"
)
df <- data.frame(stringsAsFactors=FALSE,
gender = c("male", "MALE", "mle", "I am male", "femail", "female", "enby"),
age = c(34L, 37L, 77L, 52L, 68L, 67L, 83L)
)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
# look up list
spelling_list[df$gender]
#> $male
#> [1] "male"
#>
#> $<NA>
#> NULL
#>
#> $<NA>
#> NULL
#>
#> $<NA>
#> NULL
#>
#> $femail
#> [1] "female"
#>
#> $female
#> [1] "female"
#>
#> $enby
#> [1] "non-binary"
df %>%
mutate(recoded_gender = spelling_list[gender])
#> gender age recoded_gender
#> 1 male 34 male
#> 2 MALE 37 NULL
#> 3 mle 77 NULL
#> 4 I am male 52 NULL
#> 5 femail 68 female
#> 6 female 67 female
#> 7 enby 83 non-binary
# turn it into a function that handles non-matches
recode_gender <- function(x){
recoded_list <- spelling_list[x]
recoded_list
}
recode_gender(df$gender)
#> $male
#> [1] "male"
#>
#> $<NA>
#> NULL
#>
#> $<NA>
#> NULL
#>
#> $<NA>
#> NULL
#>
#> $femail
#> [1] "female"
#>
#> $female
#> [1] "female"
#>
#> $enby
#> [1] "non-binary"
# need to handle non-matches - perhaps using approach from syn (https://github.com/ropenscilabs/syn/blob/master/R/syn.R#L19)
Created on 2018-12-14 by the reprex package (v0.2.1)
Non-matches need to be handled in a different way.
I think duplicate names can be handled in the new features of tibble 2.0.0 https://www.tidyverse.org/articles/2018/11/tibble-2.0.0-pre-announce/
Also, is the data returned supposed to be a grouped_df?
So I was a bit rushed with my last suggestion, I think I understand the problem better now. This was an interesting problem to think about, and I ended up spending a bit of time working on a different approach - what follows is a bit of a lengthy explanation for how I would approach this problem, which might not suit what you had planned, so please feel free to do your own thing! :)
My solution to this would be to sidestep the issue by providing two types of functions
This means that you avoid the issue of name clashes by getting the user to decide the name (option 1), or they can use the syntactic sugar of option 2 and get the default name.
The way I implemented this is a little bit different to the join approach in gendercodeR: I used a list of dictionary names instead of a join. Lists are quite fast in R, but I'm not sure if this is any faster or better than your current join approach.
So first there is the setup of the data:
df <- data.frame(stringsAsFactors=FALSE,
gender = c("male", "MALE", "mle", "I am male", "femail", "female", "enby"),
age = c(34L, 37L, 77L, 52L, 68L, 67L, 83L)
)
This is the example dictionary I have provided
example_dictionary <- list(
mael = "male",
mail = "male",
enby = "non-binary",
femail = "female",
female = "female",
femael = "female",
male = "male"
)
This is how you can use the list to get the returned names.
# look up list
example_dictionary[df$gender]
#> $male
#> [1] "male"
#>
#> $<NA>
#> NULL
#>
#> $<NA>
#> NULL
#>
#> $<NA>
#> NULL
#>
#> $femail
#> [1] "female"
#>
#> $female
#> [1] "female"
#>
#> $enby
#> [1] "non-binary"
And this is how I could use this in dplyr:
suppressPackageStartupMessages(library(dplyr))
# use in dplyr
df %>%
mutate(recoded_gender = example_dictionary[gender])
#> gender age recoded_gender
#> 1 male 34 male
#> 2 MALE 37 NULL
#> 3 mle 77 NULL
#> 4 I am male 52 NULL
#> 5 femail 68 female
#> 6 female 67 female
#> 7 enby 83 non-binary
This doesn't handle NULL cases like gendercodeR does, so let's handle that:
# return which items in the list are missing
which_is_na <- function(x){
  which(is.na(names(x)))
}
# turn it into a function that handles non-matches
recode_gender <- function(x,
                          dictionary = example_dictionary){ # you would set this to point to your dictionary
  recoded_list <- dictionary[x]
  # replace missing values with inputs
  recoded_list[which_is_na(recoded_list)] <- x[which_is_na(recoded_list)]
  # return the values of the named list
  purrr::flatten_chr(recoded_list)
}
Now the user can use this directly
# use directly
recode_gender(df$gender)
#> [1] "male" "MALE" "mle" "I am male" "female"
#> [6] "female" "non-binary"
Or they could use it in dplyr
# use in dplyr
df %>%
mutate(recoded_gender = recode_gender(gender))
#> gender age recoded_gender
#> 1 male 34 male
#> 2 MALE 37 MALE
#> 3 mle 77 mle
#> 4 I am male 52 I am male
#> 5 femail 68 female
#> 6 female 67 female
#> 7 enby 83 non-binary
But sometimes it is nice to be able to add the column more directly, here is a way to have that syntactic sugar in a dplyr-style way:
# add some syntactic sugar as an "add_" function. ------------------------------
add_recoded_gender <- function(.data, x){
  # capture input of x
  var <- rlang::enquo(x)
  .data %>%
    dplyr::mutate(.recoded_gender = recode_gender(!!var))
}
Which looks like this:
df %>%
add_recoded_gender(gender)
#> gender age .recoded_gender
#> 1 male 34 male
#> 2 MALE 37 MALE
#> 3 mle 77 mle
#> 4 I am male 52 I am male
#> 5 femail 68 female
#> 6 female 67 female
#> 7 enby 83 non-binary
Created on 2018-12-14 by the reprex package (v0.2.1)
Some other notes on implementing this approach: custom dictionaries can be specified as lists by the user. To record information about which names were not matched, you could look into the problems() approach that readr uses.
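As a rough illustration of recording non-matches, the sketch below attaches the unmatched inputs to the result as an attribute. It is only loosely inspired by the idea of keeping diagnostics alongside the output and is not how readr implements problems(). The names `recode_gender_tracked`, `unmatched_inputs` and the `"unmatched"` attribute are hypothetical.

```r
# Small example dictionary in the same shape as the one used above
example_dictionary <- list(
  mael = "male", femail = "female", enby = "non-binary",
  male = "male", female = "female"
)

# Hypothetical variant of recode_gender() that remembers what it could not match
recode_gender_tracked <- function(x, dictionary = example_dictionary) {
  # unlist() turns the named list into a named character vector
  lookup <- unlist(dictionary)
  recoded <- unname(lookup[x])
  unmatched <- is.na(recoded)
  # keep the original input where there is no dictionary entry
  recoded[unmatched] <- x[unmatched]
  # store the distinct unmatched inputs alongside the result
  attr(recoded, "unmatched") <- unique(x[unmatched])
  recoded
}

# Hypothetical accessor, playing the role that problems() plays for readr
unmatched_inputs <- function(x) attr(x, "unmatched")

result <- recode_gender_tracked(c("male", "MALE", "mle", "femail", "MALE"))
unmatched_inputs(result)
```

One caveat with this design is that attributes on a plain vector can be dropped by later operations such as subsetting, so the unmatched values are best inspected straight after recoding.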
Just wanted to update that I looked into the speed of this approach and it looks like using the list lookup is about 10-50x faster than using a join.
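For anyone wanting to check a comparison like this on their own machine, here is a minimal base R timing sketch. It uses merge() as a stand-in for the join, which is an assumption: the package's join and the original benchmark may have used something different, so the ratio you see will not necessarily match the figure above. All object names are hypothetical.

```r
set.seed(1)
dictionary <- list(mael = "male", femail = "female", enby = "non-binary",
                   male = "male", female = "female")
# 100,000 inputs, all of which have a dictionary entry
x <- sample(names(dictionary), 1e5, replace = TRUE)

# list lookup
lookup_time <- system.time(
  by_list <- unlist(dictionary[x], use.names = FALSE)
)

# join: merge against a two-column data frame version of the dictionary
dictionary_df <- data.frame(typed = names(dictionary),
                            recoded = unlist(dictionary, use.names = FALSE),
                            stringsAsFactors = FALSE)
input_df <- data.frame(row = seq_along(x), typed = x, stringsAsFactors = FALSE)
join_time <- system.time({
  joined <- merge(input_df, dictionary_df, by = "typed", all.x = TRUE)
  # merge() reorders rows, so restore the original order before comparing
  by_join <- joined$recoded[order(joined$row)]
})

# both approaches should agree before the timings are compared
identical(by_list, by_join)
lookup_time["elapsed"]
join_time["elapsed"]
```

A single system.time() call is noisy, so repeating the measurement a few times gives a more trustworthy comparison.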
I've refactored using this approach in the flight-of-fancy branch.
However, since this does not maintain backwards compatibility with the previous approach, I'd love your opinions on this @rhydwyn @jlbeaudry @fsingletonthorn - I think it's stronger and, as @njtierney notes, it also has speed improvements.
Seems good to me - that's a very neat trick @njtierney and it looks v. well implemented @ekothe!
Glad you liked it. Your implementation looks great!
One thing is that you might want to consider using snake_case internally - e.g., here: https://github.com/ropenscilabs/gendercoder/blob/a376d394fe6e3aa6f0a159ced019e30f552b6601/R/recode_gender.R#L53
Have you considered submitting this for software review to rOpenSci? I think this would be a great package to submit.
Good pickup @njtierney, that's what happens when you copy things from SO without making sure the styling is right :)
Wanting to submit for rOpenSci review is part of what is motivating me to do the refactoring. Although I think we'd also need to think about changing the package name to all lowercase.
Looks good to me, Emily. I think you're right about changing the package name, too. Consistency is pretty key. Let me know if you want some help with testing it out on some of the data sets.
This issue should now be closed, as the original problem is gone. You should probs add @njtierney as a ctb given he came up with a brilliant solve to the core function!
Running genderRecode multiple times results in duplicate versions of the recoded column.
Created on 2018-12-14 by the reprex package (v0.2.0).