ropensci / gendercoder

Creating R package to code free text gender responses
https://docs.ropensci.org/gendercoder/

genderRecode adds additional columns (with duplicate names) in some implementations #21

Closed ekothe closed 3 years ago

ekothe commented 5 years ago

Running genderRecode multiple times results in duplicate versions of the recoded column.

library(gendercodeR)
#> Welcome to the genderCodeR package
#> 
#> This package attempts to remove typos from free text gender data
#> The defaults that we used are specific to our context and your data may be
#> different.We offer two categorisations, board and narrow both are opinionated
#> about how gender descriptors collapse into categories as these are cultrally
#> specific they may not be suitiable for your data. In particularly the narrow
#> setting makes opinionated choices about some responses that we want to
#> acknowledge are potentially problematic.
#>       In particular,
#>         * In 'narrow' coding intersex responses are recoded as 'sex and gender
#>           diverse'
#>         * In 'narrow' responses where people indicate they are trans and
#>           indicate their identified gender are recoded as the identified gender
#>           (e.g. 'Male to Female' is recoded as Female). We wish to acknowledge
#>           that this may not reflect how some individuals would classify
#>           themselves when given these categories and in some contexts may make
#>           systematic errors. The broad coding dictionary attempts to avoid these
#>           issues as much as possible - however users can provide a custom
#>           dictionary to add to or overwrite our coding decisions if they feel
#>           this is more appropriate. We welcome people to update the inbuilt
#>           dictionary where desired responses are missing.
#>         * The 'broad' coding seperates out those who identify as trans
#>           female/male or cis female/male into seperate categories it should not
#>           be assumed that all people who discribe as male/female are cis, if you
#>           are assessing trans status we recommend a two part question see:
#> 
#>           Bauer, Greta & Braimoh, Jessica & Scheim, Ayden & Dharma, Christoffer.
#>           (2017).
#>           Transgender-inclusive measures of sex/gender for population surveys:
#>           Mixed-methods evaluation and recommendations.
#>           PLoS ONE. 12.

df <- data.frame(stringsAsFactors=FALSE,
                 gender = c("male", "MALE", "mle", "I am male", "femail", "female", "enby"),
                 age = c(34L, 37L, 77L, 52L, 68L, 67L, 83L)
)

df
#>      gender age
#> 1      male  34
#> 2      MALE  37
#> 3       mle  77
#> 4 I am male  52
#> 5    femail  68
#> 6    female  67
#> 7      enby  83

df <- genderRecode(input=df,
                              genderColName = "gender", 
                              method = "broad",
                              outputColName = "gender2", 
                              missingValuesObjectName = NA,
                              customDictionary = NULL)
#> 
#> The following responses were not auto-recoded. The raw responses
#>         have been carried over to the recoded column 
#>  
#> # A tibble: 1 x 2
#> # Groups:   responses [1]
#>   responses     n
#>   <fct>     <int>
#> 1 i am male     1
df
#>      gender age    gender2
#> 1      male  34       male
#> 2      MALE  37       male
#> 3       mle  77       male
#> 4 I am male  52  i am male
#> 5    femail  68     female
#> 6    female  67     female
#> 7      enby  83 non-binary

df <- genderRecode(input=df,
                   genderColName = "gender", 
                   method = "broad",
                   outputColName = "gender2", 
                   missingValuesObjectName = NA,
                   customDictionary = NULL)
#> 
#> The following responses were not auto-recoded. The raw responses
#>         have been carried over to the recoded column 
#>  
#> # A tibble: 1 x 2
#> # Groups:   responses [1]
#>   responses     n
#>   <fct>     <int>
#> 1 i am male     1
df
#>      gender age    gender2    gender2
#> 1      male  34       male       male
#> 2      MALE  37       male       male
#> 3       mle  77       male       male
#> 4 I am male  52  i am male  i am male
#> 5    femail  68     female     female
#> 6    female  67     female     female
#> 7      enby  83 non-binary non-binary

Created on 2018-12-14 by the reprex package (v0.2.0).

Lingtax commented 5 years ago

As far as I can see it, the problem probably emerges from the core method in the genderRecode function.

As it presently operates, the function essentially:

  1. Splits off the gender input column
  2. Left joins the input vector to the relevant dictionary
  3. cbinds the recoded column to the original dataframe
  4. Renames the new column (by index) to the new name.

We might resolve this problem by:

  1. Changing the methods by which the recoded data is added to the dataframe (elements 1:3), or
  2. Implementing a check of clashes between the names in the input data frame and the output name and either forcing an error, allowing user input to make an amendment, or doing an automatic correction (e.g. adding a suffix).

Obviously 2 is easier than 1, but 1 is ultimately "better" for efficiency of the function and predictability.
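Option 2 could be sketched roughly as below. This is only an illustration: `resolve_output_name()` and its `on_clash` argument are hypothetical names, not part of the package, and the suffix behaviour comes from base R's `make.unique()`.

```r
# Hypothetical helper (not in gendercodeR): decide what to do when the
# requested output column name already exists in the input data frame.
resolve_output_name <- function(data, output_col,
                                on_clash = c("error", "suffix")) {
  on_clash <- match.arg(on_clash)

  # No clash: use the requested name unchanged
  if (!output_col %in% names(data)) {
    return(output_col)
  }

  if (on_clash == "error") {
    stop("Column '", output_col, "' already exists in the input data frame.",
         call. = FALSE)
  }

  # make.unique() appends ".1", ".2", ... to later duplicates, so the last
  # element is a name that does not clash with any existing column
  candidates <- make.unique(c(names(data), output_col))
  candidates[length(candidates)]
}

df <- data.frame(gender = c("male", "enby"),
                 gender2 = c("male", "non-binary"),
                 stringsAsFactors = FALSE)

resolve_output_name(df, "gender3")
#> [1] "gender3"
resolve_output_name(df, "gender2", on_clash = "suffix")
#> [1] "gender2.1"
```

With the default `on_clash = "error"`, the second call would stop with an error instead of silently producing a duplicate column.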

ekothe commented 5 years ago

Thanks @Lingtax I agree that 1 would be a better long term fix.

Lingtax commented 5 years ago

@njtierney do you have any suggestions? I think the join could directly happen on the original table and that would make it more efficient. I'm unsure about the best way to deal with the name clashes. If the input and output column names match, delete the input column before the join?

ekothe commented 5 years ago

@Lingtax Isn't the join using the input column? If so it can't delete it before the join.
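One possible way around this, sketched here under assumptions: read the input column first, look it up with `match()`, and then assign the result by name. Assigning with `[[<-` overwrites an existing column instead of appending a duplicate, so nothing has to be deleted before the lookup. The toy `dictionary` and the `recode_by_name()` function are hypothetical and only mimic the behaviour shown in the original report.

```r
# Toy two-column dictionary (assumed shape, not the package's real data)
dictionary <- data.frame(
  typo    = c("male", "mle", "femail", "female", "enby"),
  recoded = c("male", "male", "female", "female", "non-binary"),
  stringsAsFactors = FALSE
)

# Hypothetical replacement for the cbind-and-rename steps
recode_by_name <- function(data, input_col, output_col) {
  # Read the input column before anything is modified
  key <- tolower(data[[input_col]])
  idx <- match(key, dictionary$typo)
  recoded <- dictionary$recoded[idx]

  # Carry unmatched (lower-cased) responses over to the recoded column
  recoded[is.na(idx)] <- key[is.na(idx)]

  # Assignment by name overwrites an existing column of the same name
  data[[output_col]] <- recoded
  data
}

df <- data.frame(stringsAsFactors = FALSE,
                 gender = c("male", "MALE", "mle", "I am male",
                            "femail", "female", "enby"),
                 age = c(34L, 37L, 77L, 52L, 68L, 67L, 83L))

once  <- recode_by_name(df, "gender", "gender2")
twice <- recode_by_name(once, "gender", "gender2")
names(twice)
#> [1] "gender"  "age"     "gender2"
```

Because the key is extracted before the assignment, the same function also works when the input and output names are identical.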

njtierney commented 5 years ago

OK so I haven't looked through the source code, this may (or may not) be helpful:

spelling_list <- list(
  mael = "male",
  mail = "male",
  enby = "non-binary",
  femail = "female",
  female = "female",
  femael = "female",
  male = "male"
)

df <- data.frame(stringsAsFactors=FALSE,
                 gender = c("male", "MALE", "mle", "I am male", "femail", "female", "enby"),
                 age = c(34L, 37L, 77L, 52L, 68L, 67L, 83L)
)

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

# look up list
spelling_list[df$gender]
#> $male
#> [1] "male"
#> 
#> $<NA>
#> NULL
#> 
#> $<NA>
#> NULL
#> 
#> $<NA>
#> NULL
#> 
#> $femail
#> [1] "female"
#> 
#> $female
#> [1] "female"
#> 
#> $enby
#> [1] "non-binary"

df %>%
  mutate(recoded_gender = spelling_list[gender])
#>      gender age recoded_gender
#> 1      male  34           male
#> 2      MALE  37           NULL
#> 3       mle  77           NULL
#> 4 I am male  52           NULL
#> 5    femail  68         female
#> 6    female  67         female
#> 7      enby  83     non-binary

# turn it into a function that handles non-matches

recode_gender <- function(x){
  recoded_list <- spelling_list[x]
  recoded_list
}

recode_gender(df$gender)
#> $male
#> [1] "male"
#> 
#> $<NA>
#> NULL
#> 
#> $<NA>
#> NULL
#> 
#> $<NA>
#> NULL
#> 
#> $femail
#> [1] "female"
#> 
#> $female
#> [1] "female"
#> 
#> $enby
#> [1] "non-binary"

# need to handle non-matches - perhaps using approach from syn (https://github.com/ropenscilabs/syn/blob/master/R/syn.R#L19)

Created on 2018-12-14 by the reprex package (v0.2.1)

Non-matches need to be handled in a different way.

I think duplicate names can be handled in the new features of tibble 2.0.0 https://www.tidyverse.org/articles/2018/11/tibble-2.0.0-pre-announce/

Also, is the data returned supposed to be a grouped_df?

njtierney commented 5 years ago

So I was a bit rushed with my last suggestion, I think I understand the problem better now. This was an interesting problem to think about, and I ended up spending a bit of time working on a different approach - what follows is a bit of a lengthy explanation for how I would approach this problem, which might not suit what you had planned, so please feel free to do your own thing! :)

My solution to this would be to sidestep the issue by providing two types of functions:

  1. One that takes an input and a dictionary (with a default set) and returns the recoded gender
  2. One that adds a column with a set name to the dataset

This means that you avoid the issue of name clashes by getting the user to decide the name (option 1), or they can use the syntactic sugar of option 2 and get the default name.

The way I implemented this is a little bit different to the join approach in gendercodeR - I used a list of dictionary names instead of a join. Lists are quite fast in R, but I'm not sure if this is any faster or better than your current join approach.

So first there is the setup of the data:

df <- data.frame(stringsAsFactors=FALSE,
                 gender = c("male", "MALE", "mle", "I am male", "femail", "female", "enby"),
                 age = c(34L, 37L, 77L, 52L, 68L, 67L, 83L)
)

This is the example dictionary I have provided

example_dictionary <- list(
  mael = "male",
  mail = "male",
  enby = "non-binary",
  femail = "female",
  female = "female",
  femael = "female",
  male = "male"
)

This is how you can use the list to get the returned names.

# look up list
example_dictionary[df$gender]
#> $male
#> [1] "male"
#> 
#> $<NA>
#> NULL
#> 
#> $<NA>
#> NULL
#> 
#> $<NA>
#> NULL
#> 
#> $femail
#> [1] "female"
#> 
#> $female
#> [1] "female"
#> 
#> $enby
#> [1] "non-binary"

And this is how I could use this in dplyr:

suppressPackageStartupMessages(library(dplyr))
# use in dplyr
df %>%
  mutate(recoded_gender = example_dictionary[gender])
#>      gender age recoded_gender
#> 1      male  34           male
#> 2      MALE  37           NULL
#> 3       mle  77           NULL
#> 4 I am male  52           NULL
#> 5    femail  68         female
#> 6    female  67         female
#> 7      enby  83     non-binary

This doesn't handle NULL cases like gendercodeR does, so let's handle that:

# return which items in the list are missing
which_is_na <- function(x){
  which(is.na(names(x)))
}

# turn it into a function that handles non-matches
recode_gender <- function(x, 
                          dictionary = example_dictionary){ # you would set this to point to your dictionary
  recoded_list <- dictionary[x]
  # replace missing values with inputs
  recoded_list[which_is_na(recoded_list)] <- x[which_is_na(recoded_list)]
  # return the values of the named list
  purrr::flatten_chr(recoded_list)
}

Now the user can use this directly

# use directly
recode_gender(df$gender)
#> [1] "male"       "MALE"       "mle"        "I am male"  "female"    
#> [6] "female"     "non-binary"

Or they could use it in dplyr

# use in dplyr
df %>%
  mutate(recoded_gender = recode_gender(gender))
#>      gender age recoded_gender
#> 1      male  34           male
#> 2      MALE  37           MALE
#> 3       mle  77            mle
#> 4 I am male  52      I am male
#> 5    femail  68         female
#> 6    female  67         female
#> 7      enby  83     non-binary
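Note that in the output above "MALE" is carried over unchanged, because the lookup is case-sensitive. A minimal base-R variant is sketched below, assuming that lower-casing the key before lookup is wanted (as the original genderRecode output suggests). `recode_gender_chr()` is a hypothetical name, and the dictionary is stored as a named character vector instead of a list so that purrr is not needed.

```r
# Same toy dictionary as above, as a named character vector
example_dictionary <- c(
  mael = "male",
  mail = "male",
  enby = "non-binary",
  femail = "female",
  female = "female",
  femael = "female",
  male = "male"
)

# Hypothetical case-insensitive variant of recode_gender()
recode_gender_chr <- function(x, dictionary = example_dictionary) {
  # Look up on the lower-cased response
  key <- tolower(x)
  # Names not present in the dictionary come back as NA
  recoded <- unname(dictionary[key])
  # Replace non-matches with the original input
  unmatched <- is.na(recoded)
  recoded[unmatched] <- x[unmatched]
  recoded
}

recode_gender_chr(c("male", "MALE", "mle", "I am male",
                    "femail", "female", "enby"))
#> [1] "male"       "male"       "mle"        "I am male"  "female"    
#> [6] "female"     "non-binary"
```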

But sometimes it is nice to be able to add the column more directly, here is a way to have that syntactic sugar in a dplyr-style way:

# add some syntactic sugar as an "add_" function. ------------------------------
add_recoded_gender <- function(.data, x){
  # capture input of x
  var <- rlang::enquo(x)

  .data %>%
    dplyr::mutate(.recoded_gender = recode_gender(!!var))
}

Which looks like this:

df %>%
  add_recoded_gender(gender)
#>      gender age .recoded_gender
#> 1      male  34            male
#> 2      MALE  37            MALE
#> 3       mle  77             mle
#> 4 I am male  52       I am male
#> 5    femail  68          female
#> 6    female  67          female
#> 7      enby  83      non-binary

Created on 2018-12-14 by the reprex package (v0.2.1)

Some other notes on implementing this approach: custom dictionaries can be specified as lists by the user. To record information about which names were not matched, you could look into the problems() approach that readr uses.
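As a rough illustration of that idea (loosely modelled on it, and not readr's actual implementation), the unmatched responses could be attached to the result as an attribute and read back with a small accessor. Both `recode_with_log()` and `unmatched_responses()` are hypothetical names.

```r
# Hypothetical: recode and remember which responses were not matched
recode_with_log <- function(x, dictionary) {
  recoded <- unname(dictionary[x])
  unmatched <- is.na(recoded)
  recoded[unmatched] <- x[unmatched]

  # Count each distinct unmatched response
  missed <- x[unmatched]
  responses <- unique(missed)
  n <- tabulate(match(missed, responses), nbins = length(responses))

  # Store the summary alongside the recoded vector
  attr(recoded, "unmatched") <- data.frame(response = responses, n = n,
                                           stringsAsFactors = FALSE)
  recoded
}

# Hypothetical accessor, in the spirit of problems()
unmatched_responses <- function(x) {
  attr(x, "unmatched")
}

dictionary <- c(male = "male", femail = "female", enby = "non-binary")
res <- recode_with_log(c("male", "I am male", "mle", "mle", "enby"),
                       dictionary)
unmatched_responses(res)
#>    response n
#> 1 I am male 1
#> 2       mle 2
```

This keeps the return value a plain character vector, so it can still be used directly or inside mutate(), while the message about unrecoded responses becomes something the user asks for.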

njtierney commented 5 years ago

Just wanted to update that I looked into the speed of this approach and it looks like using the list lookup is about 10-50x faster than using a join.
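For anyone who wants to check this on their own machine, a rough timing sketch follows. It makes several assumptions: base `merge()` stands in for the join (the package used a left join), the dictionary is a toy one, and every input value has a match. Timings vary by machine and data, so no figures are shown here.

```r
set.seed(1)

# Toy dictionary and a large input in which every value has a match
dictionary <- list(male = "male", femail = "female",
                   female = "female", enby = "non-binary")
x <- sample(names(dictionary), 1e5, replace = TRUE)

# Approach 1: list lookup by name
list_time <- system.time(
  via_list <- unlist(dictionary[x], use.names = FALSE)
)

# Approach 2: join against a two-column dictionary data frame
dict_df <- data.frame(gender = names(dictionary),
                      recoded = unlist(dictionary, use.names = FALSE),
                      stringsAsFactors = FALSE)
input_df <- data.frame(id = seq_along(x), gender = x,
                       stringsAsFactors = FALSE)
join_time <- system.time({
  joined <- merge(input_df, dict_df, by = "gender", all.x = TRUE,
                  sort = FALSE)
  # merge() does not promise to keep row order, so restore it by id
  via_join <- joined$recoded[order(joined$id)]
})

# Both approaches should agree before the timings are compared
identical(via_list, via_join)
#> [1] TRUE

list_time["elapsed"]
join_time["elapsed"]
```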

ekothe commented 5 years ago

I've refactored using this approach in the flight-of-fancy branch.

However, since this does not maintain backwards compatibility with the previous approach, I'd love your opinions on this @rhydwyn @jlbeaudry @fsingletonthorn - I think it's stronger and as @njtierney notes it also has speed improvements.

fsingletonthorn commented 5 years ago

Seems good to me - that's a very neat trick @njtierney and it looks v. well implemented @ekothe!

njtierney commented 5 years ago

Glad you liked it. Your implementation looks great!

One thing is that you might want to consider using snake_case internally - e.g., here: https://github.com/ropenscilabs/gendercoder/blob/a376d394fe6e3aa6f0a159ced019e30f552b6601/R/recode_gender.R#L53

Have you considered submitting this for software review to rOpenSci? I think this would be a great package to submit.

ekothe commented 5 years ago

Good pickup @njtierney, that's what happens when you copy things from SO without making sure the styling is right :)

Wanting to submit for rOpenSci review is part of what is motivating me to do the refactoring, although I think we'd also need to think about changing the package name to all lowercase.

jlbeaudry commented 5 years ago

Looks good to me, Emily. I think you're right about changing the package name, too. Consistency is pretty key. Let me know if you want some help with testing it out on some of the data sets.

Lingtax commented 3 years ago

This issue should now be closed, as the original problem is gone. You should probs add @njtierney as a ctb given he came up with a brilliant solve to the core function!