ropensci / software-review

rOpenSci Software Peer Review.

gendercoder: Recodes Sex/Gender Descriptions Into A Standard Set #435

Closed ekothe closed 2 years ago

ekothe commented 3 years ago

Date accepted: 2022-03-08
Submitting Author Name: Emily Kothe
Submitting Author Github Handle: @ekothe
Other Package Authors Github handles: @jlbeaudry, @fsingletonthorn, @rhydwyn, @Lingtax, @njtierney
Repository: https://github.com/ropenscilabs/gendercoder
Version submitted: 0.0.0.9000
Editor: @ldecicco-USGS
Reviewers: @juliasilge, @elinw

Due date for @juliasilge: 2021-05-19 Due date for @elinw: 2021-05-19

Archive: TBD Version accepted: TBD


Package: gendercoder
Title: Recodes Sex/Gender Descriptions Into A Standard Set
Version: 0.0.0.9000
Authors@R: c(
  person("Jennifer", "Beaudry", role = "aut",
    email = "jbeaudry@swin.edu.au",
    comment = c(ORCID = "https://orcid.org/0000-0003-1596-6708")),
  person("Emily", "Kothe", role = c("aut", "cre"),
    email = "emily.kothe@deakin.edu.au",
    comment = c(ORCID = "https://orcid.org/0000-0003-1210-0554")),
  person("Felix", "Singleton Thorn", role = "aut",
    email = "fsingletonthorn@gmail.com", 
    comment = c(ORCID = "https://orcid.org/0000-0002-0237-6146")),
  person("Rhydwyn", "McGuire", role = "aut",
    email = "rhydwyn@rhydwyn.net"),
  person("Nicholas", "Tierney", 
         role = c("aut"),
         email = "nicholas.tierney@gmail.com", 
         comment = c(ORCID = "https://orcid.org/0000-0003-1460-8722")),
  person("Mathew", "Ling", 
         role = c("aut"),
         email = "mathewtyling@gmail.com", 
         comment = c(ORCID = "https://orcid.org/0000-0002-0940-2538"))       
         )
Description: Provides functions and dictionaries for recoding free-text gender responses into more consistent categories.
Depends: R (>= 3.0.0)
Maintainer: Emily Kothe <emily.kothe@deakin.edu.au>
License: MIT + file LICENSE
Encoding: UTF-8
LazyData: true
RoxygenNote: 7.1.1
Imports: 
    purrr
Suggests: 
    testthat (>= 2.1.0),
    knitr,
    rmarkdown,
    dplyr,
    covr
VignetteBuilder: knitr
URL: https://github.com/ropenscilabs/gendercoder
BugReports: https://github.com/ropenscilabs/gendercoder/issues

Scope

The goal of gendercoder is to allow simple re-coding of free-text gender responses. This is an important data munging task for researchers who collect gender in this format.

Best-practice recommendations for collecting gender information from human participants include collecting some or all responses as free text. This allows participants to provide gender responses when their gender is not captured by researcher-provided categories (e.g. male, female, non-binary), or when gender is collected without researcher-provided categories (e.g. “What is your gender?”). gendercoder contains two built-in output dictionaries: a default broad dictionary, which corrects spelling and standardises terms while maintaining the diversity of responses, and a narrow dictionary, which contains only three gender categories, “male”, “female”, and “sex and gender diverse”.

To our knowledge this is the only package of its kind.

N/A

N/A

Technical checks

Confirm each of the following by checking the box.

This package:

Publication options

MEE Options

- [ ] The package is novel and will be of interest to the broad readership of the journal.
- [ ] The manuscript describing the package is no longer than 3000 words.
- [ ] You intend to archive the code for the package in a long-term repository which meets the requirements of the journal (see [MEE's Policy on Publishing Code](http://besjournals.onlinelibrary.wiley.com/hub/journal/10.1111/(ISSN)2041-210X/journal-resources/policy-on-publishing-code.html))
- (*Scope: Do consider MEE's [Aims and Scope](http://besjournals.onlinelibrary.wiley.com/hub/journal/10.1111/(ISSN)2041-210X/aims-and-scope/read-full-aims-and-scope.html) for your manuscript. We make no guarantee that your manuscript will be within MEE scope.*)
- (*Although not required, we strongly recommend having a full manuscript prepared when you submit here.*)
- (*Please do not submit your package separately to Methods in Ecology and Evolution*)

Code of conduct

noamross commented 3 years ago

Hello @ekothe and team, sorry to take so long to get back to you on this. The editors have gone a couple of rounds discussing this, and we decided that it falls out of scope for us. (Ultimately it comes down to whether this is "data munging of scientific formats".) Lots of packages that we like, and that come out of the unconferences, end up out of scope! We hope you continue with this project.

ekothe commented 3 years ago

Hi @noamross, I wonder if this discussion led to any clarity about what constitutes an in-scope "scientific format" that could be shared somewhere to help similar submissions in the future? Reading the package categories guidance, it seems to us that insofar as gendercoder processes data from "unstructured sources such as text", and is used "...for handling data in specific scientific formats", it complies with the scope for data munging packages.

While the scope of the package is quite narrow, it has somewhat parallel functionality to coder, and we would speculate that, by contrast, the use cases of gendercoder are much wider and central to rOpenSci's mission of creating software tools that lower barriers to working with scientific data and that promote a welcoming and diverse community, given that most studies of human participants collect gender as a key demographic variable.

In this context, I hope you can understand our confusion, and I hope you can provide public guidance on why a package like this is out of scope, to help other submissions avoid similar issues in the future?

noamross commented 3 years ago

Hi @ekothe. This package was certainly an edge case, and it came down to a matter of specificity vs. generality. Our scope for data extraction and munging packages aims to cover workflows arising from specific formats rather than general sources. The reason for this is that we think it's easier to define a "standard" way to handle a specific format or structure, while preferences for workflows tend to be very subjective. Our interpretation in this case was that unstructured survey responses were a general type of data rather than a specific format/source. I think the best improvement we could make to the documentation is to list more illustrative examples, and I will make a note to do so.

However, the editors took another look and found your analogy to the coder package compelling enough to reconsider our evaluation, so we have reversed our decision. Our scope definitions are always evolving and somewhat path-dependent. Thanks for making a clear argument. And as I said, we've always been very supportive of this project and the goals behind it!

noamross commented 3 years ago

@ropensci-review-bot assign @ldecicco-USGS as editor

ropensci-review-bot commented 3 years ago

Assigned! @ldecicco-USGS is now the editor

ekothe commented 3 years ago

Thank you @noamross, I appreciate the update and look forward to working with @ldecicco-USGS as the editor 😄

ldecicco-USGS commented 3 years ago

I will get started on asking around for reviewers today. We usually shoot to get reviewers assigned within 2 weeks, but that's been a struggle this past year (understandably!).

ldecicco-USGS commented 3 years ago

@ropensci-review-bot run goodpractice

ropensci-review-bot commented 3 years ago

Thanks, about to send the query.

ropensci-review-bot commented 3 years ago

Error (500). The gp service is currently unavailable

maelle commented 3 years ago

@ropensci-review-bot run goodpractice

ropensci-review-bot commented 3 years ago

Thanks, about to send the query.

ropensci-review-bot commented 3 years ago

Error (500). The gp service is currently unavailable

ldecicco-USGS commented 3 years ago

Sorry for all the bot messages! I ran the checks/goodpractices and everything looks good:

Preparing: description
Preparing: lintr
Preparing: namespace
Preparing: rcmdcheck
── GP gendercoder ───────────────────────────────────────────────────────────────

It is good practice to

  ✖ avoid long code lines, it is bad for readability. Also,
    many people prefer editor windows that are about 80 characters
    wide. Try make your lines shorter than 80 characters

    inst\app\server.R:28:1
    inst\app\server.R:29:1
    inst\app\server.R:48:1
    inst\app\ui.R:43:1
    inst\app\ui.R:44:1
    ... and 7 more lines

  ✖ avoid sapply(), it is not type safe. It might return a
    vector, or a list, depending on the input data. Consider using
    vapply() instead.

    R\gendercoder_app.R:16:16

I wouldn't worry too much about the long lines, especially since they seem to all be in the shiny app. Let's see if the next "bot" works:

@ropensci-review-bot assign @juliasilge

ldecicco-USGS commented 3 years ago

@ropensci-review-bot assign @juliasilge

ldecicco-USGS commented 3 years ago

@ropensci-review-bot add @juliasilge to reviewers

ropensci-review-bot commented 3 years ago

@juliasilge added to the reviewers list. Review due date is 2021-05-19. Thanks @juliasilge for accepting to review! Please refer to our reviewer guide.

ldecicco-USGS commented 3 years ago

@ropensci-review-bot add @elinw to reviewers

ropensci-review-bot commented 3 years ago

@elinw added to the reviewers list. Review due date is 2021-05-19. Thanks @elinw for accepting to review! Please refer to our reviewer guide.

elinw commented 3 years ago

Looking forward to digging into this, having recently asked an open ended gender question on a survey. Putting this reference here too since it highlights the important issues http://www.stat.columbia.edu/~gelman/research/unpublished/Using_gender_in_surveys.pdf.

elinw commented 3 years ago

Two initial questions.

Why use purrr rather than a hash table? See https://blog.dominodatalab.com/a-quick-benchmark-of-hashtable-implementations-in-r/ for a quick summary. Also https://riptutorial.com/r/example/18339/environments-as-hash-maps
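The environments-as-hash-maps idea linked above could be sketched roughly like this (a minimal illustration with invented dictionary entries, not gendercoder's implementation):

```r
# Minimal sketch of an environment-backed dictionary (invented entries):
dict <- new.env(hash = TRUE)
dict[["m"]] <- "male"
dict[["f"]] <- "female"
dict[["woman"]] <- "female"

# mget() looks up a vector of keys at once, returning NA for misses:
lookup <- function(x, env) {
  unlist(mget(tolower(trimws(x)), envir = env, ifnotfound = NA_character_),
         use.names = FALSE)
}

lookup(c("F", "Woman", "unknown"), dict)  # "female" "female" NA
```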

Should you name your dictionaries in a way that identifies them as English? Using this same pattern people will be able to make (and potentially contribute if you want to accept them) dictionaries in other languages.

For my recent survey I went from

(It also matched perfectly with my recoding.)

maelle commented 3 years ago

@ropensci-review-bot check package

ropensci-review-bot commented 3 years ago

Thanks, about to send the query.

elinw commented 3 years ago

One way to do a hash would be to use the data.table hash from qdapTools.

This didn't give me totally the results I expected, because it returned 1541 missing (as with narrow originally), but exploring options and maybe some pre- or post-processing would probably fix it.

Get the data and create a data frame instead:

value_broad <- sapply(broad, "[[", 1)
value_narrow <- sapply(narrow, "[[", 1)
key_broad <- names(value_broad)
key_narrow <- names(value_narrow)
value_broad <- unname(value_broad)
value_narrow <- unname(value_narrow)
broad_df <- data.frame("key" = key_broad, "value" = value_broad)

My data set is called students and the variable is called gender.

broad_hash <- qdapTools::hash(broad_df)
students$new_gender <- qdapTools::hash_look(students$gender, broad_hash)

Lingtax commented 3 years ago

I have a partially functioning version with the base::environment hash method in a branch on my fork now, I just need to work out how to make the fill argument work again in that format. Assuming that method is the fastest as per the blog you forwarded, @elinw.

The only other hiccup will be supporting users in adding their own dictionaries or augmenting the inbuilts, as I found building the hash table was slightly more frustrating than I expected.


juliasilge commented 3 years ago

Why use purrr rather than a hash table?

I think a really good reason for this is maintainability, ease of use, ease of user contributions (more important in this package than normal), and exactly the frustrations that @Lingtax mentions. I wouldn't necessarily jump to something like a hash table, which is a new idea to many R users (trained not as CS folks), compared to purrr, which is like lapply() but type safe. How big are people's surveys expected to be? What kind of performance gains are we talking about? I worked with surveys with 100k+ free text responses and using something like purrr would not be a bottleneck.

Lingtax commented 3 years ago

Another point is that, IIRC, the core functionality isn't actually purrr-based; I think Nick replaced that with a list-comprehension-style method. purrr is only in there to make the dictionaries consistent to support the matching (i.e. making them all lower case).

elinw commented 3 years ago

I've been looking into this issue exactly and I'll try to share some code in the next couple of days. I think for users you would provide a function that takes a data frame and turns it into a hash.
Of course another method that is not a hash, but still simpler than a list, is to use a named vector. That is something familiar.

Lingtax commented 3 years ago

I think the original format was close to, if not exactly, a named vector; @ekothe might recall why the change was made. Agree that the list is slightly less intuitive than ideal.

elinw commented 3 years ago

Here's some sample code that is not in the form of functions, that illustrate what you could do staying in base R. It's not as efficient as it could. https://gist.github.com/elinw/fc73a5555ea6706a92b437715036093e

In my example I have a variable students$gender and I'm recoding it into students$gender2. Credit for the mapply part to a Stack Overflow link that I can't re-find right now.

Lingtax commented 3 years ago

Thanks Elin. That's similar to what I had, though the fill is closer than mine got. I'm not getting 100% yet, at present the fill in that gist fills using the first value of gender or fails (not a multiple of replacement length).

There's still going to be friction for users augmenting the in-built dictionaries with this pipeline, though. At present users can just concatenate their custom options to the list in the call, e.g. recode_gender(gender, dictionary = c(broad, list("amel" = "male")))

We could move to a hash if that was the right option, but then we would need to make a new hash table after the function is called, or update the old one (I have no idea; I've never dealt with these before now), assuming the hash table comes in the package as an existing data object.

elinw commented 3 years ago

I actually updated slightly based on reading about performance. You could store the hashed dictionaries like you do now. I was thinking about the concatenation issue and I don't think it is a deal breaker; you could let people add to the dictionary with a function that automates broad[["amel"]] <- "male", e.g. add_terms(key, value, dictionary).

I really think this package is going to be useful. In my same survey I have an open-ended race/ethnicity question so this is a very general problem. Not right now, but that's something else to think about longer term.
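The add_terms() idea above could be sketched like this (a hypothetical helper, not part of gendercoder; the dictionary entries are invented):

```r
# Hypothetical helper: add or overwrite entries in a list-based dictionary.
# Not part of gendercoder; the name comes from the suggestion above.
add_terms <- function(dictionary, key, value) {
  stopifnot(length(key) == length(value))
  dictionary[tolower(key)] <- value
  dictionary
}

# Invented example dictionary:
broad <- list("m" = "male", "f" = "female")
broad <- add_terms(broad, "amel", "male")
broad[["amel"]]  # "male"
```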

elinw commented 3 years ago

Oh that is a stupid mistake in the fill, my bad for just throwing it in there. The right hand side needs to also pick the right rows.

students$gender2[-matched_index] <- students$gender[-matched_index]

This compares a function version of my code with recode_gender() which I modified to not have the message, which obviously slows things down.

> microbenchmark::microbenchmark(
+   recode_gender(students$gender, broad, fill = TRUE),
+   recode_hash(students$gender, dict_broad, fill = TRUE),
+   times = 10L^4
+ )
Unit: milliseconds
                                                  expr      min       lq      mean    median        uq      max neval cld
    recode_gender(students$gender, broad, fill = TRUE) 8.347974 9.216002 11.925800 10.361171 13.866459 286.3089 10000   b
 recode_hash(students$gender, dict_broad, fill = TRUE) 1.525007 1.612924  2.093359  1.761678  2.489311  17.0151 10000  a 

I would also think about adding a verbose argument so that people could suppress that message if they want to use the function programmatically.

elinw commented 3 years ago

So having dived into this some more, I actually profiled using a named vector and that was faster. I'm sure it depends on the size of the dictionary (if it had a million entries it would be slower) but maybe at a scale for this kind of action that would make sense.

ekothe commented 3 years ago

The question about performance gains of different methods is really interesting to me. However, I do think some context around expected size of both the data sets and dictionaries might help to place the expected gains from any kind of refactoring.

... How big are people's surveys expected to be? What kind of performance gains are we talking about? I worked with surveys with 100k+ free text responses and using something like purrr would not be a bottleneck.

It is hard to get clear numbers on this, but it might be useful to consider something like the UK Census to get a reasonable upper limit on dataset size. The UK Census has a free text gender response, but it is only completed by people who indicate that their sex assigned at birth is different from their gender identity (afaik it's the largest English language census with this kind of question). Stonewall UK estimates that about 1% of the UK population is trans, given the last UK Census was completed by approximately 56 million people, we could estimate that 560,000 people could provide a free-text response to this item (note that this is a dubious calculation because the question asks about legal gender and so some trans people will not provide a free text response and there has been some negative publicity that might add to malicious responses - however it provides a starting point for scoping the issue). I think in general, with current approaches at least, most people who are collecting gender identity from a very large number of people ask the question in such a way that most participants can provide a response from a forced choice (e.g. male/female) and only a subset of participants provide some free-text information. In data sets where all gender is recorded in free-text I would expect that the (small) number of responses would be completely undetectable (e.g. in psychology I'd expect to see datasets with 100-1000 participants).

So having dived into this some more, I actually profiled using a named vector and that was faster. I'm sure it depends on the size of the dictionary (if it had a million entries it would be slower) but maybe at a scale for this kind of action that would make sense.

In terms of the size of the dictionary, the most recent "Gender Census" (see gendercensus.com) collected data from 44,583 people who reported that their gender identity is not fully captured by the gender binary. From this there were 7,994 unique textbox entries for gender identity. With many of these being responses that would never be in a dictionary like gendercoder:: (e.g. "just [name]").

juliasilge commented 3 years ago

Package Review

Documentation

The package includes all the following forms of documentation:

I recommend adding a full descriptive title to the vignette. Also I would edit to always use "built-in" or "inbuilt" but not switch back and forth.

I think there is a mistake in the example for the app; it says scivrs_app().

I recommend adding a second, short vignette on how, when, why to add to an existing dictionary or add a new dictionary, based on the content in data/howtoaddtodictionary.R. You can link to that from the README's contributing section.

Functionality

I know there has been some discussion of performance previously on this thread; my opinion is that purrr::map(dictionary, tolower) and purrr::flatten_chr(recoded_list) are very reasonable in this context (purrr is pretty darn lightweight) and that the use of lists for matching is great. 👍 My opinion is that this isn't a package whose applications make it a sensible place to invest more effort in performance improvements beyond what is already here.

Can you add the coverage badge to your README? You don't have any tests for the app, which isn't great; the rOpenSci package guide recommends shinytest.

Something is up with the pkgdown site, as it isn't up-to-date with the package (no info on the app).

Estimated hours spent reviewing: 3


Review Comments

I'd like to congratulate the package authors on a thoughtful handling of this important topic and for providing a tool that will help more survey creators and analysts ask free-text gender questions. This package is in excellent shape, and I think the choice of the main function API is good (input a character vector, return a character vector).

Did you consider any other names for the fill argument? It means something somewhat different in tidyr, which may be confusing to users.

My first major feedback is that this package uses the terms "male" and "female" for gender, but in my experience most practitioners are moving away from these terms for gender, instead using them for sex. If you think about how people are more deeply understanding sex vs. gender, this package seems to be more about gender than sex (this is typically what survey questions are getting at), and better terms for that would be "man" and "woman". I strongly recommend changing this throughout the documentation and the data dictionaries.

The second major feedback is that I find the terms "broad" and "narrow" so confusing because to me, I would have flipped them, with the broad dictionary bucketing responses into "broadly" into categories all together and the narrow dictionary keeping lots of detailed categories. Is this just me? Did you all talk about any other names for the two options that might be more "broadly" 😩 understood?

Another note is that the discussions of context in the vignette and README are great, but I would also add an acknowledgement that our understanding is changing with time and that is expected and OK, etc. You can commit to changing these mappings as our understanding grows/changes.

Again, amazing job on creating this tool and making it available to the data science community. 🙌 I wish I would have had it when I was working on the Stack Overflow survey!

Lingtax commented 3 years ago

Thanks Julia. Apologies as I introduced the app as a half-baked idea after review had already started hence the bugs, poor documentation, and lack of tests. I've added some commits that hopefully will address those.

We'll address the documentation and vignette more extensively and have a discussion about how to proceed with the terminology. I agree re: "man / woman" and "fill", though I'm unsure what to do on the latter front; retain_unmatched is more explicit, but it's quite clunky. Regarding broad and narrow, I personally find the current terminology intuitive, but I don't know how idiosyncratic that is, so it's good to note and we might do some polling.

Thank you again for your time and feedback, it's greatly appreciated.

ldecicco-USGS commented 3 years ago

Thanks @juliasilge for the thorough review! @elinw , you've provided some feedback - I just wanted to check that you are also planning to submit a final review (preferably using the template found here).

elinw commented 3 years ago

Here you go:

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

Documentation

The package includes all the following forms of documentation:

Functionality

Estimated hours spent reviewing: 5


Review Comments

I really like the concept of this package and think it will get a lot of use as open-ended gender items on surveys become increasingly common. This is part of a larger category of such questions. For example, I recently asked about race/ethnicity in this format. I also asked the incoming cohort of first-year students what major they were thinking of, and that also needed coding. Further, many surveys include an “other ____” category, which is often difficult to deal with. I’d imagine that this package would be extremely helpful.

Building on this comment: in reality, someone could add any dictionaries they wanted to this. That is, there is nothing inherently gender-specific about the machinery. I think you should consider renaming the package to something more generic and then simply saying you include the gender dictionaries as examples. I think that small change would have a huge impact on use.

The other point I would make about the dictionaries you have is that they are in English. Obviously this is not something that only applies in Anglophone settings. I think that you should change the names of the dictionaries to reflect the language in some consistent pattern. I’d suggest using the ISO IDs so you can differentiate EN-AU, EN-UK and EN-US, or more importantly different versions of Spanish and Chinese.

Shiny App

I attempted to pull in a large csv file which meant I couldn’t see the gender column. I’m sure there is a solution to that (maybe just display the gender column once it is selected?).

I was also surprised that it didn’t support Rda or Rds file types.

Documentation of the use of the Shiny app would be helpful.

Coding

As I mentioned, I do not think a list is the best way to store the dictionaries, for a number of reasons. First, based on my testing, using a named vector instead gets you a substantial performance improvement. I didn’t see any major performance benefit from a hash table, so I think the ease of use of a named vector makes it the way to go, though with a much larger file the hash would pay off. You could rewrite the function like this:

gender_named_vector <- function(x, nv, fill){
  recoded <- nv[trimws(tolower(x))]
  if (fill == TRUE){
    recoded <- ifelse(is.na(recoded), x, recoded)
  }
  unname(recoded)
}

If you have a vector, a lot of other things become simpler. For example you just can use tolower() to make all of the characters lower case.
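For instance, the lower-casing point above could look like this (a small sketch with an invented example dictionary):

```r
# Invented example dictionary as a named vector (keys are free-text inputs):
nv <- c("M" = "male", "Woman" = "female", "cis female" = "female")

# With a vector, normalising all the keys is a single call:
names(nv) <- tolower(names(nv))

# Lookup is then plain subsetting; unname() drops the result's names:
unname(nv[tolower(c("m", "WOMAN"))])  # "male" "female"
```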

The gender_code() function should have a .data argument, because you should not be required to use the pipe. (See below.)

One of the nice things about this package is that the code is very straightforward and cleanly written.

You have examples and documentation for the internal function which_is_na(). This means you’ll get a manual page that you don’t want, and I think CRAN will flag it. Also, in the examples you should namespace the external functions. CRAN would probably also like the function descriptions to start with an upper-case letter.

Also, try not to rely on other packages in your examples; if you do, namespace them. And think twice about using %>%. There's no reason to use it; you can change the function to take a .data argument instead.
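A data-frame-first signature along those lines might look like this (a hypothetical sketch, not the package's API; it assumes the dictionary is stored as a named character vector):

```r
# Hypothetical data-frame-first recoder; names and behaviour are illustrative.
recode_gender_df <- function(.data, col, dictionary, fill = FALSE) {
  keys <- tolower(trimws(.data[[col]]))
  recoded <- unname(dictionary[keys])   # dictionary: named character vector
  if (fill) recoded[is.na(recoded)] <- .data[[col]][is.na(recoded)]
  .data[[paste0(col, "_recoded")]] <- recoded
  .data
}

# Invented dictionary and data, no pipe required:
dict <- c("m" = "male", "woman" = "female")
df <- data.frame(gender = c("M", "Woman", "agender"),
                 stringsAsFactors = FALSE)
recode_gender_df(df, "gender", dict, fill = TRUE)
```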

I will definitely be recommending this package.

elinw commented 3 years ago

Also I agree with @juliasilge about broad and narrow, I thought they would have the opposite meanings. I also agree that you need to think of your dictionaries as living because these ideas are changing and that's how it is both with language and social constructions.

juliasilge commented 3 years ago

That's a really interesting point to consider about how general a task this actually is.

ldecicco-USGS commented 3 years ago

Thanks for the review @elinw 🎉 To add another option to the efficiency mix, I was curious whether ifelse was the best way to go. Something like:

library(dplyr)

# here's a dictionary with the dplyr "starwars" data frame:
df_dictionary <- starwars %>% 
  select(homeworld) %>% 
  distinct() %>% 
  na.omit() %>% 
  mutate(homeworld = homeworld,
         new_name = c(paste0("first_", letters), 
                      paste0("second_", letters[1:22])))

#Let's take out 3 worlds to make sure the "fill" works
df_dictionary <- df_dictionary[-1:-3,]

repeat_rows <- 1000000
starwars_big <- data.frame(homeworld = rep(starwars$homeworld,
                                           repeat_rows),
                           species = rep(starwars$species, repeat_rows))

nv <- setNames(df_dictionary$new_name, df_dictionary$homeworld)

gender_named_vector_a <- function(x, nv, fill){
  recoded <- nv[x]
  if (fill == TRUE){
    recoded <- ifelse(is.na(recoded), x, recoded)
  }
  unname( recoded )
}

gender_named_vector_b <- function(x, nv, fill){
  recoded<- nv[x]
  if (fill == TRUE){
    recoded[is.na(recoded)] <-  x[is.na(recoded)]
  }
  unname( recoded )
}

system.time({
  x <- gender_named_vector_a(starwars_big$homeworld, nv,
                              fill = TRUE)
})
   user  system elapsed 
  22.22    1.36   23.58 
system.time({
  y <- gender_named_vector_b(starwars_big$homeworld, nv,
                              fill = TRUE)
})
   user  system elapsed 
   3.91    0.83    4.73 

Otherwise, I did a few tests comparing a named vector, to dplyr's left_join and joins using data.table. All were pretty darn similar to the named vector. So, if it would make more sense to keep the data in a data frame, something like this could work:

dp_join_sm <- function(df, df_dict, key, fill){

  df_recode <- df %>% 
    left_join(df_dict, by = key, na_matches = "never") 

  if (fill == TRUE){
    df_recode$new_name[is.na(df_recode$new_name)] <-  df_recode[[key]][is.na(df_recode$new_name)]
  }

  df_recode <- df_recode[,-which(names(df_recode) == key)]
  names(df_recode)[names(df_recode) == "new_name"] <- key

  return(df_recode)
}

system.time({
   z <- dp_join_sm(starwars_big, df_dictionary, 
                   key = "homeworld", fill = TRUE)
 })
   user  system elapsed 
   6.56    0.86    7.42 

Anyway, just some morning futzing... I know recoding text has always been a bit of a bottleneck for me, so it's always good to re-check performance from time to time.

The ball's back in @ekothe and team's court for a response to the reviewers.

elinw commented 3 years ago

The fill, I think, could also be handled in a different way, but let me write something up. A few clarifications.

ekothe commented 3 years ago

Thank you @elinw and @juliasilge for your reviews (and to @ldecicco-USGS for your contribution as well).

We will be working on this over the next couple of weeks.

There are a number of interesting issues raised. In particular, I notice that @elinw and @juliasilge have both flagged that "broad" and "narrow" had the opposite meaning for them than it does for the gendercoder team. I wonder if this is reflective of Australian English, or simply that when creating the dictionaries we were thinking in terms of the number of categories that inputs would be coded into (such that broad = a large number of categories), whereas broad/narrow would have the opposite meaning if you conceptualise it in terms of the exactness of the matching. Because there seem to be such conflicting interpretations of that terminology, it is obvious that we can't simply resolve it by reversing the terms, so we need to put some thought into a naming convention that is more interpretable to a wider range of people. I would note that we originally called narrow "three categories" but realised that would be problematic if we ever needed to rework it so that there were more than three output categories of gender in the dictionary. We might need to elicit suggestions from a broader community of people to see if we can come up with something that reflects the intended usage of each dictionary more clearly.

elinw commented 3 years ago

One reason it was confusing to me is that there is also a dimension of "restrictive" versus "something opposite", meaning that I wondered whether narrow would have left more of the responses unclassified compared to broad.


ldecicco-USGS commented 3 years ago

Hi @ekothe ! Checking in to see if you have a general timeline for responding to the reviews.

ekothe commented 3 years ago

Thanks everyone for your comments, suggestions, and queries. We’re very fortunate to have such expert and generous reviewers and this package will be vastly improved for your input. I apologise for the delay in putting together this response (at this point it seems old hat to blame the pandemic - but we went back into lockdown for much of this year and it was incredibly disruptive).

Responses to comments are below.

The largest change we’ve made to the package is removing the shiny app from the master branch. This feature requires more dedicated development to ensure that documentation and tests are complete to a high standard. At this stage, we feel that incorporating such changes is an unnecessary delay to having a version of the package ready for release. We deeply appreciate the comments the reviewers have provided on the shiny app and will incorporate them into the development of the app in a later release of the package.

Response to @juliasilge

I recommend adding a full descriptive title to the vignette. Also I would edit to always use "built-in" or "inbuilt" but not switch back and forth. Response: Thank you, we agree that these changes improve clarity. We have updated the vignette title and descriptive text accordingly.

Corrected in a5bf20c

I think there is a mistake in the example for the app; it says scivrs_app(). Response: Costs of naive borrowing. This is now corrected. However, as noted - the app has been moved to a dev-branch to allow for further development.

Corrected in c84e7a6

I recommend adding a second, short vignette on how, when, why to add to an existing dictionary or add a new dictionary, based on the content in data/howtoaddtodictionary.R. You can link to that from the README's contributing section.

Response: Done.

Added in 432b45a

I know there has been some discussion of performance previously on this thread; my opinion is that purrr::map(dictionary, tolower) and purrr::flatten_chr(recoded_list) are very reasonable in this context (purrr is pretty darn lightweight) and that the use of lists for matching is great. 👍 My opinion is that this isn't a package whose applications make it a sensible place to invest more effort in performance improvements beyond what is already here.

Response: The core function has been updated to use named vector dictionaries. Thanks for your advice, testing, and code @elinw and @ldecicco-USGS.

Corrected in 42e150f

Can you add the coverage badge to your README? You don't have any tests for the app, which isn't great; the rOpenSci package guide recommends shinytest.

Response: Implementing shinytest via GitHub Actions has presented some unexpected challenges, partly because the app is not currently well implemented. As outlined above, we have moved the shiny app to a dev branch. We have added a coverage badge for the master branch.

Coverage badge added in bfdeedc

Something is up with the pkgdown site, as it isn't up-to-date with the package (no info on the app). Response: Thank you, we’ve now implemented a github action to automate rebuilding of the pkgdown website. This should resolve this issue.

Corrected in 7bcd5cc

Should the author(s) deem it appropriate, I agree to be acknowledged as a package reviewer ("rev" role) in the package DESCRIPTION file.

Response: We are very pleased to acknowledge your contribution to the package and have added you using the reviewer role in the DESCRIPTION file.

Corrected in 3433bf1

Did you consider any other names for the fill argument? It means something somewhat different in tidyr, which may be confusing to users.

Response: This is a great UX comment, thanks. We’ve replaced the fill argument with retain_unmatched.
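For illustration, a minimal sketch of how the renamed argument reads in use (assuming the recode_gender() interface described in this thread; the inputs and the choice of dictionary here are illustrative):

```r
library(gendercoder)

responses <- c("male", "Woman", "enby", "dragonkin")

# retain_unmatched = TRUE passes unrecognised free-text through unchanged
recode_gender(responses, dictionary = manylevels_en, retain_unmatched = TRUE)

# retain_unmatched = FALSE returns NA for unmatched entries instead
recode_gender(responses, dictionary = manylevels_en, retain_unmatched = FALSE)
```

The new name arguably reads more naturally at the call site, too: it describes what happens to unmatched responses rather than overloading tidyr's "fill" vocabulary.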

Corrected in 02225cc

My first major feedback is that this package uses the terms "male" and "female" for gender, but in my experience most practitioners are moving away from these terms for gender, instead reserving them for sex. If you think about how people are more deeply understanding sex vs. gender, this package seems to be more about gender than sex (this is typically what survey questions are getting at) and better terms for that would be "man" and "woman". I strongly recommend changing this throughout the documentation and the data dictionaries. Response: We have updated accordingly, as this is very much more in keeping with our intent even though it’s inconsistent with current practice. One point of tension is our ignorance of any age-agnostic gender labels, so this change has resulted in the addition of extra outcome labels (i.e. "male" now splits into "man" and "boy" conditional on inputs).

Corrected in 2d10761f

The second major feedback is that I find the terms "broad" and "narrow" so confusing because to me, I would have flipped them, with the broad dictionary bucketing responses "broadly" into categories all together and the narrow dictionary keeping lots of detailed categories. Is this just me? Did you all talk about any other names for the two options that might be more "broadly" 😩 understood? Response: Thank you for this feedback. When we were first working on this project we had dictionary names that corresponded to the number of categories that gender was recoded into, but that quickly became unworkable as we realised the number of potential categories we needed to deal with. After much discussion among the team we’ve settled on “manylevels_en” and “fewlevels_en” to indicate the number of levels of the factor that are created when using each dictionary. We hope that this improves the clarity of the dictionary names and thus the usability of the package.
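To make the new naming concrete, a small sketch contrasting the two dictionaries (assuming fewlevels_en and manylevels_en are exported as described above; the exact output labels depend on the dictionary contents):

```r
library(gendercoder)

responses <- c("f", "cis male", "non-binary")

# fewlevels_en collapses responses into a small number of output levels
recode_gender(responses, dictionary = fewlevels_en)

# manylevels_en retains many more distinct gender identities
recode_gender(responses, dictionary = manylevels_en)
```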

Corrected in 1e1fa73

Another note is that the discussions of context in the vignette and README are great, but I would also add an acknowledgement that our understanding is changing with time and that is expected and OK, etc. You can commit to changing these mappings as our understanding grows/changes. Response: Thank you for this helpful suggestion. I’ve added some text to this effect.

Corrected in

Response to @elinw

Should the author(s) deem it appropriate, I agree to be acknowledged as a package reviewer ("rev" role) in the package DESCRIPTION file.

Response: We are very pleased to acknowledge your contribution to the package and have added you using the reviewer role in the DESCRIPTION file.

Corrected in 3433bf1

I really like the concept of this package and think it will get a lot of use as the asking of open-ended gender items on surveys becomes increasingly common. This is part of a larger category of such questions. For example, I recently asked about race/ethnicity in this format. I also asked the incoming cohort of first-year students what major they were thinking of, and that also needed coding. Further, many surveys include an “other ____” category which is often difficult to deal with. I’d imagine that this package would be extremely helpful. Building on this comment, in reality, someone could add any dictionaries they wanted to this. That is, there is nothing about it that is inherently tied to gender. I think you should consider whether you rename it to something more generic and then simply say you are including the gender dictionaries as examples. I think that small change will have a huge impact on use. Response: We absolutely agree that this represents one instance of a more general task, but we are concerned that making this a more general package (even while retaining the recode_gender function within it) might compromise the findability and uptake of this core function. We think this is a pressing issue to address and so would rather not compromise on that core aim, and instead will spin off another coding package using the same skeleton function for ethnicity and nationality (@Lingtax is particularly taken by a one-line conversion from country to United Nations geoscheme / World Bank / CIA World Factbook / ICANN regions).

Corrected in NA

The other point I would make about the dictionaries you have is that they are in English. Obviously this is not something that only applies in Anglophone settings. I think that you should change the names of the dictionaries to reflect the language in some consistent pattern. I’d suggest using the ISO IDs so you can differentiate EN-AU, EN-UK and EN-US, or more importantly different versions of Spanish and Chinese. Response: We have added language based suffixes according to ISO 639-1, but felt regionalisation was unnecessary for these data.

Corrected in 54a7a7a

Shiny App

I attempted to pull in a large csv file which meant I couldn’t see the gender column. I’m sure there is a solution to that (maybe just display the gender column once it is selected?). Response: As you can see the Shiny app was accidentally added to the main branch during review, so was half-baked in many ways. This has now been corrected as per your suggestion.

Corrected in b153f06

I was also surprised that it didn’t support Rda or Rds file types. Response: @Lingtax didn’t originally add support for R data formats as he presumed anyone using these formats would preferentially use the functions directly. The package now supports Rds, but Rda presents problems (as load() returns data objects to the environment in their state as saved, so he couldn’t make shiny cooperate).

Corrected in cb1ad84

Documentation of the use of the Shiny app would be helpful. Response: The largest change we’ve made to the package is removing the shiny app from the master branch. This feature requires more dedicated development to ensure that documentation and tests are complete to a high standard. At this stage, we feel that incorporating such changes is an unnecessary delay to having a version of the package ready for release. We deeply appreciate the comments the reviewers have provided on the shiny app and will incorporate them into the development of the app in a later release of the package.

Coding

As I mentioned I do not think a list is the best way to store the dictionaries for a number of reasons. First, based on my testing that using a named vector instead gets you substantial performance improvement. I didn’t see any major performance benefit of using a hash table, so I think the ease of use of a named vector makes it the way to go… Response: The core function has been updated to use named vector dictionaries. Thanks for your advice, testing, and code @elinw and @ldecicco-USGS.
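As a base-R illustration of the named-vector technique under discussion (using a hypothetical toy dictionary, not the package's actual data):

```r
# a list-style dictionary, as the package originally stored it
dict_list <- list(m = "man", male = "man", f = "woman", female = "woman")

# flatten to a named character vector: names are inputs, values are outputs
dict <- unlist(dict_list)

responses <- c("f", "male", "nonbinary")

# single vectorised lookup by name; unmatched inputs come back as NA
recoded <- unname(dict[tolower(responses)])
recoded
#> [1] "woman" "man"   NA
```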

Corrected in 42e150f

The gender_code() function should have a .data argument because you should not be required to use the pipe. (See below). Response: The function has never required the pipe; having the vector input as the first argument makes it compatible with the pipe while still permitting users to apply the function using either $ or [] notation, as shown in the updated documentation.

One of the nice things about this package is that the code is very straightforward and cleanly written. Response: Thanks! We’re pleased that the amateur process has not compromised the product.

You have examples and documentation for the internal function which_is_na(). This means you’ll get a manual page that you don’t want. Also I think that CRAN will flag that. Also in the examples you should namespace the external functions. CRAN would probably also like the function descriptions to start with an upper-case first letter. Response: The documentation is removed for which_is_na().

Corrected in 9250b80

Also try not to rely on other packages in your examples. If you do, namespace them. However think twice about using %>%. There's no reason to use it, you can change it to use a .data argument. Response: Examples are now independent of other packages.

Corrected in 1dcebb8

juliasilge commented 3 years ago

This is looking great!

ekothe commented 2 years ago

Thanks @juliasilge - I've removed the conflicted text (there was also some in "manylevels_en.Rd").

We really appreciate the feedback on the dictionary names - even as a user of the package I'm finding them more intuitive 😄

juliasilge commented 2 years ago

I've got a couple of "clean up" notes from my previous review, I believe:

You can link to that from the README's contributing section.