ropensci / coder

Classification of Cases into Deterministic Categories
https://docs.ropensci.org/coder/
22 stars 4 forks source link

Review comments from drgtwo #90

Closed eribul closed 3 years ago

eribul commented 4 years ago

Documentation on a "real" use case

Duplicate names to codify

people_doubled <- rbind(ex_people, ex_people)
codify(people_doubled, ex_icd10, id = "name", date = "event", days = c(-365, 0))

More importantly, don't there exist use cases for categorize where there are multiple events for the same patient, with different dates? Examples could include adverse events after starting multiple lines of therapy, or comorbidities before multiple diagnoses. In those cases, doesn't it make sense to return one row for each event, even if there are multiple for a patient? Should the check only error out when there are duplicate name/date pairs?

as.codedata

I think the as.codedata() approach can be improved to make the package more understandable and usable. Some issues:

It looks to me like the main reason for as.codedata() is to speed up the function by making it a data.table and setting keys. But you could do this within codify() as well; the only advantage this provides is if you run many codings with different ids/dates (or different arguments) while keeping the code data the same. I've done some benchmarking and it looks like the improvements become visible (in tens of milliseconds) when there are around million coded events.

Do we expect it to be common for users to run the package with millions of coding events, where the codedata stays the same while the input events change, and in an environment where fractions of a second matter? Is this common enough to be worth imposing extra instructions on every user of the package?

My recommendation is

regex_ in column names when tech_names = TRUE

It seems like the tech_names argument is designed to fix this, but it leaves prefixes like charlsonregex on every column name, which will need to be removed for meaningful downstream analysis. How about removing the charlsonregex, or at least the regex, in these cases? (Indeed, is there a reason that the charlson classcodes object itself has to have the regex prefixes? It already has an attribute regexprs that includes those column names). Besides which, perhaps consider leaving tech_names to default to TRUE for the reasons described above.

tibbles and data.tables

I recommend ending the functions with something like

# Where data was the argument passed in, and ret is what's about to be returned
if (tibble::is_tibble(data)) {
  ret <- tibble::as_tibble(ret)
}

This would mean that it returns a data.table when it's passed a data.frame or data.table, but a tibble if and only if it's passed a tibble. Admittedly, this requires adding an import for tibble (which perhaps is why it wasn't done), but since tibble is imported by 800 CRAN packages (including dplyr + ggplot2, each depended on by ~2000 packages) it's a fairly low-impact dependency. This also doesn't strike me as a utility package that will frequently be installed in production systems; it's a scientific package that would typically used with other data analysis tools. I think there are some useful thoughts on tibble dependencies here.

Naming

eribul commented 3 years ago

All issues adressed and formal answer send!