Documentation on a "real" use case

[x] #126 One thing that's missing from the documentation (for functions, the README, and the vignettes) is an example of what someone would do with the output of the package. The example in the "coder" vignette isn't healthcare-related, instead focusing on a toy example involving patients' cars. Perhaps replace it with a vignette that works through a "real" example and shows at least one result similar to what your typical next step would be. Besides being more interesting, it serves as documentation for the output: if you created a histogram of Charlson indices, it brings the user's attention to the importance of that column.
You could add something from the result of your ex_people and ex_icd10, but that join has only a single positive result (one patient with peripheral vascular disease). Of course real healthcare data is necessarily private, but you could instead consider taking a small sample of rows and columns from the SynPUF data, which is synthetic emergency room data including admissions and diagnoses. After using coder to determine what diagnoses occurred after an emergency room visit, the vignette could get one or two results from the data (e.g. showing the average Charlson comorbidity index of emergency room admissions over time, creating a histogram of them, or showing the most common diagnoses within the window). This would help communicate why it's helpful to annotate a dataset in this way.
(You don't need to use a sample from SynPUF if you don't want to; you could also just construct the simulated data to have more joined diagnoses).
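For instance, the histogram suggestion might look something like this. This is a minimal sketch only: it assumes the categorize() result is stored as `categorized` with a numeric Charlson index column named `charlson` (both are placeholder names, not taken from the package).

```r
library(ggplot2)

# Sketch: `categorized` and its `charlson` column are assumed names for a
# categorize() result with a Charlson comorbidity index attached
ggplot(categorized, aes(x = charlson)) +
  geom_histogram(binwidth = 1) +
  labs(x = "Charlson comorbidity index", y = "Number of admissions")
```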
[x] #114 The README is solid, but it jumps immediately into examples of simulated data without discussing why someone might want to join diagnoses in the previous year. This doesn't have to take much text; it could be a short bulleted list of use cases like "Discovering adverse events after surgery"/"Determining comorbidities before clinical trials." It's likely that users already know their use case, but an example lets them recognize it and think "this package is for me!"
[x] #115 A second piece of advice with the README is to start with an example that doesn't include a date column before showing the one that does, to ease the user into the use of the categorize function with relatively few arguments.
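For #115, that first example might look something like the following. This is a sketch only; the argument names are my assumptions based on the documented interface, so the exact call may differ.

```r
library(coder)

# Simplest case: categorize comorbidities with no date window, so only a
# few arguments are needed (argument names here are assumptions)
categorize(ex_people, codedata = ex_icd10, cc = "charlson",
           id = "name", code = "icd10")
```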
[x] The comorbidities vignette and the JOSS paper are very well done in terms of giving the appropriate level of background and describing the use case.
Duplicate names to codify
[x] #116 If there are duplicate names in the data passed to codify(), it returns a data.table error that isn't informative about how to fix the problem. (categorize() does catch this with "Non-unique ids!", but codify() doesn't.)
```r
library(coder)

# Duplicating the example patients triggers the uninformative data.table error
people_doubled <- rbind(ex_people, ex_people)
codify(people_doubled, ex_icd10, id = "name", date = "event", days = c(-365, 0))
```
More importantly, aren't there use cases for categorize() where there are multiple events for the same patient with different dates? Examples could include adverse events after starting multiple lines of therapy, or comorbidities before multiple diagnoses. In those cases, doesn't it make sense to return one row for each event, even if there are multiple for a patient? Should the check only error out when there are duplicate name/date pairs?
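If so, the relaxed check might look something like this hypothetical helper (not part of coder; just a sketch of the behavior I'm suggesting):

```r
# Hypothetical check: allow repeated patients, but stop on repeated
# patient/date pairs, which are genuinely ambiguous
check_unique_events <- function(x, id = "name", date = "event") {
  if (anyDuplicated(x[, c(id, date)]) > 0) {
    stop("Non-unique ", id, "/", date, " pairs!", call. = FALSE)
  }
  invisible(x)
}
```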
as.codedata
I think the as.codedata() approach can be improved to make the package more understandable and usable. Some issues:
- By convention, as.X functions in R return an object of class X, but this returns a data.table.
- codify() describes the second argument as "output from as.codedata", but the function still works if given a data frame, data.table, or tibble.
- By default, as.codedata() filters out dates in the future and dates before 1970. I assume this is meant to remove bad data, but isn't it better to leave such data quality filters to the user? As it is, the user must go through a few pages of documentation (codify/categorize -> as.codedata -> dates_within) to learn about this behavior. And in any case where there's a date window, the extreme date values won't affect the coding anyway.
It looks to me like the main reason for as.codedata() is to speed up the function by making it a data.table and setting keys. But you could do this within codify() as well; the only advantage this provides is if you run many codings with different ids/dates (or different arguments) while keeping the code data the same. I've done some benchmarking, and it looks like the improvements become visible (in tens of milliseconds) when there are around a million coded events.
Do we expect it to be common for users to run the package with millions of coding events, where the codedata stays the same while the input events change, and in an environment where fractions of a second matter? Is this common enough to be worth imposing extra instructions on every user of the package?
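For reference, the kind of benchmark I ran looks roughly like this. It's a sketch: the data size, code values, and the code data's column names are all illustrative assumptions, and it relies on codify() accepting a plain data frame as noted above.

```r
library(coder)
library(bench)

# Simulate about a million coded events (column names are assumptions)
n <- 1e6
codes <- data.frame(
  name      = sample(ex_people$name, n, replace = TRUE),
  code      = sample(c("I20.9", "I21.9", "I73.9", "E10.9"), n, replace = TRUE),
  code_date = Sys.Date() - sample(0:3650, n, replace = TRUE)
)
prepared <- as.codedata(codes)

# Compare codify() on the raw table vs. the pre-keyed version
bench::mark(
  raw      = codify(ex_people, codes,    id = "name", date = "event",
                    days = c(-365, 0)),
  prepared = codify(ex_people, prepared, id = "name", date = "event",
                    days = c(-365, 0)),
  check = FALSE
)
```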
My recommendations are:
[x] #117 Don't export as.codedata; instead, do the preprocessing/checking of codedata within the codify function rather than suggesting that the user call as.codedata.
[x] #118 In the documentation for codify and classify, as well as your documentation examples, describe the codedata input as "a table with columns id, code, and optionally date."
[x] #117 If you're very confident that performance matters when keeping the codedata the same across many different input datasets, you could add a function called codedata_prepare or prepare_codedata that does the conversion to data.table and sets the keys, and describe that in the details section of the documentation. But I'd want to understand why that's a typical use case.
[x] #119 Relatedly, I recommend removing (or at least making internal) dates_within() and filter_dates(). Their purpose (applying a filter on dates with some defaults) has no relationship to the rest of the package, and it's something users can do themselves with the tools they're accustomed to (base R, data.table, or dplyr), as sketched after this list.
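For example, the default filter in dates_within() amounts to something a user could write in one line with dplyr. This sketch reuses the simulated `codes` table from the benchmark above, and `code_date` remains an assumed column name:

```r
library(dplyr)

# Drop events dated before 1970 or in the future -- the user's own version
# of the package's default date filter (`code_date` is an assumed name)
codes_clean <- codes %>%
  filter(code_date >= as.Date("1970-01-01"), code_date <= Sys.Date())
```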
regex_ in column names when tech_names = TRUE
[x] #120 The output of categorize() on a table returns columns with spaces in their names. This isn't well set up for additional analysis, since it makes it difficult to do any kind of programming with them, including using data.table to filter for one diagnosis or to aggregate the percentage of patients (perhaps within each group) that have a condition. Spaces are nice for displaying the names in a table, but is it a common use case to display individual patients in a table (as opposed to aggregated statistics)?
It seems like the tech_names argument is designed to fix this, but it leaves prefixes like charlsonregex on every column name, which will need to be removed for meaningful downstream analysis. How about removing the charlsonregex, or at least the regex, in these cases? (Indeed, is there a reason that the charlson classcodes object itself has to have the regex prefixes? It already has an attribute regexprs that includes those column names.) Besides which, perhaps consider letting tech_names default to TRUE for the reasons described above.
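To make this concrete, here's a sketch of what downstream code has to do today (`categorized` again stands in for a categorize() result; the column and prefix names are assumptions based on the behavior described above):

```r
library(dplyr)

# Column names with spaces force backtick-quoting in every expression
categorized %>%
  count(`peripheral vascular disease`)

# And with tech_names = TRUE, the prefix still has to be stripped by hand
names(categorized) <- sub("^charlsonregex", "", names(categorized))
```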
tibbles and data.tables
[x] #121 Your examples like ex_people are tibbles, but when categorize() or codify() is passed a tibble, it returns a data.table. This would be surprising behavior for people using the package within a tidyverse workflow. I think data.table is a terrific package, but there's no reason to surprise users with the data type if they're not accustomed to it. (And the fact that the example datasets are tibbles rather than data.frames or data.tables adds to the inconsistency a bit.)
I recommend ending the functions with something like
```r
# Where `data` was the argument passed in, and `ret` is what's about to be returned
if (tibble::is_tibble(data)) {
  ret <- tibble::as_tibble(ret)
}
```
This would mean that it returns a data.table when it's passed a data.frame or data.table, but a tibble if and only if it's passed a tibble. Admittedly, this requires adding an import for tibble (which perhaps is why it wasn't done), but since tibble is imported by 800 CRAN packages (including dplyr and ggplot2, each depended on by ~2000 packages) it's a fairly low-impact dependency. This also doesn't strike me as a utility package that will frequently be installed in production systems; it's a scientific package that would typically be used with other data analysis tools. I think there are some useful thoughts on tibble dependencies here.
[x] #122 Relatedly (though less important), the example datasets don't print as tibbles by default. If you follow the instructions in usethis::use_tibble(), you could support printing them as tibbles even when the tibble/dplyr packages aren't loaded. The additional advantage is that you could get rid of most of the uses of head() in the README, making your examples more concise and focused on your use case.
Naming
[x] #123 index and especially visualize are very generic names for very specific functions, and they don't give any hints about what they're used for. How about visualize_classcodes?
[x] #124 An alternative for function naming is to have a common prefix, e.g., coder_classify, coder_categorize, coder_index, coder_codify, coder_visualize. This has the advantages of both ensuring the names don't overlap with other packages and making it easy to find coder-related functions with autocomplete. But that's just a suggestion.
[x] #125 I agree with Noam that coder isn't an ideal package name, if only because it makes the online resources a bit harder for users to find. Try Googling "coder", "R coder", or "R coder github"! But if it's too late to change the name, I don't consider it a dealbreaker.