[x] #91 The statement of need could be fleshed out more. As a biostatistician working in the medical field, I was able to get a sense of what the package is meant to do since I'm familiar with ICD codes and co-morbidity indices such as Charlson. But some more details about the problem this solves, and who it is meant to help, could be usefully added. I believe that the package can take a dataset where patients have various codes, such as ICD codes, with associated dates, and it then uses regular expressions to identify specific codes of interest for creating, for example, the Charlson comorbidity index. Is that right? I understand that you are trying to be vague since the package could be used generally for any application, but I find that in not being specific about the medical application at all times it is harder to understand the core functionality. A key to clarity of purpose may lie in terminology, and clearly defining the different pieces of information this package can bring together.
[x] #92 There are currently some errors printed in the output on the README. Those errors were not replicated when I ran the code locally. But the README needs to be updated without errors.
coder vignette
[x] #93 I would have preferred that the trivial example from the vignette coder actually be a simple example of classifying patients into groups based on ICD codes, rather than an unrelated example based on car makes and models. You could also show in this vignette how the ex_carbrands classcodes object was created. I found myself wondering that as I read the vignette.
[x] #93 Under the "Interpretation" section, the listed names are mostly not in line with the actual output printed above. I assume you changed the contents of the example datasets and did not change the corresponding vignette text so this should be updated.
[x] #94 What was the index based on in this example? i.e. how did the function categorize() know that the index should be a sum of the matched cars owned by each person? In the documentation for the function categorize it says that a value of NULL would lead to the use of all available indices, but what does that mean exactly? Available from where?
[x] #93 In fact, the example on the README page is much more relevant and could easily be expanded to demonstrate the functionality in place of the car brand example. Maybe both are not needed.
classcodes vignette
[x] #95 I liked the output of the visualize() function, but seeing an example of an ICD code that would be identified by the given regular expression could be helpful to the less experienced reader. Presumably many users are familiar with ICD codes generally, but for the less experienced user this would be valuable.
[x] #96 I found the classcodes vignette a little hard to follow and found myself wondering throughout, "when would I use this?" and wishing for a concrete example that was used throughout demonstrating: this is what I want to do, and here is how I use these functions to do it.
Comorbidity vignette
[x] #97 The link to the Elixhauser co-morbidity index is broken
[x] #98 In the Risk Rx V example you mention creating dataframe ex_atc in which patients could have zero, one, or several codes prescribed. However, the way you create the data each patient in fact has at least one code. Either alter the example to match (preferable to demonstrate how patients without codes are handled) or alter the text.
[x] #99 You mention surgery dates farther down in the vignette, but at the beginning only refer to a vague event date. This example would be improved by making it more specific. Who are the patients, what are the dates, what are we trying to do, and why?
[x] #100 I noticed that when the default categorization was used, a nice message was printed in the R console stating what the classification was based on. This is a great use of messages. However, when a different classification was specified using cc_args this message was no longer printed. This could be standardized as I think the messaging is nice, and especially important when the user is changing it from some default.
Documentation/Examples
[x] #101 Both the DESCRIPTION file and coder help file link only to the GitHub repo, even though this package does have a website. Make the link to the website easier to find since it's the best place for new users to start.
Good use of multiple points of entry. I found descriptions of the package through the help file, the README, the website.
[x] #102 The help file for classcodes mentions the "triad of case data, code data, and classification data." I found this to be a real "aha!" moment for understanding your package but this was the first place I saw this particular description. I think you could usefully focus more on this three-piece idea throughout the README, help files, and vignettes, with consistent terminology used throughout. Also, perhaps function and argument names could be altered/unified to reflect the trinity of information sources. I didn't find the names particularly intuitive as they are and was constantly having to reference them again when I was testing the functions.
[x] #103 I'd like to see examples that utilize every option available in each function, rather than only simple examples. I often find myself wanting to use a non-default option in a function and not finding any example that helps me understand how to specify the option. Having simple examples is essential, but also including a more complex example that utilizes more than just the basic function arguments will really help users with more complex problems or more advanced users. For example, the example given in the classcodes help file isn't very useful. The object you pass to as.classcodes() in the example is already a classcodes object. How would I use the argument hierarchy? I don't know because you haven't given an example.
[x] #104 Many Roxygen comments went beyond my 80 character line, making readability difficult. You could use carriage returns to improve readability of comments.
Code is formatted well, and in line with tidy style guidelines.
[x] #105 First, the documentation for filter_dates should note that there is a default lower date limit. I had to go to the documentation of dates_within before I could understand the default behavior of filter_dates. Second, why is the default lower limit set to 1970-01-01? Does that have something to do with the coding schemes available within the package?
[x] #106 No example was included in the help file for set_classcodes
[x] #106 No example was included in the help file for as.keyvalue.classcodes
[x] #107 I see that codify throws an error if the values specified in the date argument aren't of class "Date". However I didn't see anywhere in the documentation information for the user specifying that dates need to be formatted specially. Typically when I read clinical data into R I end up with date columns that are formatted as character and so I would need to convert them to use these functions, so it might be worthwhile to include this detail somewhere.
Tests
[x] #108 There's an issue with the test-as.keyvalue file. It is currently saved as "test-as.keyvalueR" rather than "test-as.keyvalue.R". Also, this test file has an error. I believe instead of a <- as.keyvalue.classcodes(elixhauser, "icd10se") you should have a <- decoder::as.keyvalue(elixhauser, "icd10se"). And similarly within the test.
[x] #109 There are at least two tests named "multiplication works" that are unrelated to multiplication: test-all_classcodes, test-as.keyvalue
Functionality/performance
[x] #110 The use of 365 days to indicate a year is not exact. Depending on the year, some dates could be missed due to leap years. I see that using windows in units of days is an easy coding solution here, but perhaps some more complex options for when time windows are specified in months or years, rather than days, could be supplied as well.
[x] #111 I thought it was strange that codify did not return the same variable name for the date of interest as specified in data. Using your example data, ex_people has a date called event but when I combine it with ex_icd10 using codify the resulting data frame has a date variable named date rather than event. I would prefer to maintain the original variable name. This behavior does not occur in other functions; for example, categorize returns the original date variable name even when arguments are passed to codify.
[x] #112 I expected categorize to be able to take the results of a call to codify as input. This could be a useful alternative specification option, and is being done internally anyhow
[x] I didn't have a large dataset of the type that might be used in this setting to realistically test the claim that the performance is fast.
Package check notes:
[x] #113 I got the following notes about example time and file size when I ran check on my local Windows machine:
installed size is 37.0Mb
sub-directories of 1Mb or more:
Meta 8.0Mb
R 3.0Mb
data 3.0Mb
doc 10.0Mb
help 5.0Mb
html 2.0Mb
Examples with CPU (user + system) or elapsed time > 5s
user system elapsed
codebooks 1.39 2.53 57.08
codebook 1.39 2.41 59.13
categorize 0.23 1.05 22.28
classify 0.13 0.28 7.01
README
coder vignette
classcodes vignette
Comorbidity vignette
Documentation/Examples
Tests
Functionality/performance
Package check notes: