Closed. eribul closed this issue 3 years ago.
Hi @eribul and thanks for your submission!
After discussion, the editors have found your package to be in scope. I'll be assigning a handling editor shortly!
Hi @eribul, thanks for your submission. I am amazed that there is not a package called coder
on CRAN yet. It's not a requirement, but might you consider a more specific name? coder
could invoke lots of meanings depending on the audience. That said, it appears this has already been used for publication.
Here is the output of goodpractice::gp()
. Excellent test coverage! No big deals here, you might want to reflow Roxygen comments for readability, and look if you are importing more than you need, but this is all minor stuff that can be addressed post-review. I am seeking reviewers now.
── GP coder ──────────────────────────
It is good practice to
✖ write unit tests for all functions, and all package code
in general. 97% of code lines are covered by test cases.
R/classify.R:47:NA
R/codebook.R:4:NA
R/codebook.R:5:NA
R/codedata.R:60:NA
R/codify.R:58:NA
... and 4 more lines
✖ avoid long code lines, it is bad for readability. Also,
many people prefer editor windows that are about 80 characters
wide. Try make your lines shorter than 80 characters
R/categorize.R:4:1
R/categorize.R:10:1
R/categorize.R:16:1
R/categorize.R:45:1
R/categorize.R:64:1
... and 30 more lines
✖ not import packages as a whole, as this can cause name
clashes between the imported packages. Instead, import only the
specific functions you need.
───────────────────────────────────
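For the third point, the usual remedy with roxygen2 is to replace a blanket `@import` with `@importFrom` tags listing only the functions used — a sketch (the selected functions are illustrative, not the package's actual imports):

```r
# Before: pulls in every exported name, risking clashes between packages
#' @import data.table

# After: import only what is actually called (illustrative selection)
#' @importFrom data.table as.data.table setkey
#' @importFrom decoder as.keyvalue
NULL
```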
Reviewer 1: @zabore
Reviewer 2: @dgrtwo
Due date: 2020-07-02
Thank you very much @noamross! I am very happy for your response and will just briefly address some of your comments:
I was actually equally amazed regarding the name :-) I do like coder, however, since: 1) I have another CRAN package called decoder, so they fit quite nicely together; 2) I realise there is some ambiguity, but that is kind of intentional since both "medical coding" and "computer coding" are relevant :-); 3) I did use another name initially that was later used by others; 4) as you mention, the package has already been used for some published work.
Regarding a DOI, I would be happy to provide one through Zenodo. Just let me know, and I will fix that!
I realize that importing whole packages is unusual, but one of them is data.table (for which this is recommended), and the other one is my own package decoder, which is intermingled with this package in such a way that this seems reasonable. I have been careful not to depend on any other packages (although a few more are suggested, mostly for development purposes).
I am looking forward to the review process and thank you once more!
Great, @eribul. You don't need a DOI until the end of the review process, when we transfer the review to JOSS.
We have two reviewers: @zabore and @dgrtwo, thank you for volunteering! Due date is 2020-07-02. Please note that this is a pass-through review to JOSS, so please include in your review whether paper/paper.md
conforms to JOSS requirements.
Here is my review!
Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide
The package includes all the following forms of documentation: URL, BugReports and Maintainer (which may be autogenerated via Authors@R). The package contains a paper.md matching JOSS's requirements.
Estimated hours spent reviewing: 6
I have had to classify patients based on data that contained a variety of types of ICD codes in the past and it was incredibly cumbersome, so a package that automates this process for some standard use cases and with customization possible is a great contribution!
- I would have preferred that the trivial example from the coder vignette actually be a simple example of classifying patients into groups based on ICD codes, rather than an unrelated example based on car makes and models. You could also show in this vignette how the ex_carbrands classcodes object was created. I found myself wondering that as I read the vignette.
- How did the function categorize() know that the index should be a sum of the matched cars owned by each person? In the documentation for categorize it says that a value of NULL would lead to the use of all available indices, but what does that mean exactly? Available from where?
- I found the classcodes vignette a little hard to follow and found myself wondering throughout, "when would I use this?" and wishing for a concrete example that was used throughout demonstrating: this is what I want to do, and here is how I use these functions to do it.
- In the comorbidity vignette you mention creating ex_atc, in which patients could have zero, one, or several codes prescribed. However, the way you create the data each patient in fact has at least one code. Either alter the example to match (preferably to demonstrate how patients without codes are handled) or alter the text.
- When the default categorization was used, a nice message was printed stating what the classification was based on. However, when a different classification was specified using cc_args this message was no longer printed. This could be standardized as I think the messaging is nice, and especially important when the user is changing it from some default.
- Both the DESCRIPTION file and the coder help file link only to the GitHub repo, even though this package does have a website. Make the link to the website easier to find since it's the best place for new users to start.
- The help file for classcodes mentions the "triad of case data, code data, and classification data." I found this to be a real "aha!" moment for understanding your package but this was the first place I saw this particular description. I think you could usefully focus more on this three-piece idea throughout the README, help files, and vignettes, with consistent terminology used throughout. Also, perhaps function and argument names could be altered/unified to reflect the trinity of information sources. I didn't find the names particularly intuitive as they are and was constantly having to reference them again when I was testing the functions.
- The example given in the classcodes help file isn't very useful. The object you pass to as.classcodes() in the example is already a classcodes object. How would I use the argument hierarchy? I don't know because you haven't given an example.
- The documentation for filter_dates should note that there is a default lower date limit. I had to go to the documentation of dates_within before I could understand the default behavior of filter_dates. Second, why is the default lower limit set to 1970-01-01? Does that have something to do with the coding schemes available within the package?
- No example was included in the help files for set_classcodes or as.keyvalue.classcodes.
- codify throws an error if the values specified in the date argument aren't of class "Date". However I didn't see anywhere in the documentation information for the user specifying that dates need to be formatted specially. Typically when I read clinical data into R I end up with date columns that are formatted as character and so I would need to convert them to use these functions, so it might be worthwhile to include this detail somewhere.
- In the test file, instead of a <- as.keyvalue.classcodes(elixhauser, "icd10se") you should have a <- decoder::as.keyvalue(elixhauser, "icd10se"). And similarly within the test.
- codify did not return the same variable name for the date of interest as specified in data. Using your example data, ex_people has a date called event but when I combine it with ex_icd10 using codify the resulting data frame has a date variable named date rather than event. I would prefer to maintain the original variable name. This behavior does not occur in other functions; for example, categorize returns the original date variable name even when arguments are passed to codify.
- I would expect categorize to be able to take the results of a call to codify as input. This could be a useful alternative specification option, and is being done internally anyhow.

I got the following notes about example time and file size when I ran check on my local Windows machine:
installed size is 37.0Mb
sub-directories of 1Mb or more:
Meta 8.0Mb
R 3.0Mb
data 3.0Mb
doc 10.0Mb
help 5.0Mb
html 2.0Mb
Examples with CPU (user + system) or elapsed time > 5s
user system elapsed
codebooks 1.39 2.53 57.08
codebook 1.39 2.41 59.13
categorize 0.23 1.05 22.28
classify 0.13 0.28 7.01
Thanks for your thorough review, @zabore. @dgrtwo, please get your review in so that @eribul can respond to your comments jointly.
Apologies for my delay, my review is below!
Erik, am I right that this is your first rOpenSci submission? Congrats and thanks for contributing! 🥳
Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide
The package includes all the following forms of documentation: URL, BugReports and Maintainer (which may be autogenerated via Authors@R).

For packages co-submitting to JOSS:
- [X] The package has an obvious research application according to JOSS's definition
The package contains a
paper.md
matching JOSS's requirements with:
- [X] A short summary describing the high-level functionality of the software
- [X] Authors: A list of authors with their affiliations
- [X] A statement of need clearly stating problems the software is designed to solve and its target audience.
- [X] References: with DOIs for all those that have one (e.g. papers, datasets, software).
Estimated hours spent reviewing: 5
coder
takes a set of patients, links them to diagnosis codes, and links those to healthcare indices. From my brief experience in healthcare I know this is an important use case with many applications. While other packages (which the author cites perfectly in the JOSS manuscript) include tools for calculating such indices, the ability to match many diagnosis codes at once, and to calculate multiple indices in the same way, makes the coder package an important contribution to the field. Great job!
The package follows the rOpenSci guidelines, is thoroughly tested, and takes a high-performance approach. I've installed and checked it, and outside of the one bug described below (duplicate names passed to codify()) I found it to work on my use cases.
I have some recommendations for specific changes below, but of course I'm open to dialogue.
One thing that's missing from the documentation (for functions, the README, and the vignettes) is an example of what someone would do with the output of the package. The example in the "coder" vignette isn't healthcare-related, instead focusing on a toy example involving patients' cars. Perhaps replace it with a vignette that works through a "real" example, and shows at least one result from it that would be similar to what your next typical step would be. Besides being more interesting, it serves as documentation for the output: if you created a histogram of Charlson indices, it brings the user's attention to the importance of that column.
You could add something from the result of your ex_people
and ex_icd10
, but that join has only a single positive result (one patient with peripheral vascular disease). Of course real healthcare data is necessarily private, but you could instead consider taking a small sample of rows and columns from the SynPUF data, which is synthetic emergency room data including admissions and diagnoses. After using coder to determine what diagnoses occurred after an emergency room visit, the vignette could get one or two results from the data (e.g. showing the average Charlson comorbidity index of emergency room admissions over time, creating a histogram of them, or showing the most common diagnoses within the window). This would help communicate why it's helpful to annotate a dataset in this way.
(You don't need to use a sample from SynPUF if you don't want to; you could also just construct the simulated data to have more joined diagnoses).
The README is solid, but jumps immediately into examples of simulated data, without discussing why someone might want to join diagnoses in the previous year. This doesn't have to take much text; it could be a short bulleted list of use cases like "Discovering adverse events after surgery"/"Determining comorbidities before clinical trials." It's likely that users already know their use case, but an example lets them recognize it and think "this package is for me!"
A second piece of advice with the README is to start with an example that doesn't include a date column before showing the one that does, to ease the user into the use of the categorize
function with relatively few arguments.
The comorbidities vignette and the JOSS paper are very well done in terms of giving the appropriate level of background and describing the use case.
If there are duplicate names in the data
passed to codify()
, it returns a data.table error that isn't informative toward fixing the problem. (categorize()
does catch this with "Non-unique ids!" but not codify()
).
people_doubled <- rbind(ex_people, ex_people)
codify(people_doubled, ex_icd10, id = "name", date = "event", days = c(-365, 0))
More importantly, don't there exist use cases for categorize
where there are multiple events for the same patient, with different dates? Examples could include adverse events after starting multiple lines of therapy, or comorbidities before multiple diagnoses. In those cases, doesn't it make sense to return one row for each event, even if there are multiple for a patient? Should the check only error out when there are duplicate name/date pairs?
I think the as.codedata()
approach can be improved to make the package more understandable and usable. Some issues:
- Most as.X functions in R return an object of class X, but this returns a data.table.
- The documentation for codify() describes the second argument as "output from as.codedata", but the function still works if given a data frame, data.table, or tibble.
- as.codedata() filters out dates in the future and dates before 1970. I assume this is meant to remove bad data, but isn't it better to leave such data quality filters to the user? As it is, the user must go through a few pages of documentation (codify/categorize -> as.codedata -> dates_within) to learn about this behavior. And in any case where there's a date window, the extreme date values won't affect the coding anyway.

It looks to me like the main reason for as.codedata() is to speed up the function by making it a data.table and setting keys. But you could do this within codify() as well; the only advantage this provides is if you run many codings with different ids/dates (or different arguments) while keeping the code data the same. I've done some benchmarking and it looks like the improvements become visible (in tens of milliseconds) when there are around a million coded events.
Do we expect it to be common for users to run the package with millions of coding events, where the codedata stays the same while the input events change, and in an environment where fractions of a second matter? Is this common enough to be worth imposing extra instructions on every user of the package?
My recommendation is to remove (or at least make internal) as.codedata, and instead do the preprocessing/checking of codedata within the codify function rather than suggesting that the user call as.codedata themselves. codify and classify, as well as your documentation examples, could then describe the codedata input as "a table with columns id, code, and optionally date." If the preprocessing speedup is worth keeping for some users, you could export a function like classcodes_prepare or prepare_classcodes that does the conversion to data.table and sets the indices, and describe that in the details section of the documentation. But I'd want to understand why that's a typical use case.

Relatedly, I recommend removing (or at least making internal) dates_within() and filter_dates(). Their purpose (applying a filter on dates with some defaults) has no relationship to the rest of the package, and is something the user can do themselves with tools they're accustomed to (base R, data.table, or dplyr).
The output of categorize()
on a table contains columns with spaces in their names. This isn't well set up for additional analysis, since it makes it difficult to do any kind of programming with them, including using data.table to filter for one diagnosis or to aggregate the percentage of patients (perhaps within each group) that have a condition. It's nice for displaying the names in a table, but is it a common use case to display individual patients in a table (as opposed to aggregated statistics)?
It seems like the tech_names
argument is designed to fix this, but it leaves prefixes like charlson_regex_
on every column name, which will need to be removed for meaningful downstream analysis. How about removing the charlson_regex_
, or at least the regex_
, in these cases? (Indeed, is there a reason that the charlson
classcodes object itself has to have the regex_
prefixes? It already has an attribute regexprs
that includes those column names). Besides which, perhaps consider leaving tech_names
to default to TRUE for the reasons described above.
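To make the point about spaced names concrete, here is a toy illustration (the column name is made up, not actual categorize() output):

```r
# A result-like table with a space in one column name
res <- data.frame(
  id = c("p1", "p2"),
  `peripheral vascular disease` = c(TRUE, FALSE),
  check.names = FALSE
)

# Downstream code must now quote the name everywhere:
mean(res[["peripheral vascular disease"]])  # 0.5

# With a syntactic name the same line would read, e.g., mean(res$pvd)
```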
Your examples like ex_people
are tibbles, but when categorize()
or codify()
is passed a tibble, it returns a data.table. This would be a surprising behavior for people using these packages within a tidyverse workflow. I think data.table is a terrific package, but there's not a reason to surprise users with the data type if they're not accustomed to it. (And the fact that the example datasets are tibbles rather than data.frames or data.tables adds to the inconsistency a bit).
I recommend ending the functions with something like
# Where data was the argument passed in, and ret is what's about to be returned
if (tibble::is_tibble(data)) {
ret <- tibble::as_tibble(ret)
}
This would mean that it returns a data.table when it's passed a data.frame or data.table, but a tibble if and only if it's passed a tibble. Admittedly, this requires adding an import for tibble (which perhaps is why it wasn't done), but since tibble is imported by 800 CRAN packages (including dplyr + ggplot2, each depended on by ~2000 packages) it's a fairly low-impact dependency. This also doesn't strike me as a utility package that will frequently be installed in production systems; it's a scientific package that would typically be used with other data analysis tools. I think there are some useful thoughts on tibble dependencies here.
Relatedly (though less important), the example datasets don't print as tibbles by default. If you follow the instructions in usethis::use_tibble()
, you could support printing it as a tibble even when the tibble/dplyr packages aren't loaded. The additional advantage of this is that you could get rid of most of the uses of head()
in the README, making your examples more concise and focused on your use case.
- index and especially visualize are very generic names for very specific functions, and don't give any hints about what they're used for. How about visualize_classcodes?
- Alternatively, you could prefix the exported functions: coder_classify, coder_categorize, coder_index, coder_codify, coder_visualize. This has both the advantage of ensuring it doesn't overlap with other packages and making it easy to find coder-related functions with autocomplete. But that's just a suggestion.

Thanks for your detailed review, @dgrtwo! @eribul, lots to digest in both of these! Please let us know if you have questions and update us with a summary of changes in your response.
Dear @noamross @dgrtwo and @annakrystalli, I am so impressed by your efforts and comments and I am really looking forward to improving the package accordingly! I am currently on vacation without computer access, however. I therefore hope that a slight delay would be acceptable? Thank you once more, I really enjoy this experience!
@eribul, yes, that's fine. We aim for a response to reviews within two weeks, even if all the changes involved are yet to be completed, but think of that timeframe as starting when you return. Enjoy your vacation!
Once again I must apologize for my late response! I have been back from vacation but had to put intense work into my PhD thesis, with a deadline from the print shop today. I am now on vacation once more, for an additional two weeks. I do have access to my computer, however, and I hope to be able to do at least some work :-)
I have organized the issues raised by @zabore here: https://github.com/eribul/coder/issues/89 and by @dgrtwo here: https://github.com/eribul/coder/issues/90. I have also fixed some of the smaller issues from @zabore.
It is my plan to go through all the issues and then to summarize my response here.
Again, sorry for the delay! This review is top-of-mind and I really look forward to summarizing my response. I just realized, however, that my upcoming PhD defense, combined with additional undertakings, will block most days of my calendar until at least 2020-09-18. I have been working on several issues, however, and the partial progress can be tracked via the links provided in the previous comment.
First of all, I would like to express my gratitude, both for the impressive work performed for the review, and for letting me take the time to address the raised concerns! I have now (finally) been able to (hopefully) make all the changes and improvements as requested. I will address each comment in a point-by-point manner below!
I have had to classify patients based on data that contained a variety of types of ICD codes in the past and it was incredibly cumbersome, so a package that automates this process for some standard use cases and with customization possible is a great contribution!
Thank you very much for your kind words and appreciation!
The statement of need could be fleshed out more. As a biostatistician working in the medical field, I was able to get a sense of what the package is meant to do since I'm familiar with ICD codes and co-morbidity indices such as Charlson. But some more details about the problem this solves, and who it is meant to help, could be usefully added. I believe that the package can take a dataset where patients have various codes, such as ICD codes, with associated dates, and it then uses regular expressions to identify specific codes of interest for creating, for example, the Charlson comorbidity index. Is that right? I understand that you are trying to be vague since the package could be used generally for any application, but I find that in not being specific about the medical application at all times it is harder to understand the core functionality. A key to clarity of purpose may lie in terminology, and clearly defining the different pieces of information this package can bring together.
A: I agree and have expanded this section of the README.
There are currently some errors printed in the output on the README. Those errors were not replicated when I ran the code locally. But the README needs to be updated without errors.
A: Thanks for noticing! This has now been fixed in later versions!
I would have preferred that the trivial example from the vignette coder actually be a simple example of classifying patients into groups based on ICD codes, rather than an unrelated example based on car makes and models. You could also show in this vignette how the ex_carbrands classcodes object was created. I found myself wondering that as I read the vignette.
A: I agree! The vignettes have been extensively rewritten and reorganized. Examples with car data have been removed. vignette("coder")
now includes examples based on the RxRisk V classification. Additional examples with Charlson and Elixhauser comorbidity are also included in other vignettes.
Under the "Interpretation" section, the listed names are mostly not in line with the actual output printed above. I assume you changed the contents of the example datasets and did not change the corresponding vignette text so this should be updated.
A: The vignette has been completely rewritten, so I hope this is no longer an issue.
What was the index based on in this example? i.e. how did the function categorize() know that the index should be a sum of the matched cars owned by each person? In the documentation for the function categorize it says that a value of NULL would lead to the use of all available indices, but what does that mean exactly? Available from where?
A: The vignette is rewritten and the documentation for the index
argument used by categorize()
has been updated:
Argument passed to index(). A character vector of names of columns with index weights from the corresponding classcodes object (as supplied by the cc argument). See attr(cc, "indices") for available options. Set to FALSE if no index should be calculated. If NULL, the default, all available indices (from attr(cc, "indices")) are provided.
In fact, the example on the README page is much more relevant and could easily be expanded to demonstrate the functionality in place of the car brand example. Maybe both are not needed.
A: I agree, and have done so!
I liked the output of the visualize() function, but seeing an example of an ICD code that would be identified by the given regular expression could be helpful to the less experienced reader. Presumably many users are familiar with ICD codes generally, but for the less experienced user this would be valuable.
A: I agree! I have introduced a new vignette("Interpret_regular_expressions")
where I have also presented a new summary.classcodes()
method as well as two functions codebook()
and codebooks()
. Using those, it is much easier to get an overview of all codes being recognized by the relevant regular expressions.
I found the classcodes vignette a little hard to follow and found myself wondering throughout, "when would I use this?" and wishing for a concrete example that was used throughout demonstrating: this is what I want to do, and here is how I use these functions to do it.
A: I have re-organized all vignettes and have tried to include some concrete use cases.
The link to the Elixhauser co-morbidity index is broken
A: Thank you! I have changed the link to a formal citation of the relevant paper.
In the Risk Rx V example you mention creating dataframe ex_atc in which patients could have zero, one, or several codes prescribed. However, the way you create the data each patient in fact has at least one code. Either alter the example to match (preferable to demonstrate how patients without codes are handled) or alter the text.
A: Good point! I have updated the example to only include 90 out of the 100 patients. This data set (as well as the other example data sets) is now presented in vignette("ex_data"). The data set ex_atc has also been included in the package. The code to generate the sample is therefore no longer presented (but is available in data-raw/example_data.R).
You mention surgery dates farther down in the vignette, but at the beginning only refer to a vague event date. This example would be improved by making it more specific. Who are the patients, what are the dates, what are we trying to do, and why?
A: I agree and have changed "event" to "surgery".
I noticed that when the default categorization was used, a nice message was printed in the R console stating what the classification was based on. This is a great use of messages. However, when a different classification was specified using cc_args this message was no longer printed. This could be standardized as I think the messaging is nice, and especially important when the user is changing it from some default.
A: You are correct that the message is only displayed if relying on the default setting. This was inspired by the join
-functions from the dplyr package, as described for the by
argument in the manual:
`If NULL, the default, *_join() will perform a natural join, using all variables in common across x and y. A message lists the variables so that you can check they're correct; suppress the message by supplying by explicitly.`.
This design pattern is also described and recommended in chapter 15 of the tidyverse design guide (https://design.tidyverse.org/def-inform.html).
I like this approach since I think the message is less needed if the user explicitly specified the setting. I have nevertheless borrowed the text above for the manual, to be more transparent about the underlying motivation for this behavior.
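As a sketch, the pattern boils down to a few lines (the function and default below are invented for illustration):

```r
# Message only when the user relied on the default; silent when explicit
classify_sketch <- function(x, coding = NULL) {
  if (is.null(coding)) {
    coding <- "icd10"  # hypothetical default
    message("Classification based on: ", coding)
  }
  coding  # (the real classification work would happen here)
}

classify_sketch(1:3)           # informs the user about the default
classify_sketch(1:3, "icd9")   # silent; the user chose explicitly
```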
Both the DESCRIPTION file and coder help file link only to the GitHub repo, even though this package does have a website. Make the link to the website easier to find since it's the best place for new users to start.
A: Thank you for noticing. This has been updated!
Good use of multiple points of entry. I found descriptions of the package through the help file, the README, the website.
A: Thank you :-)
The help file for classcodes mentions the "triad of case data, code data, and classification data." I found this to be a real "aha!" moment for understanding your package but this was the first place I saw this particular description. I think you could usefully focus more on this three-piece idea throughout the README, help files, and vignettes, with consistent terminology used throughout. Also, perhaps function and argument names could be altered/unified to reflect the trinity of information sources. I didn't find the names particularly intuitive as they are and was constantly having to reference them again when I was testing the functions.
A: Thanks for the suggestion! This is now (I hope) better explained in the package README, DESCRIPTION, main vignette and in the manual page for categorize()
.
I'd like to see examples that utilize every option available in each function, rather than only simple examples. I often find myself wanting to use a non-default option in a function and not finding any example that helps me understand how to specify the option. Having simple examples is essential, but also including a more complex example that utilizes more than just the basic function arguments will really help users with more complex problems or more advanced users. For example, the example given in the classcodes help file isn't very useful. The object you pass to as.classcodes() in the example is already a classcodes object. How would I use the argument hierarchy? I don't know because you haven't given an example.
A: I agree and have added examples which now should cover the use of all arguments etc.
Many Roxygen comments went beyond my 80 character line, making readability difficult. You could use carriage returns to improve readability of comments.
A: I agree and have changed accordingly.
Code is formatted well, and in line with tidy style guidelines.
A: Thank you very much!
First, the documentation for filter_dates should note that there is a default lower date limit. I had to go to the documentation of dates_within before I could understand the default behavior of filter_dates. Second, why is the default lower limit set to 1970-01-01? Does that have something to do with the coding schemes available within the package?
A: Those functions have been removed due to suggestions from @drgtwo (see below).
No example was included in the help file for set_classcodes. No example was included in the help file for as.keyvalue.classcodes.
A: All examples have been rewritten (see comment above). They should now cover all functions and arguments.
I see that codify throws an error if the values specified in the date argument aren't of class "Date". However I didn't see anywhere in the documentation information for the user specifying that dates need to be formatted specially. Typically when I read clinical data into R I end up with date columns that are formatted as character and so I would need to convert them to use these functions, so it might be worthwhile to include this detail somewhere.
A: Good point! This has now been specified!
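A minimal illustration of the conversion in question, using base R only (the date strings are made up):

```r
# Character dates, as typically read from a clinical CSV export
admission <- c("2020-01-15", "2020-03-02")

# ISO-formatted strings convert directly
as.Date(admission)

# Other formats need an explicit format string
as.Date("15/01/2020", format = "%d/%m/%Y")
```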
There's an issue with the test-as.keyvalue file. It is currently saved as "test-as.keyvalueR" rather than "test-as.keyvalue.R". Also, this test file has an error. I believe instead of a <- as.keyvalue.classcodes(elixhauser, "icd10se") you should have a <- decoder::as.keyvalue(elixhauser, "icd10se"). And similarly within the test.
A: Thank you for noticing! I have changed the file name and the code!
There are at least two tests named "multiplication works" that are unrelated to multiplication: test-all_classcodes, test-as.keyvalue
A: Thank you! I have now removed all context() lines from the test files in accordance with the latest version of the testthat package.
The use of 365 days to indicate a year is not exact. Depending on the year, some dates could be missed due to leap years. I see that using windows in units of days is an easy coding solution here, but perhaps some more complex options for when time windows are specified in months or years, rather than days, could be supplied as well.
A: I agree in theory, although I think this is a rather complex issue. Rules for leap years differ depending on whether the year is divisible by 4, 100 or 400, for example. Adding leap seconds makes it even more difficult. Week numbers are also treated differently in different parts of the world (a year might have 53 weeks in some countries but only 52 in others). Further complexity comes from time zones (including some regions with 30-minute offsets, daylight saving time, etc.). The length of the year also differs between religions (although the Christian/Gregorian calendar might of course be the norm :-). Altogether, I think more complicated time windows are better handled by dedicated date packages, which could be used explicitly by the user. A better approximation of the number of days in a year might be 365.241 (as advocated by some), but in this setting it does not really matter, since time points were only recorded as dates (not time of day). I have therefore left this behavior as is.
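To illustrate the point with base R only (a sketch, not package functionality): subtracting a fixed 365 days is not always the same as going back one calendar year.

```r
d <- as.Date("2020-06-15")

# Fixed 365-day window: misses 2019-06-15, because the interval
# spans the leap day 2020-02-29 (366 days lie in between)
d - 365                                    # "2019-06-16"

# Calendar-based window via base R's seq.Date()
seq(d, by = "-1 year", length.out = 2)[2]  # "2019-06-15"
```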
I thought it was strange that codify did not return the same variable name for the date of interest as specified in data. Using your example data, ex_people has a date called event but when I combine it with ex_icd10 using codify the resulting data frame has a date variable named date rather than event. I would prefer to maintain the original variable name. This behavior does not occur in other functions; for example, categorize returns the original date variable name even when arguments are passed to codify.
A: I agree! This was a bug which is now fixed!
I expected categorize to be able to take the results of a call to codify as input. This could be a useful alternative specification option, and is being done internally anyhow.
A: This is a good point, so categorize has now been made an S3 generic with a dedicated method for objects of class codified (a new class introduced for the output from codify()).
I didn't have a large dataset of the type that might be used in this setting to realistically test the claim that the performance is fast.
A: Thank you nevertheless for the thorough review regarding all other aspects :-)
I got the following notes about example time and file size when I ran check on my local Windows machine:
installed size is 37.0Mb
sub-directories of 1Mb or more:
Meta 8.0Mb
R 3.0Mb
data 3.0Mb
doc 10.0Mb
help 5.0Mb
html 2.0Mb
Examples with CPU (user + system) or elapsed time > 5s
user system elapsed
codebooks 1.39 2.53 57.08
codebook 1.39 2.41 59.13
categorize 0.23 1.05 22.28
classify 0.13 0.28 7.01
A: I was not able to reproduce this, neither locally, nor with the GitHub Actions set up for Windows, Mac or Ubuntu. When I check the file sizes of the whole repo (excluding Git-related files) I have approximately 6 MB in total. The R folder, for example, is 94 KB. When I build the package locally I get a 349 KB source archive. I therefore leave this as is for now.
coder takes a set of patients, links them to diagnosis codes, and links those to healthcare indices. From my brief experience in healthcare I know this is an important use case with many applications. While other packages (which the author cites perfectly in the JOSS manuscript) include tools for calculating such indices, the ability to match many diagnosis codes at once, and to calculate multiple indices in the same way, make the coder package an important contribution to the field. Great job!

The package follows the rOpenSci guidelines, is thoroughly tested and has chosen a high-performance approach. I've installed and checked it, and outside of the one bug described below (duplicate names passed to codify()) I found it to work in my use cases. I have some recommendations for specific changes below, but of course I'm open to dialogue.
A: Thank you very much! I really appreciate your thorough review and all the time and energy invested in it!
One thing that's missing from the documentation (for functions, the README, and the vignettes) is an example of what someone would do with the output of the package. The example in the "coder" vignette isn't healthcare-related, instead focusing on a toy example involving patients' cars. Perhaps replace it with a vignette that works through a "real" example and shows at least one result from it that would be similar to your typical next step. Besides being more interesting, it serves as documentation for the output: if you created a histogram of Charlson indices, it brings the user's attention to the importance of that column.
A: I agree! I have removed the toy example with car brands and have tried to include more realistic examples in the README and vignettes. Some histograms (or at least bar plots) with Charlson (and Rx Risk V) indices are also provided.
You could add something from the result of your ex_people and ex_icd10, but that join has only a single positive result (one patient with peripheral vascular disease). Of course real healthcare data is necessarily private, but you could instead consider taking a small sample of rows and columns from the SynPUF data, which is synthetic emergency room data including admissions and diagnoses. After using coder to determine what diagnoses occurred after an emergency room visit, the vignette could get one or two results from the data (e.g. showing the average Charlson comorbidity index of emergency room admissions over time, creating a histogram of them, or showing the most common diagnoses within the window). This would help communicate why it's helpful to annotate a dataset in this way. (You don't need to use a sample from SynPUF if you don't want to; you could also just construct the simulated data to have more joined diagnoses.)
A: Thank you for the really good suggestions! Unfortunately, I was not able to include the SynPUF data, but I did find a previously exported data set, icd.data::uranium_pathology, which I have now adapted as part of the internal example data sets ex_icd10 and ex_people. I think those data sets should be more realistic.
The README is solid, but jumps immediately into examples of simulated data without discussing why someone might want to join diagnoses in the previous year. This doesn't have to take much text; it could be a short bulleted list of use cases like "Discovering adverse events after surgery"/"Determining comorbidities before clinical trials." It's likely that users already know their use case, but an example lets them recognize it and think "this package is for me!"
A: I agree and have substantially expanded the README. The suggested sentences are now also included under the "Typical use case" section.
A second piece of advice for the README is to start with an example that doesn't include a date column before showing the one that does, to ease the user into the use of the categorize function with relatively few arguments.
A: Thanks for the suggestion, which has now been included in the README!
The comorbidities vignette and the JOSS paper are very well done in terms of giving the appropriate level of background and describing the use case.
A: Thank you very much! I appreciate it!
If there are duplicate names in the data passed to codify(), it returns a data.table error that isn't informative toward fixing the problem. (categorize() does catch this with "Non-unique ids!", but codify() does not.)
```r
people_doubled <- rbind(ex_people, ex_people)
codify(people_doubled, ex_icd10, id = "name", date = "event", days = c(-365, 0))
```
A: Thank you for noticing! The message should now be the same as for categorize().
More importantly, don't there exist use cases for categorize where there are multiple events for the same patient, with different dates? Examples could include adverse events after starting multiple lines of therapy, or comorbidities before multiple diagnoses. In those cases, doesn't it make sense to return one row for each event, even if there are multiple for a patient? Should the check only error out when there are duplicate name/date pairs?
A: I agree that such a feature is relevant. The problem, however, is that unit data is matched to code data based on the index variable, and I cannot perform such matching based on the date column (which would be a non-equi join, as allowed in some data.table operations but not in merge, which is currently used). Although this would be possible after some refactoring of internal functions, I think it is currently better to perform such operations using standard functionality outside the package, such as x %>% group_by(y) %>% codify(...) with dplyr, or x[, codify(...), by = y] with data.table.
I think the as.codedata() approach can be improved to make the package more understandable and usable. Some issues:

- By convention, as.X functions in R return an object of class X, but this returns a data.table.
- codify() describes the second argument as "output from as.codedata", but the function still works if given a data frame, data.table, or tibble.
- By default, as.codedata() filters out dates in the future and dates before 1970. I assume this is meant to remove bad data, but isn't it better to leave such data-quality filters to the user? As it is, the user must go through a few pages of documentation (codify/categorize -> as.codedata -> dates_within) to learn about this behavior. And in any case where there's a date window, the extreme date values won't affect the coding anyway.
- It looks to me like the main reason for as.codedata() is to speed up the function by making it a data.table and setting keys. But you could do this within codify() as well; the only advantage this provides is if you run many codings with different ids/dates (or different arguments) while keeping the code data the same. I've done some benchmarking and it looks like the improvements become visible (in tens of milliseconds) when there are around a million coded events. Do we expect it to be common for users to run the package with millions of coding events, where the codedata stays the same while the input events change, and in an environment where fractions of a second matter? Is this common enough to be worth imposing extra instructions on every user of the package?
My recommendation is:

- Don't export as.codedata; do the preprocessing/checking of the codedata within the codify function instead of suggesting that the user use as.codedata.
A: I agree and have now integrated as.codedata() into codify() as suggested!
- In the documentation for codify and classify, as well as your documentation examples, describe the codedata input as "a table with columns id, code, and optionally date."
A: The documentation has been improved. I have dropped the requirement that the codedata must have columns named id, code and code_date. Instead, the names of the corresponding columns should now be specified as arguments in the relevant functions.
- If you're very confident that performance is important when keeping the codedata the same and trying many different datasets, you could add a function called classcodes_prepare or prepare_classcodes that does the conversion to data.table and sets the indices, and describe that in the details section of the documentation. But I'd want to understand why that's a typical use case.
A: See previous comment (as.codedata() removed in favor of a re-factored codify()).
Relatedly, I recommend removing (or at least making internal) dates_within() and filter_dates(). Their purpose (applying a filter on dates with some defaults) has no relationship to the rest of the package, and is something the user can do themselves with tools they're accustomed to (base R, data.table, or dplyr).
A: I agree and have removed those functions. They were previously included because I was unaware of data.table::IDate and that data.table::between is optimized for date comparisons. The problem with normal date comparisons is that they can be really slow for big data.
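A small sketch of what this buys (assumes only the data.table package; made-up dates): IDate stores dates as plain integers, and between() performs a vectorized range check.

```r
library(data.table)

# IDate stores dates as integers, making comparisons cheap on large vectors
x <- as.IDate(c("2020-01-01", "2020-06-01", "2021-01-01"))

# between() is an optimized, vectorized range check
between(x, as.IDate("2020-01-01"), as.IDate("2020-12-31"))
# TRUE TRUE FALSE
```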
regex_ in column names when tech_names = TRUE
The output of categorize() on a table returns columns with spaces in their names. This isn't well set up for additional analysis, since it makes it difficult to do any kind of programming with them, including using data.table to filter for one diagnosis or to aggregate the percentage of patients (perhaps within each group) that have a condition. It's nice for displaying the names in a table, but is it a common use case to display individual patients in a table (as opposed to aggregated statistics)?

It seems like the tech_names argument is designed to fix this, but it leaves prefixes like charlson_regex_ on every column name, which will need to be removed for meaningful downstream analysis. How about removing the charlson_regex_, or at least the regex_, in these cases? (Indeed, is there a reason that the charlson classcodes object itself has to have the regex_ prefixes? It already has an attribute regexprs that includes those column names.) Besides which, perhaps consider letting tech_names default to TRUE for the reasons described above.
A: Good point! I have made several changes:
- classcodes objects no longer have column prefixes (reg|ind)ex_.
- I have introduced a new print.classcodes() method for a better default display of classcodes, where regular expressions and indices are identified by a heading instead of column-name prefixes.
- categorize() has a new argument check.names (same as for data.frame/data.table). This argument is TRUE by default, making the column names syntactically valid (using dots instead of spaces). The original names (possibly with spaces) are retrieved with check.names = FALSE, which might sometimes be useful.
The reason for the long names implied by tech_names is that categorize is sometimes used multiple times, for example to enhance a data set with both comorbidities and adverse events. Grouping such variable names by common and descriptive prefixes can then be useful (due to tab completion etc.). But the (reg|ind)ex_ part is no longer included, and the need for those longer names has probably decreased with the check.names argument.
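For reference, the name repair performed by check.names = TRUE corresponds to base R's make.names() (shown here on made-up condition labels):

```r
# Column names with spaces become syntactically valid names with dots,
# just as data.frame(check.names = TRUE) would repair them
make.names(c("peripheral vascular disease", "myocardial infarction"))
# "peripheral.vascular.disease" "myocardial.infarction"
```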
tibbles and data.tables
Your examples like ex_people are tibbles, but when categorize() or codify() is passed a tibble, it returns a data.table. This would be a surprising behavior for people using these packages within a tidyverse workflow. I think data.table is a terrific package, but there's no reason to surprise users with the data type if they're not accustomed to it. (And the fact that the example datasets are tibbles rather than data.frames or data.tables adds to the inconsistency a bit.) I recommend ending the functions with something like
```r
# Where data was the argument passed in, and ret is what's about to be returned
if (tibble::is_tibble(data)) {
  ret <- tibble::as_tibble(ret)
}
```
This would mean that it returns a data.table when it's passed a data.frame or data.table, but a tibble if and only if it's passed a tibble. Admittedly, this requires adding an import for tibble (which perhaps is why it wasn't done), but since tibble is imported by 800 CRAN packages (including dplyr and ggplot2, each depended on by ~2000 packages) it's a fairly low-impact dependency. This also doesn't strike me as a utility package that will frequently be installed in production systems; it's a scientific package that would typically be used with other data analysis tools. I think there are some useful thoughts on tibble dependencies here.
A: I agree and have made the following changes:
- included tibble as a dependency
- all_classcodes() now returns a tibble.
- categorize() has been re-factored into an S3 generic and returns data sets of the same class as the input (data.table/data.frame/tibble).
- as.classcodes() returns a tibble with an additional class attribute.
codify() is often used to return large data sets (several million rows) which are then used in a following step by classify(). I therefore think that the data.table format is preferred here. I have re-factored the function into S3 methods, however, to treat data.frame/data.table/tibble input in a better way. I have also implemented a print.codified() method, which prints the first n rows as a tibble (possible to override with print(..., n = NULL), which will print the object as is [a data.table]). I think this might be a good compromise for most users. I have also clearly stated, as part of the output from the print method, that the preview is simply a preview.
classify() returns a matrix for efficiency, since this object is always logical/Boolean. I think this should be kept as is. The two methods as.data.frame.classified() and as.data.table.classified() should make it relatively simple to convert the output if desired (including to tibbles, which inherit from data.frame).
summary.classcodes() returns a tibble, which is also printed as such through print.summary.classcodes().
Relatedly (though less important), the example datasets don't print as tibbles by default. If you follow the instructions in usethis::use_tibble(), you could support printing them as tibbles even when the tibble/dplyr packages aren't loaded. The additional advantage of this is that you could get rid of most of the uses of head() in the README, making your examples more concise and focused on your use case.
A: I agree and have changed accordingly!
Naming
index and especially visualize are very generic names for very specific functions, and don't give any hints about what they're used for. How about visualize_classcodes?
A: The name visualize was actually a half-baked S3 method for generics::visualize() with the aim to "Visualize a data set or object". It has now been changed into an actual S3 method to fit with this generic.
I am less confident that the index() function is that specific, since its meaning will differ between classcodes objects. The terminology is borrowed from the typical use case of calculating a comorbidity index ("Charlson comorbidity index" etc.), which was the initial motivation for developing the package. Similar verbs would be for example "aggregate", "count" or "summarize", but those would be both uninformative and collide with other common functions. index(), on the other hand, is not used in any widely used package as far as I have seen. I guess calculate_index() or similar would work as well, but it just seems longer without any convincing reason :-)
- An alternative for function naming is to have a common prefix for functions, e.g., coder_classify, coder_categorize, coder_index, coder_codify, coder_visualize. This has the double advantage of ensuring the names don't overlap with other packages and of making it easy to find coder-related functions with autocomplete. But that's just a suggestion.
A: Thank you for the suggestion! I agree that this convention can be useful. I do think it is almost as simple to use autocomplete with the coder:: prefix instead of coder_, however. The difference is just one additional keystroke :-) and I would prefer shorter names without the redundancy of coder::coder_xxx.
This naming convention is also discussed in the Tidyverse design guide, section 5.2 (https://design.tidyverse.org/function-names.html#function-families) with the following paragraph:
Not sure about common prefixes for a package. Works well for stringr (esp. with stringi), forcats, xml2, and rvest. But thereβs only a limited number of short prefixes and I think it would break down if every package did it.
I agree with Noam that coder isn't an ideal package name, if only because it makes the online resources a bit harder for users to find. Try Googling "coder", "R coder", or "R coder github"! But if it's too late to change the name, I don't consider it a dealbreaker.
A: Thank you for the suggestion. I do understand the possible downsides of the name and I partially agree. On the other hand, try Googling the "sister" package "CRAN decoder", which finds the decoder package easily. Consider also some popular RStudio packages such as generics, baguette, glue, haven, parsnip, tune, recipes, dials, workflows and yardstick, none of which are very Googleable on their own. I thus hope that a later CRAN release will make Googling simpler :-)
Thanks for your response, @eribul. @zabore and @dgrtwo, could you please look at @eribul's changes and let us know if they address your concerns? Thank you!
@eribul You have clearly done a lot of work on this package in response to the reviews - well done! The changes you have made to the README, vignettes, and documentation in particular have really improved the package. I am satisfied with your responses to all of my comments, and don't have any further comments.
@dgrtwo, please take a look at @eribul's changes and let us know if they address your comments. Thanks!
Thanks for your very hard work on an already strong package!
I'm especially impressed by the README and vignettes. Both now include several practical examples, and I think the visualizations and "stories" around them make it really clear how the package can be used. I also appreciate that you agreed with and implemented many of my suggestions (as.codedata, returning a tibble, etc). I'm certainly satisfied with the package being added to ROpenSci.
A few remaining thoughts:
- You may find devtools::spell_check() useful.

And a few that don't require any action or response:
- codify returning a data.table: if I were implementing it, I think I would instead have had the following step, classify, turn the table into a data.table (I think the table-to-data.table transition would be fast even on a large table), but your compromise is still entirely reasonable.

Approved! Thanks @eribul for submitting and @zabore and @dgrtwo for your reviews!
To-dos:
pkgdown website and are ok relying only on rOpenSci central docs building and branding,
pkgdown website with a redirecting page https://docs.ropensci.org/package_name
URL field alongside the link to the GitHub repository, e.g.: URL: https://docs.ropensci.org/foobar (website) https://github.com/ropensci/foobar
[![AppVeyor Build Status](https://ci.appveyor.com/api/projects/status/github/ropensci/pkgname?branch=master&svg=true)](https://ci.appveyor.com/project/individualaccount/pkgname)
. If Appveyor does not pick up new commits after transfer, you might need to delete and re-create the Appveyor project. (Repo transfers are smoother with Travis CI and GitHub Actions.)

To pass this package to JOSS:
This package has been reviewed by rOpenSci: https://LINK.TO/THE/REVIEW/ISSUE, @ropensci/editors
Should you want to acknowledge your reviewers in your package DESCRIPTION, you can do so by making them "rev"-type contributors in the Authors@R field (with their consent). More info on this here.
We've put together an online book with our best practices and tips; this chapter starts the third section, which is about guidance for maintaining your package after peer review.
Welcome aboard! We'd love to host a post about your package - either a short introduction to it with an example for a technical audience or a longer post with some narrative about its development or something you learned, and an example of its use for a broader readership. If you are interested, consult the blog guide, and tag @stefaniebutland in your reply. She will get in touch about timing and can answer any questions. I think your package would make for an interesting post - rOpenSci has a lot of people who think about metadata, ontologies, and linked data, but for historical reasons they've been based in environmental and ecological fields. It would be good to get some perspective and vocabulary from a clinical medicine perspective.
Thank you very much all of you @noamross, @dgrtwo and @zabore ! I really appreciate the feedback and your positive decision :-) I will do my best to complete the task list!
It seems I am not able to check the boxes above, so I have copied the list:
If you already had a pkgdown website and are ok relying only on rOpenSci central docs building and branding
pkgdown website with a redirecting page https://docs.ropensci.org/package_name
URL field alongside the link to the GitHub repository, e.g.: URL: https://docs.ropensci.org/foobar (website) https://github.com/ropensci/foobar
[x] Fix any links in badges for CI and coverage to point to the ropensci URL. We no longer transfer Appveyor projects to the ropensci Appveyor account, so after transfer of your repo to rOpenSci's "ropensci" GitHub organization the badge should be [![AppVeyor Build Status](https://ci.appveyor.com/api/projects/status/github/ropensci/pkgname?branch=master&svg=true)](https://ci.appveyor.com/project/individualaccount/pkgname). If Appveyor does not pick up new commits after transfer, you might need to delete and re-create the Appveyor project. (Repo transfers are smoother with Travis CI and GitHub Actions.)
To pass this package to JOSS:
This package has been reviewed by rOpenSci: https://LINK.TO/THE/REVIEW/ISSUE, @ropensci/editors
Should you want to acknowledge your reviewers in your package DESCRIPTION, you can do so by making them "rev"-type contributors in the Authors@R field (with their consent). More info on this here.
@zabore and @dgrtwo I would very much like to acknowledge your work as reviewers with the rev role in the DESCRIPTION file:
person("Emily C.", "Zabor", role = "rev",
       comment = "Reviewed the package (v. 0.12.1) for rOpenSci, see <https://github.com/ropensci/software-review/issues/381>"),
person("David", "Robinson", role = "rev",
       comment = "Reviewed the package (v. 0.12.1) for rOpenSci, see <https://github.com/ropensci/software-review/issues/381>"),
Do I have your permission to do so? :-)
@noamross I have transferred the repo to ropensci but may have missed selecting the "coder" team. It seems I do not have access to the usual "settings" for the repo, so I cannot "replace [my] package docs URL with https://docs.ropensci.org/package_name". Is it possible to get the relevant access in hindsight?
Dear @stefaniebutland, I would be happy to contribute with a blog post describing this package, if you might find it relevant :-)
@eribul Happy to be included as a reviewer role. And congratulations on this package publication!
@eribul Congratulations on coder passing peer review! Would love to have a post. We have openings for publication in late January. Please suggest a date for submitting a draft via pull request and I'll provide a publication date. https://blogguide.ropensci.org/ gives content and technical guidelines.
@eribul You should have admin access now.
@eribul in your post, you should talk about where these codes/categorizations come from, what they mean and how they are used, for a general open-science audience that may not be familiar with medical/clinical data, and how this package solves a problem researchers have with using them.
@zabore and @dgrtwo I would very much like to acknowledge your work as reviewers by the rev-role in the description file:
Sure!
@stefaniebutland Nice! That sounds like a great outline :-) What about 2021-01-25?
sound like a great outline
Noam's idea :-)
Draft by 2021-01-25 is great. My colleague @steffilazerte will review it.
Submitting Author: Erik Bülow (@eribul)
Repository: https://github.com/eribul/coder
Version submitted: 0.11.9
Editor: @noamross
Reviewer 1: @zabore
Reviewer 2: @dgrtwo
Archive: TBD
Version accepted: TBD
Scope
Please indicate which category or categories from our package fit policies this package falls under: (Please check an appropriate box below. If you are unsure, we suggest you make a pre-submission inquiry.):
Explain how and why the package falls under these categories (briefly, 1-2 sentences):
The aim of the package is to categorize items/individuals/patients from a large data set, with the help of external code data, using a dynamic system of classification schemes based on regular expressions. There are no analytical/statistical methods included.
The primary user group might be medical/epidemiological researchers classifying patients by medical scores such as comorbidity indices or adverse events. The package might, however, be used more broadly due to its generic design.
There are some R packages with a narrow focus on Charlson and Elixhauser comorbidity based on ICD-9/10 codes (icd, comorbidity, medicalrisk, comorbidities.icd10 and icdcoder). icd and comorbidity are both good for their purpose. medicalrisk can be used with ICD-9-CM codes but is not up to date with the latest version of ICD-10. comorbidities.icd10 and icdcoder are not actively developed or maintained. The coder package provides greater flexibility for combining different sets of codes (ICD-8, ICD-9, ICD-9-CM, ICD-10 etc., as given by regular expressions either included in the package by default or provided by the user). All the other packages rely on pre-specified (hard-coded) classifications, and none of them includes any classification other than Charlson/Elixhauser (such as the RxRisk V classification based on medical ATC codes, which is also included in the coder package by default).

Technical checks
Confirm each of the following by checking the box.
This package:
Publication options
JOSS Options
- [x] The package has an **obvious research application** according to [JOSS's definition](https://joss.readthedocs.io/en/latest/submitting.html#submission-requirements).
- [x] The package contains a `paper.md` matching [JOSS's requirements](https://joss.readthedocs.io/en/latest/submitting.html#what-should-my-paper-contain) with a high-level description in the package root or in `inst/`.
- [ ] The package is deposited in a long-term repository with the DOI:
- (*Do not submit your package separately to JOSS*)

MEE Options
- [ ] The package is novel and will be of interest to the broad readership of the journal.
- [ ] The manuscript describing the package is no longer than 3000 words.
- [ ] You intend to archive the code for the package in a long-term repository which meets the requirements of the journal (see [MEE's Policy on Publishing Code](http://besjournals.onlinelibrary.wiley.com/hub/journal/10.1111/(ISSN)2041-210X/journal-resources/policy-on-publishing-code.html))
- (*Scope: Do consider MEE's [Aims and Scope](http://besjournals.onlinelibrary.wiley.com/hub/journal/10.1111/(ISSN)2041-210X/aims-and-scope/read-full-aims-and-scope.html) for your manuscript. We make no guarantee that your manuscript will be within MEE scope.*)
- (*Although not required, we strongly recommend having a full manuscript prepared when you submit here.*)
- (*Please do not submit your package separately to Methods in Ecology and Evolution*)

Code of conduct