ropensci-archive / cleanEHR

:warning: ARCHIVED :warning: Essential tools and utility functions to facilitate the data processing pipeline, data cleaning and data analysing of clinical data from CC-HIC
GNU General Public License v3.0
54 stars 23 forks source link

JOSS review #139

Closed sinanshi closed 6 years ago

sinanshi commented 7 years ago

Review Comments

This package allows users to extract and clean a very unique piece of clinical data for the CC-HIC datasets. Since the original dataset contains multiple complicated forms of data, it could be a very complicated process to start everything from scratch. I can imagine this package can be very helpful for many useRs in that system.

I do have a few comments after trying out this package.

Trop compliqué

sinanshi commented 7 years ago

Aammd

This is an intricate and impressive package! It involves data.table, yaml, S4 classes and XML. ; it is the All Dressed chip of data accessing packages! Unfortunately, I am not much of an expert in data.table, nor in the creation and documentation of S4 classes. I have mostly reviewed for usability, documentation and some coding best practices Package Review

*At the risk of seeming like a meat-based goodpractice, I wanted to echo two comments made above:

warnings and automatic checks

spelling errors:

WORD FOUND IN anonymisation description:5 charcter data.quality.report.Rd:15 chuncks update_database.Rd:15 conf ccTable-class.Rd:39,45 datam extractIndexTable.Rd:12 datat extractIndexTable.Rd:11 demg demg.distribution.Rd:5,16 demograhic lenstay.Rd:10 dfilter getfilter.Rd:5,15 doesn selectTable.Rd:16 lenght lenstay.Rd:5,20 organised new.episode.Rd:21 structre ccTable-class.Rd:10 Subseting sub-sub-ccRecord-method.Rd:6,16 yaml ccTable-class.Rd:54 YAML ccTable-class.Rd:19, create.cctable.Rd:14, create2dclean.Rd:12,23

R CMD CHECK returned this warning:

Undocumented data sets: ‘icnarc_table’ All user-level objects in a package should have documentation entries.

sinanshi commented 7 years ago
sinanshi commented 7 years ago

JOSS Review

Dear reviewers,

Thank you so much for your very helpful comments, and we are very sorry to be spending such a long to time to address these comments. We have made several re-structuring of the code, inspired by your comments.

To Reviewer 1 @haozhu223

I personally think this package agrees with rOpenSci's fit policy and is nice to be listed. However, I also want to note that maybe rOpenHealth fits this topic better?

We are not clear about the nature of rOpenHealth, since we didn’t find too much information online. Is it a subsidiary of rOpenSci? Can we be listed in both organisation? We would love to explore the possibilities here.

I do have a few comments after trying out this package. As @noamross mentioned, regarding the package name, I would recommend changing the name to cchicr or else instead of cleanEHR due to the fact that this is only one particular form of electronic health record.

The critical care data (CC-HIC) is one of the themes of National Institute for Health Research (NIHR) Health Informatics collaborative (HIC). It has other themes which covers a large area of health care topics. https://www.nihr.ac.uk/about-us/how-we-are-managed/our-structure/infrastructure/health-informatics-collaborative.htm cleanEHR was designed as an extendable data cleaning tool. CCHIC is the first exemplar for it and we planed to extend cleanEHR functionality to other themes and link with other data sources.

rOpenSci packaging guide recommends using xml2 instead of XML for parsing XML files. It seems to me that most of the XML-related codes in this package only involves parsing. As a result, I would recommend using xml2 instead.

Thank you for the suggestion and you have the point. We do believe xml2 will be a better fit to the XML parser. However, the XML parser is the most complicated part of the code. The XML format we receive from hospitals do not follow strictly the standard. As a result, there are many corner cases that we fixed throughout the years of development. The XML parser is rather like a plugin which serves only with CCHIC dataset. Concerning the complexity of re-writing it with xml2, will it be OK if we leave the xml part this time, and redo them when the new xml standard is introduced in the coming years?

This package is frequently using a very OO-ish programming style to define objects and write functions. For example, after a ccTable being created, you can use cctable$export.csv("file.csv") to write contents to a csv file (https://github.com/CC-HIC/cleanEHR/blob/master/R/ccTable.R#L206-L213). To me, due to my lack of understanding of the full scenario, it seems sort of like forcibly writing python in R and is a little redundant. (I'm not saying it's a bad practice. My concern is mainly about the lack of documentations for these methods as I said below)

Thanks for pointing it out. ccTable is a Refclass which is served as the data cleaning and manipulation platform. Since the tables it held are usually very big, we intented to use refclass to avoid multiple copying. As we have adopted the OO way, we would like to keep all the ccTable related function, e.g. export.csv to be consistent. More documentation has been made, I hope it is clearer this time.

Also, it would be nice if we can see more documentations for these self-defined methods or at least a table of available methods with brief explanation.

Thanks, we have added more documentation for the ccTable member methods, which are desigated as ccTable_methods.

I would like to see more documentation about pipeline. I feel like the current documentation in R/pipeline.R and inst/pipeline isn't quite enough. It would be nice to see a README in inst/pipeline.

As we mentioned above, the pipeline is removed from the package.

In the package vignette tour.Rmd, we see tb <- create.cctable(ccd, yaml.load(conf), freq=1). However, freq is actually the 2nd argument in that function. It might mislead some users in the future.

Amended, thanks.

I feel like the author can put more instructions in the package vignette.

New vignette has been made, thanks.

Final point, I'm not sure if it's a good idea but I feel like building a shiny app to deal with these tasks will be very helpful, especially to help users to understand what kinds of functionalities are available in this package.

It is really a good point! We have the plan of making a website using shiny for data displaying and cleanEHR functionalities. It will be done as a separate project.

To Reviewer2

The authors might consider using a prefix for every function in the package. Many successful packages use this strategy, which takes advantage of text editors which use tab completion. Here, perhaps functions which use a certain class of object could be prefixed with that class name, ccd_episode_plot instead of episode.graph etc.

Thank you for your suggestion. We recognise that the naming is a huge problem. As the package started with a lot of experimental elements, naming has not been paid with too much attention. We amended the package to adopt the naming style using prefix where applicable. In particular, the episode.graph() function has been moved to the plot() method for ccEpisode.

Some names and terminology have the potential to be confusing to users. For example sql.demographic.table has nothing to do with sql per se, but rather arranges tables such that SQL import would be possible.

sql.demographic.table has been renamed as ccd_demographic_table.

Some names are probably only confusing for the programmers. there is a mix of cases used in function names (sql.demographic.table(ccd), extractIndexTable, unique_spell)

Amended accordingly.

it is better to use TRUE and FALSE, in place of T and F.

Modified, thanks.

it is also safer to use vapply in place of sapply or unlist(lapply).

We didn’t manage to switch for all the cases, but we changed a couple of them.

I wonder if some of the functions could be split apart. For example, episode.graph creates both a plot and a rearrangement of the data. Perhaps if the plotting code were moved to a plot() method for the ccRecord class

This is a great idea and we moved it to plot() for ccEpisode class now.

I note that all of Rcpp is listed as an Import, but only the function evalCpp was imported directly. Is loading the entire package necessary?

Amended accordingly. Thanks.

The Vignette contains an example of cleaning data, based on instructions in a YAML code block. create.cctable(ccd, yaml.load(conf), freq=1). I find using the yaml.load function here to be a bit awkward, and difficult for users who are unfamiliar with yaml. It might be easiest to encourage users to write a short file, conf.yaml, and to direct create.cctable to this filename, rather than to an R object.

In fact, the function can accept both list and the path of a YAML configuration file as a input. You are right, it might be confusing to use yaml.load in the vignette. In the new version of the vignette, we demonstrated it without using yaml.load.

I feel like pander and ggplot should be listed in Suggests, rather than Imports. Then users could use the basic functionality of this package without necessarily installing these packages. This might also entail separating where these functions are used from the data-cleaning that sometimes precedes them.

We actually feel the data quality reporting and visualisation, where pander and ggplot are used, are crucial to the data cleaning process. For this reason, we feel it is perhaps necessary to make these functions loaded from the beginning. Please let me know if you feel otherwise. We can discuss about this.

Some documentation, e.g. in ccRecord, refers to S3 classes when in fact these are S4 classes.

That is a mistake. Amended!

There seem to be some vestigial R scripts and other files in inst/sql and inst/report, could these be deleted, or perhaps moved to a development branch?

Unnecessary files removed.

I ran all the code in the Vignette, and found that sql.demographic.table(ccd) produced the following error for me (first sentence only): Invalid .internal.selfref detected and fixed by taking a (shallow) copy of the data.table so that := can add this new column by reference.

This is probably a warning from data.table. In some R version it happens. But you can still run through.

I ran devtools::spellcheck, and found the following errors (among other, trivial "errors" like variable names etc):

Amended.