tech_names - Githubissues

eribul commented 4 years ago

Calling categorize() on a table returns columns with spaces in their names. That makes the output awkward for further analysis, since such columns are hard to work with programmatically, for example when using data.table to filter for one diagnosis or to compute the percentage of patients (perhaps within each group) who have a condition. Names with spaces are nice for display in a table, but is displaying individual patients in a table (as opposed to aggregated statistics) a common use case?

It seems like the tech_names argument is designed to fix this, but it leaves prefixes like charlsonregex on every column name, and these need to be removed before any meaningful downstream analysis. How about removing the charlsonregex, or at least the regex, in these cases? (Indeed, is there a reason that the charlson classcodes object itself has to have the regex prefixes? It already has an attribute regexprs that includes those column names.) Also, perhaps consider making tech_names default to TRUE, for the reasons described above.
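Until then, such prefixes can be stripped after the fact with base R. The following is only a minimal sketch: the column names are made up for illustration, and it assumes the prefix has the form `<classcodes>_regex_` or `<classcodes>_index_`, which may not match what tech_names actually produces.

```r
# Hypothetical column names; the real output of categorize() may differ.
nms <- c("id",
         "charlson_regex_myocardial_infarction",
         "charlson_regex_diabetes")

# Drop a leading "<classcodes>_regex_" or "<classcodes>_index_" prefix.
# Names without such a prefix (here "id") are left unchanged.
stripped <- sub("^[[:alnum:]]+_(regex|index)_", "", nms)
stripped
```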

eribul commented 3 years ago

I could in fact strip the regex_ prefix from the classcodes objects. One reason the prefixes are there is that the attributes are based on those names during the construction of the objects. It might be reasonable to remove them after those attributes are set, but before exporting the final object.

eribul commented 3 years ago

Review: Good point! I have made several changes:

classcodes objects no longer have the column prefixes (reg|ind)ex_.
I have introduced a new print.classcodes() method for a better default display of classcodes, where regular expressions and indices are identified by a heading rather than by a column name prefix.
categorize() has a new argument check.names (same as in data.frame/data.table). It is TRUE by default, which makes the column names syntactically valid (using dots instead of spaces). The original names (possibly with spaces) are retained with check.names = FALSE, which might sometimes be useful.
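
The effect of check.names can be illustrated with base R alone, since the argument is described as behaving like its data.frame counterpart. This sketch does not call categorize(); the data and the column name are invented for illustration.

```r
# Illustrative data; the second column name contains a space.
raw <- data.frame(
  id = 1:3,
  `myocardial infarction` = c(TRUE, FALSE, TRUE),
  check.names = FALSE          # keep the name exactly as given
)
names(raw)

# make.names() is what data.frame() applies when check.names = TRUE:
# characters that are invalid in a syntactic name become dots.
names(raw) <- make.names(names(raw))
names(raw)

# A syntactic name can be used with `$` without backticks.
mean(raw$myocardial.infarction)
```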

The reason for the long names implied by tech_names is that categorize() is sometimes used multiple times on the same data, for example to enhance a data set with both comorbidity and adverse events. Grouping such variable names by common and descriptive prefixes might then be useful.
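
As a rough illustration of that use case, a shared prefix makes it easy to pick out each group of columns afterwards. The names below are hypothetical and are not taken from actual categorize() output.

```r
# Hypothetical column names after categorizing the same data twice,
# once for comorbidity and once for adverse events.
nms <- c("id",
         "charlson_myocardial_infarction",
         "charlson_diabetes",
         "hip_ae_infection")

# Select each group of columns by its descriptive prefix.
comorbidity <- nms[startsWith(nms, "charlson_")]
adverse     <- nms[startsWith(nms, "hip_ae_")]
```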

eribul commented 3 years ago

This is also related to #130

ropensci / coder

tech_names #120