Although it limits the number of comorbidities per visit/patient, many people have 'wide' format ICD-9 data, e.g.
visitId ICD_01 ICD_02 ICD_03 ...
PT123 4411 V1001 E8012 ...
PT789
...
Now that the allocation of comorbidities itself is so fast, the slowest step is setting up the vector of vectors of int values containing the ICD-9 codes. This is primarily because we need to search the list of visitIds as we progress to check for duplicates, although there is an optimization for the case where the visitId is the same as the previous one. The initial 'wide' structure could be processed more quickly, and wouldn't (necessarily) require checking for duplicated visit IDs. We would still pay the price of converting the codes to factors, if they are not factors already. Using factors makes a lot of sense for these codes, since the same codes recur many times.
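To make the cost difference concrete, here is a minimal C++ sketch of the two setup paths. It is not the package's actual code: the names fromLong, fromWide and NA_CODE are hypothetical, codes are assumed to be integers already, and the wide path assumes each visitId occupies exactly one row.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

typedef std::vector<int> Codes;         // integer codes for one visit
typedef std::vector<Codes> VisitCodes;  // one entry per distinct visit

const int NA_CODE = -1;  // hypothetical marker for an empty wide cell

// Long format: one (visitId, code) pair per row. Every row must be matched
// to a visit, with a shortcut when the visitId repeats the previous row's.
VisitCodes fromLong(const std::vector<std::string>& visitIds,
                    const std::vector<int>& codes) {
  VisitCodes out;
  std::unordered_map<std::string, std::size_t> seen;  // visitId -> index in out
  std::size_t last = 0;                               // index of previous row's visit
  for (std::size_t i = 0; i < visitIds.size(); ++i) {
    if (i > 0 && visitIds[i] == visitIds[i - 1]) {
      out[last].push_back(codes[i]);  // shortcut: no lookup needed
      continue;
    }
    auto it = seen.find(visitIds[i]);
    if (it == seen.end()) {
      seen[visitIds[i]] = out.size();
      last = out.size();
      out.push_back(Codes(1, codes[i]));  // first code of a new visit
    } else {
      last = it->second;
      out[last].push_back(codes[i]);
    }
  }
  return out;
}

// Wide format: one row per visit, so each row becomes one inner vector and
// no visitId lookup happens at all (assumption: no duplicated visit rows).
VisitCodes fromWide(const std::vector<std::vector<int> >& rows) {
  VisitCodes out;
  out.reserve(rows.size());
  for (const auto& row : rows) {
    Codes visit;
    for (int code : row) {
      if (code != NA_CODE) visit.push_back(code);
    }
    out.push_back(visit);
  }
  return out;
}
```

The long path does a string comparison or hash lookup per row, whereas the wide path only copies non-missing cells, which is where the expected saving would come from.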
The factor levels for all the columns wouldn't necessarily be the same. In fact, the ICD_29 column, for example, is likely to be relatively unpopulated and will probably have far fewer levels. The factor levels would have to be made consistent across the ICD columns (and consistent with the mapping, but I do this anyway, reducing the mappings to only those codes I am actually going to assign).
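One way the levels could be made consistent is to re-index every column against a single shared level table. The sketch below is illustrative only: Factor is a hypothetical stand-in for an R factor (zero-based indices, -1 for NA), and unifyLevels is not an existing function in the package.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for an R factor: 'levels' holds the distinct codes,
// 'index' holds one zero-based position into 'levels' per row, or -1 for NA.
struct Factor {
  std::vector<std::string> levels;
  std::vector<int> index;
};

// Re-index every column against one shared level table, so that the same
// integer refers to the same ICD-9 code in every column.
// Returns the shared levels in order of first appearance.
std::vector<std::string> unifyLevels(std::vector<Factor>& cols) {
  std::vector<std::string> shared;
  std::unordered_map<std::string, int> pos;  // code -> position in 'shared'
  for (Factor& col : cols) {
    // remap[j] is the shared position of this column's j-th level
    std::vector<int> remap(col.levels.size());
    for (std::size_t j = 0; j < col.levels.size(); ++j) {
      auto it = pos.find(col.levels[j]);
      if (it == pos.end()) {
        it = pos.emplace(col.levels[j], static_cast<int>(shared.size())).first;
        shared.push_back(col.levels[j]);
      }
      remap[j] = it->second;
    }
    for (int& ix : col.index) {
      if (ix >= 0) ix = remap[ix];  // NA (-1) is left untouched
    }
  }
  // Only now is the shared table complete, so assign it to every column.
  for (Factor& col : cols) col.levels = shared;
  return shared;
}
```

The work is proportional to the number of levels plus the number of cells, so a sparsely populated column such as ICD_29 adds very little.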
The current way of doing this would be to call icd9WideToLong and then icd9Comorbid. It would be better to have icd9ComorbidFromWide, which would take a data frame and some information about how the columns are named. People are much less likely to have this data in matrix format, so that case need not be covered.
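As a rough idea of what the core of such a function might do once the wide columns share one set of integer levels, here is a hedged C++ sketch. The name comorbidFromWideSketch, its signature, and the -1 marker for missing cells are all assumptions for illustration, not the proposed icd9ComorbidFromWide interface.

```cpp
#include <cstddef>
#include <vector>

// rows[v]    : integer codes for visit v, straight from the wide columns
//              (-1 marks an empty cell; this marker is an assumption).
// mapping[c] : integer codes belonging to comorbidity c.
// nLevels    : number of distinct codes in the shared level table.
// Returns flags[v][c], true when visit v has any code in comorbidity c.
std::vector<std::vector<bool> > comorbidFromWideSketch(
    const std::vector<std::vector<int> >& rows,
    const std::vector<std::vector<int> >& mapping,
    std::size_t nLevels) {
  // Invert the mapping once: lookup[code] lists the comorbidities with that code.
  std::vector<std::vector<std::size_t> > lookup(nLevels);
  for (std::size_t c = 0; c < mapping.size(); ++c) {
    for (int code : mapping[c]) {
      if (code >= 0 && static_cast<std::size_t>(code) < nLevels) {
        lookup[code].push_back(c);
      }
    }
  }
  std::vector<std::vector<bool> > flags(
      rows.size(), std::vector<bool>(mapping.size(), false));
  for (std::size_t v = 0; v < rows.size(); ++v) {
    for (int code : rows[v]) {
      // Skip empty cells and codes outside the level table.
      if (code < 0 || static_cast<std::size_t>(code) >= nLevels) continue;
      for (std::size_t c : lookup[code]) flags[v][c] = true;
    }
  }
  return flags;
}
```

Because each wide row is consumed in place, no intermediate long-format copy and no visit ID search is needed in this sketch.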