ropensci / coder

Classification of Cases into Deterministic Categories
https://docs.ropensci.org/coder/

Duplicate names to codify #116

Closed eribul closed 4 years ago

eribul commented 4 years ago

If there are duplicate names in the data passed to codify(), it returns a data.table error that does not help the user fix the problem. (categorize() catches this with "Non-unique ids!", but codify() does not.)

library(coder)
# Stack ex_people on itself so that every name appears twice
people_doubled <- rbind(ex_people, ex_people)
codify(people_doubled, ex_icd10, id = "name", date = "event", days = c(-365, 0))

More importantly, aren't there use cases for categorize() where the same patient has multiple events with different dates? Examples could include adverse events after starting multiple lines of therapy, or comorbidities before multiple diagnoses. In those cases, doesn't it make sense to return one row for each event, even if there are several for a patient? Should the check only raise an error when there are duplicate name/date pairs?
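The check proposed in the last question could be sketched in base R roughly as follows. This is only an illustration: the `people` data frame and the helper `has_duplicate_pairs()` are hypothetical and not part of coder.

```r
# Hypothetical unit data: patient "A" has two events on different dates
people <- data.frame(
  name  = c("A", "A", "B"),
  event = as.Date(c("2020-01-01", "2021-06-15", "2020-03-10"))
)

# Hypothetical helper (not part of coder):
# TRUE if any combination of id and date occurs more than once
has_duplicate_pairs <- function(data, id, date) {
  anyDuplicated(data[, c(id, date)]) > 0
}

# Names repeat, but every name/date pair is unique
has_duplicate_pairs(people, "name", "event")

# Every name/date pair now occurs twice
has_duplicate_pairs(rbind(people, people), "name", "event")
```

Under this rule, repeated names alone would pass the check, and only exact name/date repeats would be rejected.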

eribul commented 4 years ago

Review: Thank you for noticing! The message should now be the same as for categorize(). I agree that such a feature is relevant. The problem, however, is that unit data is matched to code data on the index variable, and I cannot currently perform that matching on the date column as well: it would require a non-equi join, which some data.table operations allow but merge(), which is currently used, does not. This would be possible after some refactoring of internal functions, but for now I think it is better to perform such operations with standard functionality outside the package, such as x %>% group_by(y) %>% codify(...) with dplyr or x[, codify(...), by = y] with data.table.
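For readers unfamiliar with the term, a non-equi join in data.table could look roughly like the sketch below. It does not show how coder works internally; the tables `units` and `codes`, their column names, and the codes are hypothetical, and the window mirrors days = c(-365, 0) from the example above.

```r
library(data.table)

# Hypothetical unit data: patient "A" has two index dates
units <- data.table(
  name  = c("A", "A", "B"),
  event = as.IDate(c("2020-01-01", "2021-06-15", "2020-03-10"))
)

# Hypothetical code data: one dated code per row
codes <- data.table(
  name      = c("A", "A", "B"),
  code_date = as.IDate(c("2019-10-01", "2021-05-01", "2018-01-01")),
  code      = c("I10", "E11", "J45")
)

# Window bounds for each unit row, corresponding to days = c(-365, 0)
units[, `:=`(from = event - 365L, to = event)]

# Non-equi join: equal name, and code_date inside [from, to].
# Unit rows without any code in their window are kept with code = NA.
matched <- codes[units,
  on = .(name, code_date >= from, code_date <= to),
  .(name, event = i.event, code = x.code)
]
```

Because the join is driven by the rows of `units`, each name/date pair gets its own result rows, so a patient with several events is handled without grouping first.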

eribul commented 4 years ago

Still, give some thought to whether we can solve it!

eribul commented 4 years ago

On second thought, no ...