Context
Currently, when several columns are annotated as being about the same attribute (e.g., nb:Diagnosis), we store only the first instance that appears in the TSV.
One example where this is bad:
The QPN TSV contains two columns about diagnosis: one has the values PD and PSP (progressive supranuclear palsy), and the other classifies individuals as PD or healthy control. Both were annotated in the data dictionary, but only the PD/PSP diagnosis column is in the graph right now, because it appears first in the raw data.
So in short: a subject who has a missing value in the first column but has data in the second column (i.e., is a healthy control) will still not show up as a healthy control. -> not good
Desired treatment of multi-column annotation
According to our current data model, a subject can have a list of multiple diagnoses or assessments, but only one participant ID, age, or sex (?).
Multiple columns being IsAbout an assessment is already supported ✔️
Multiple columns currently can't be IsAbout a diagnosis ❌
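We should update the phenotypic column handling logic to return a list of values for a set of columns annotated as being IsAbout the same attribute (currently, this function is called for sex, diagnosis, and age, but not assessment): https://github.com/neurobagel/bagel-cli/blob/4da00b6db4cce30d40f101c0c4e17be25db3828f/bagel/pheno_utils.py#L213-L238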
Storing only the first ('transformed') value for variables that do not support multiple values (age, sex) should be done via conditionals outside of this utility function.
isSubjectGroup, for now, should remain mutually exclusive with any diagnoses. So, if the list of values for columns about diagnosis contains at least one instance of healthy control, we say the subject isSubjectGroup and do not assign any diagnoses.
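The desired behaviour could be sketched roughly as follows. This is a simplified illustration, not the actual bagel-cli code: `collect_values`, `summarize_subject`, the `HC` healthy-control marker, the missing-value markers, and the `hasDiagnosis`/`hasAge` keys are all hypothetical stand-ins chosen for this example, and the values are assumed to be already transformed.

```python
# Hypothetical, simplified stand-ins; not the actual bagel-cli implementation.
HEALTHY_CONTROL = "HC"  # assumed marker for a healthy control value
MISSING = {"", "NA", None}  # assumed missing-value markers


def collect_values(row: dict, columns: list) -> list:
    """Return all non-missing values found in the given columns, in column order."""
    return [row[col] for col in columns if row.get(col) not in MISSING]


def summarize_subject(row: dict, diagnosis_cols: list, age_cols: list) -> dict:
    """Apply the single-value and healthy-control rules outside the utility function."""
    diagnoses = collect_values(row, diagnosis_cols)
    ages = collect_values(row, age_cols)

    subject = {"isSubjectGroup": None, "hasDiagnosis": [], "hasAge": None}

    # isSubjectGroup is mutually exclusive with diagnoses: a single healthy
    # control value means no diagnoses are assigned.
    if HEALTHY_CONTROL in diagnoses:
        subject["isSubjectGroup"] = HEALTHY_CONTROL
    else:
        subject["hasDiagnosis"] = diagnoses

    # Age supports only one value, so keep the first non-missing one.
    if ages:
        subject["hasAge"] = ages[0]

    return subject


row = {"participant_id": "sub-01", "diagnosis_a": "", "diagnosis_b": "HC", "age": "65"}
result = summarize_subject(row, ["diagnosis_a", "diagnosis_b"], ["age"])
```

Here the subject has a missing value in the first diagnosis column and is still recognized as a healthy control through the second one, which is the case that currently gets lost.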
Steps to implement
[x] add new example with multiple diagnosis columns
[x] update README
[x] update existing relevant tests to use the new example
[x] change get_transformed_values to return a list
[x] ensure that other attributes that also rely on this function but accept only a single value in the data model do not break
[x] add integration test checking that output for a multi-column diagnosis annotation is handled appropriately
[x] create example with multiple values for age & sex and check that bagel pheno smoke test still passes (model should complain if values from >1 column ended up in the subject data)
[ ] (maybe) add explicit warning when multiple age/sex columns are detected, stating that we will only consider the first one
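For the optional last step, one possible shape for the warning is sketched below. The helper name `pick_single_column` and the message wording are assumptions for illustration only; the actual CLI may prefer a different reporting mechanism.

```python
import warnings
from typing import Optional


def pick_single_column(attribute: str, columns: list) -> Optional[str]:
    """Return the first column annotated for a single-valued attribute.

    Hypothetical helper: warns when more than one column was annotated as
    being about an attribute (e.g., age or sex) that accepts only one value.
    """
    if not columns:
        return None
    if len(columns) > 1:
        warnings.warn(
            f"Found {len(columns)} columns annotated as being about {attribute}: "
            f"{columns}. Only the first one ({columns[0]!r}) will be used.",
            UserWarning,
        )
    return columns[0]
```

Calling `pick_single_column("age", ["age_baseline", "age_followup"])` would return `"age_baseline"` and emit a single warning.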