*Closed — @surchs closed this issue 1 year ago*
Missing conditions. ds000115 has a `condit` column with what looks like clinical conditions. There are levels in there that we cannot model, but we want to model the other levels. Example:

- `SCZ-SIB` -> probably sibling of a participant with schizophrenia -> cannot model
- `SCZ` -> probably participant with schizophrenia -> we can model
So what do I do about the things we can't model? Our constraints are:

Problem with the way we model assessment tools. Right now we assume that if you have several columns linked to the same assessment tool, and any of these has a missing value, then the participant doesn't "have" the tool. That's not great for several reasons:
A participant with two conflicting diagnoses is currently hard to model. Example: ds000144 has `SADDx_p` for social anxiety at preschool and `SADDx_f` for social anxiety at follow-up, but only one imaging session with an unrelated name. If one is "Yes" and the other is "No", how should we display it in our model? This is probably mainly a problem for pure BIDS datasets that use these wonky hacks to introduce multi-session information into the phenotypic file even though they are not allowed to. So maybe a good response is to say: can't do it -> use BIDS-pheno or MR_proc.
Having to look up the controlled terms by hand is pretty annoying (and probably prone to error). If we turn this into a workflow, it will absolutely have to have the possible values pre-configured.
measurement that isn't a tool. For example: ds000201 has `BMI1` for Body Mass Index on session 1. Cool. But cogatlas doesn't have that. Makes sense, it's not a tool. More like "heart rate" or "height". Still, something we'd probably like to annotate. So what should we do with these things? SNOMED has them, but then we mix vocabularies...
missing terms. For assessment tools, there are things that exist in the real world, are assessments, but don't exist in Cognitive Atlas. Example: ds000201 has `HADS_Anxiety`, which is the "Hospital Anxiety and Depression Scale, Anxiety subscale". I can find that in SNOMED: `snomed:273524006`, even down to the subscale. Hmm...
clashing abbreviations. Not sure if this is a real problem. But I was looking for the "Beck Depression Inventory" in cognitive-atlas. They don't have it. They do have a "BDI" (the common abbreviation), but it refers to the "Battelle Developmental Inventory" - no idea what that is. I guess the main observation here is: we need to pay attention to how we let our users search for terms, because they might have a hard time finding the right term if the vocabulary isn't doing a great job with explicit names / abbreviations.
categorical variable encoded with numbers. When data owners describe a categorical variable with numbers, I cannot tell what the values mean from the column alone. Example: in ds000144 I have `SADDx_p` encoding presence of social anxiety. From the data dictionary I can see that it is encoded in the `Levels` key. If the data dictionary was incomplete or missing, we would have to find another way to address this. This will not be a problem for the annotation tool because the user will first tell us what the column is about, so that we can pick the correct workflow from that information.

need to look inside numeric column to categorize
Some continuous columns you actually want to look inside of. If a column has ages, I don't want to see every unique value, of course. But people use continuous values for categories and other things that aren't very obvious.
For example: in ds000115 there is a numeric column `saps7`. Most likely that has something to do with the Scale for the Assessment of Positive Symptoms for Schizophrenia. But what does the 7 mean? Would be good to look inside of this column. If we end up turning this process into a bulk tool, there should probably be a way to inspect even numeric columns.
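As a sketch of what "looking inside" could mean in a bulk tool, here is one possible heuristic: flag numeric columns whose values are few and integer-valued as candidates for categorical inspection. The data, threshold, and heuristic are all made up for illustration.

```python
import pandas as pd

# Hypothetical participants.tsv fragment; "saps7" mimics a numeric column
# that is really a coded categorical variable.
df = pd.DataFrame(
    {
        "age": [24.5, 31.2, 28.0, 35.7],
        "saps7": [0, 1, 3, 1],
    }
)

def looks_categorical(col: pd.Series, max_levels: int = 10) -> bool:
    """Heuristic: few distinct, integer-valued entries -> worth inspecting
    as possible category codes rather than a true continuous measure."""
    vals = col.dropna()
    return vals.nunique() <= max_levels and bool((vals % 1 == 0).all())

print(looks_categorical(df["saps7"]))  # True  -> show its unique values
print(looks_categorical(df["age"]))    # False -> treat as continuous
```

This would still misfire on genuinely continuous integer columns (e.g., age in whole years), so it only narrows down what a human should eyeball, it doesn't decide anything.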
Tool name not recognizable. Some datasets annotate their data with the measured concept rather than the name of the tool. For example "Handedness" instead of "Edinburgh Handedness Inventory". I'm not sure how to annotate this. Cogatlas does have "concepts" like that (that's the whole purpose of the project), but our data model currently expects the range of an assessment edge to be a specific controlled term for a tool. This is probably more of an issue for "bulk annotation", where the user annotating is usually not the data owner, who would have more insight.
Wrong or conflicting description. Low-quality data dictionaries are an issue because now I don't know who to believe. Example: ds000144 has a column `GADDx_p`. Looks like generalized anxiety disorder, yes? But the description says: "Separation Anxiety Disorder at Preschool". Most likely the description is incorrect.

Not a huge deal for the annotation tool. Could just have a workflow to change the description. But tricky for the bulk annotator, because I am not the data owner and I don't know what is correct.
Duplicate columns or leftover stuff. Some participants.tsv files are pretty low quality. For example, ds000201 has a column `SRH5_byScanner.y`, which is likely just a messy pandas merge leftover duplicate. Nothing to do really, just good to know that we're dealing with this kind of quality (and this is probably not a particularly bad dataset).
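For what it's worth, merge leftovers tend to follow a predictable naming pattern: base R's `merge()` appends `.x`/`.y` to clashing column names, while pandas' `merge()` uses `_x`/`_y` by default. A bulk tool could at least flag likely leftovers. A rough, hypothetical check:

```python
import re

# Hypothetical column list resembling the ds000201 situation.
columns = ["participant_id", "age", "SRH5_byScanner.x", "SRH5_byScanner.y"]

# R's merge() appends ".x"/".y" to clashing names; pandas' merge() uses
# "_x"/"_y" by default. Flag anything that looks like such a leftover.
leftover = re.compile(r"[._][xy]$")
suspects = [c for c in columns if leftover.search(c)]
print(suspects)  # ['SRH5_byScanner.x', 'SRH5_byScanner.y']
```

This would of course also hit legitimate columns that happen to end in `_x` or `.y`, so it's a flag for review, not an automatic drop.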
Multi-session info in the participants.tsv file. This is related to the multiple conflicting diagnoses issue described above. The problem arises from people putting repeated measures in the participants.tsv file by just adding more wide columns. We need to decide what we want to do with this. But it will probably be a lot easier to handle once we are at least aware of the multiple sessions.
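Once we do know which wide column maps to which session, the reshaping itself is mechanical. A hypothetical sketch using the ds000144-style columns (the session labels are guesses):

```python
import pandas as pd

# Hypothetical wide-format phenotype data mimicking ds000144's
# SADDx_p (preschool) / SADDx_f (follow-up) columns.
wide = pd.DataFrame(
    {
        "participant_id": ["sub-01", "sub-02"],
        "SADDx_p": ["Yes", "No"],
        "SADDx_f": ["No", "No"],
    }
)

# One row per (participant, session) instead of extra wide columns.
long = wide.melt(id_vars="participant_id", var_name="column", value_name="SADDx")
long["session"] = long["column"].map({"SADDx_p": "preschool", "SADDx_f": "follow-up"})
long = long.drop(columns="column")
print(long.to_string(index=False))
```

The hard part is not the reshape but knowing (or guessing) the suffix-to-session mapping, which is exactly the information these datasets encode informally.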
OK, I think this thing is done. Overall summary:
> measurement that isn't a tool. For example: ds000201 has `BMI1` for Body Mass Index on session 1. Cool. But cogatlas doesn't have that. Makes sense, it's not a tool. More like "heart rate" or "height". Still, something we'd probably like to annotate. So what should we do with these things? SNOMED has them, but then we mix vocabularies...
> missing terms. For assessment tools, there are things that exist in the real world, are assessments, but don't exist in Cognitive Atlas. Example: ds000201 has `HADS_Anxiety`, which is the "Hospital Anxiety and Depression Scale, Anxiety subscale". I can find that in SNOMED: `snomed:273524006`, even down to the subscale. Hmm...
I had assumed we were going to be mixing vocabularies anyway...
> clashing abbreviations. Not sure if this is a real problem. But I was looking for the "Beck Depression Inventory" in cognitive-atlas. They don't have it. They do have a "BDI" (the common abbreviation), but it refers to the "Battelle Developmental Inventory" - no idea what that is. I guess the main observation here is: we need to pay attention to how we let our users search for terms, because they might have a hard time finding the right term if the vocabulary isn't doing a great job with explicit names / abbreviations.
This might point to the need to have controlled vocabulary term metadata on display to help guide the user
> I had assumed we were going to be mixing vocabularies anyway...
No. We have two principles so far:
There may be some cases now where we are not being consistent about principle 2 (e.g., healthy-controls currently comes from NCIT rather than SNOMED). But that's one of the "lessons learned" from the OMOP folks that we should really stick with: one vocabulary per variable.
> This might point to the need to have controlled vocabulary term metadata on display to help guide the user
Yeah. It might not be enough to just show the name in a dropdown. Maybe we need, as you say, other metadata. Let's see. I could imagine that this can get quite complex quickly.
Moving into Review - Active, I'll take a look at the tsv as well 🙂
(sorry, pressed comment before I was ready) Thanks @surchs for your first stab at dataset annotation! Think all your points are very important.
Below are my comments on some of the issues.
> There are levels in there that we cannot model, but we want to model the other levels [e.g., of a diagnosis/subject group column]
I think some options for us are (barring major changes to the current data model): (a) pick the closest available term from the vocab, even if it's not 100% accurate, or (b) consider creating an `nb:Other` term URL to annotate levels that don't have any close controlled term, as a way to "flag" diagnosis/group values assigned to a participant but which cannot be modeled, without annotating them as missing in the graph.
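As a purely hypothetical sketch of what option (b) could look like in a BIDS-style annotation sidecar — the key layout, the `nb:Other` URL, and the SNOMED code are all illustrative, not part of the current data model:

```json
{
  "condit": {
    "Annotations": {
      "Levels": {
        "SCZ": { "TermURL": "snomed:58214004", "Label": "Schizophrenia" },
        "SCZ-SIB": { "TermURL": "nb:Other", "Label": "Sibling of a participant with schizophrenia" }
      }
    }
  }
}
```

The appeal is that the unmodelable level stays visible in the annotation (with its free-text label preserved) instead of silently becoming a missing value in the graph.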
> Right now we assume that if you have several columns linked to the same assessment tool, if any of these has a missing value, then the participant doesn't "have" the tool.
Agreed that in practice this doesn't work well. Especially because missing values in assessment tool subscales are so common and have many ways to be imputed during statistical analysis, I don't think it'd be very useful to impose an "all or none" approach at the cohort definition stage. I think for now we can loosen our constraint on this to annotate a subject as "having" an assessment if they have non-missing values for any (up to all) of the columns for a tool.
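The two rules are easy to contrast in pandas terms — a minimal sketch with made-up subscale columns:

```python
import pandas as pd

# Hypothetical subscale columns belonging to one assessment tool.
df = pd.DataFrame(
    {
        "participant_id": ["sub-01", "sub-02", "sub-03"],
        "tool_sub1": [4.0, None, None],
        "tool_sub2": [7.0, 2.0, None],
    }
)
tool_cols = ["tool_sub1", "tool_sub2"]

# Current rule: participant "has" the tool only if *all* columns are non-missing.
has_tool_all = df[tool_cols].notna().all(axis=1)
# Proposed loosened rule: *any* non-missing column counts.
has_tool_any = df[tool_cols].notna().any(axis=1)

print(has_tool_all.tolist())  # [True, False, False]
print(has_tool_any.tolist())  # [True, True, False]
```

Under the "any" rule, sub-02 (who completed only one subscale) would still be annotated as having the assessment.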
> A participant with [session-level diagnoses] is currently hard to model.
Since diagnosis is currently handled at the subject level and not the session level, I think for bulk annotations the best we can do is store the diagnosis at baseline, and potentially flag longitudinal data using another "Decision" option in the spreadsheet (maybe "revisit"?). We probably want to create another issue to discuss how/if we want to start modelling phenotypic info at the session level. I imagine for age this will also be important soon.
> measurement that isn't a tool (e.g., BMI) / missing terms for assessment tools
I would be strongly in favor of subdividing our current "Assessment" class (which I feel is too broad for one vocab) into, at minimum, "Cognitive Assessment" and "Clinical Assessment". Due to the conceptual focus of the Cognitive Atlas, it makes sense that it would have pretty limited coverage of terms for instruments to measure severity of specific illnesses, and I think the number of tools we would not be able to model (esp. if we want to be able to support clinical/patient annotations) could quickly outnumber those we can if we stick to just this vocab for every assessment. One idea then could be to have a `ClinicalAssessment` class employing SNOMED terms, and a `CognitiveAssessment` class employing the cogatlas. I think it would also be reasonable to fit physiological measurements (BMI, heart rate, BP, etc.) under the `ClinicalAssessment` category.
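A tiny sketch of the idea — the class names are hypothetical, and the point is just "one vocabulary per assessment subclass":

```python
# Hypothetical mapping from proposed assessment subclasses to the single
# vocabulary each would draw its controlled terms from.
ASSESSMENT_VOCAB = {
    "CognitiveAssessment": "cogatlas",  # tasks/instruments measuring cognitive concepts
    "ClinicalAssessment": "snomed",     # illness-severity scales, BMI, heart rate, BP
}

def vocab_for(assessment_class: str) -> str:
    """One vocabulary per variable class (the OMOP lesson mentioned above)."""
    return ASSESSMENT_VOCAB[assessment_class]

print(vocab_for("ClinicalAssessment"))  # snomed
```

This keeps each subclass internally consistent while still letting the model as a whole use more than one vocabulary.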
> need to look inside numeric column to categorize

What do you mean by "look inside" (e.g., `saps7`)? If we know that `saps7` is a numeric column, has 6 unique values, and is part of the SAPS tool, is that not sufficient for annotation? Or are you referring to needing a description of what the column is recording?
> Data quality issues
I would agree that for these types of issues, we would just have to say that we can't model the column due to "poor data quality". On the bright side, I think this process is revealing the importance of prospective rather than retroactive bulk annotation, because these types of errors are very challenging to resolve by a third party/after the fact. +1 for annotation tool route.
Thanks for your comments @alyssadai! I agree, we should discuss each of these. I'll link this conversation on the internal wiki to keep a record and then move some of these points over there: https://github.com/neurobagel/documentation/wiki/Neurobagel-Data-Model-limitations
Good datasets are:
Here is the main GDrive spreadsheet with all datasets.
To complete: a copy of the original .tsv as a new sheet in https://github.com/neurobagel/bulk_annotations/issues/2, so that we can use these data for further processing.

For reviewer: please take a look at the document and see if you have any notes on how we could make this easier to parse.