neurobagel / bulk_annotations

Retroactively annotate a large number of BIDS datasets at once

Annotate four representative OpenNeuro datasets #10

Closed surchs closed 1 year ago

surchs commented 1 year ago

Good datasets are:

Here is the main GDrive spreadsheet with all datasets.

To complete:

For reviewer: please take a look at the document and see if you have any notes on how we could make this easier to parse.

surchs commented 1 year ago

Data Model related issues

Missing conditions. ds000115 has a condit column with what looks like clinical conditions. There are levels in there that we cannot model, but we want to model the other levels. Example:

Problem with the way we model assessment tools. Right now we assume that if several columns are linked to the same assessment tool and any of them has a missing value, then the participant doesn't "have" the tool. That's not great for several reasons:

  1. The data owner in their infinite wisdom may have decided to discretize a continuous raw score and spread the information across several columns. Then, by definition, only one column would have a non-missing value.
  2. The fact that the other columns have information still means that there is something to be learned for this participant. The non-missing columns are probably still useful.

Not saying we should add many complex heuristics to determine presence or absence. An Any or All rule might be enough (see the sketch below).
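As a minimal sketch, an Any/All rule could be a one-line switch over the non-missing mask. The tool-to-column mapping and the term ID below are hypothetical placeholders, not real annotations:

```python
import pandas as pd

# Hypothetical mapping from an assessment tool term to its linked columns;
# the term ID and column names are placeholders for illustration only.
TOOL_COLUMNS = {"cogatlas:trm_XXXX": ["saps1", "saps2", "saps7"]}

def has_assessment(row: pd.Series, columns: list[str], rule: str = "any") -> bool:
    """True if the participant "has" the tool under the given rule.

    rule="any": at least one linked column is non-missing (proposed loosening).
    rule="all": every linked column is non-missing (current assumption).
    """
    present = row[columns].notna()
    return present.any() if rule == "any" else present.all()

df = pd.read_csv("participants.tsv", sep="\t", na_values=["n/a"])
for tool, cols in TOOL_COLUMNS.items():
    df[f"has {tool}"] = df.apply(lambda row, cols=cols: has_assessment(row, cols), axis=1)
```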

A participant with two conflicting diagnoses is currently hard to model. Example: ds000144 has SADDx_p for social anxiety at preschool and SADDx_f for social anxiety at follow-up, but only one imaging session with an unrelated name. If one is "Yes" and the other "No", how should we display that in our model? This is probably mainly a problem for pure BIDS datasets that use these wonky hacks to squeeze multi-session information into the phenotypic file even though they are not allowed to. So maybe a good response is to say: can't do it -> use BIDS-pheno or MR_proc.
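For files that hack sessions in as wide columns, one possible direction is to reshape to one row per participant and session before modelling. A sketch, assuming the suffix-to-session mapping can be recovered from the column descriptions:

```python
import pandas as pd

# Sketch: reshape the wide per-session diagnosis columns from ds000144
# into long format. The suffix-to-session mapping is an assumption based
# on the column descriptions ("_p" = preschool, "_f" = follow-up).
df = pd.read_csv("participants.tsv", sep="\t")

long = df.melt(
    id_vars="participant_id",
    value_vars=["SADDx_p", "SADDx_f"],
    var_name="column",
    value_name="diagnosis",
)
long["session"] = long["column"].map(
    {"SADDx_p": "preschool", "SADDx_f": "follow-up"}
)
```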

Controlled Vocabulary related problems

Having to look up the controlled terms by hand is pretty annoying (and probably prone to error). If we turn this into a workflow, it will absolutely have to have the possible values pre-configured.

measurement that isn't a tool. For example: ds000201 has BMI1 for Body Mass Index on session 1. Cool. But cogatlas doesn't have that. Makes sense, it's not a tool. More like "heart rate" or "height". Still, something we'd probably like to annotate. So what should we do with these things? SNOMED has them, but then we mix vocabularies...

missing terms. For assessment tools, there are things that exist in the real world, are assessments, but don't exist in cognitive atlas. Example: ds000201 has HADS_Anxiety which is the "Hospital Anxiety and Depression scale, Anxiety subscale". I can find that in SNOMED: snomed:273524006, even down to the subscale. Hmm...

clashing abbreviations. Not sure if this is a real problem. But I was looking for the "Beck Depression Inventory" in cognitive-atlas. They don't have it. They do have a "BDI" (the common abbreviation), but it refers to "battelle developmental inventory" - no idea what that is. I guess the main observation here is: we need to pay attention to how we let our users search for terms, because they might have a hard time finding the right term if the vocabulary isn't doing a great job with explicit names / abbreviations.
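One mitigation could be alias-aware search in whatever UI we build: match the query against both full names and known abbreviations, so a string like "BDI" surfaces every candidate instead of silently resolving to the wrong one. A sketch, assuming we can pre-fetch terms together with alias metadata (which the vocabulary may not actually provide for every term):

```python
# Sketch of alias-aware term search. The alias metadata is hypothetical;
# the vocabulary may not provide it for every term.
TERMS = [
    {"id": "cogatlas:trm_...", "name": "Battelle Developmental Inventory", "aliases": ["BDI"]},
    # ... more terms, fetched ahead of time
]

def search_terms(query: str) -> list[dict]:
    """Match a query against both full names and abbreviations, so a
    string like "BDI" surfaces every candidate instead of one guess."""
    q = query.lower()
    return [
        term for term in TERMS
        if q in term["name"].lower()
        or any(q == alias.lower() for alias in term["aliases"])
    ]

print(search_terms("BDI"))  # finds the Battelle term via its abbreviation
```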

Continuous values problems

categorical variable encoded with numbers. When data owners describe a categorical variable with numbers, I cannot:

need to look inside numeric columns to categorize. Some continuous columns you actually want to look inside of. If a column has ages, I don't want to see every unique value, of course. But people use continuous values for categories and other things that aren't very obvious. For example: in ds000115 there is a numeric column saps7. Most likely that has something to do with the Scale for the Assessment of Positive Symptoms for Schizophrenia. But what does the 7 mean? It would be good to look inside this column. If we end up turning this process into a bulk tool, there should probably be a way to inspect even numeric columns.
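A bulk tool could offer a simple "peek" on numeric columns: show the unique values when there are few enough to plausibly be categorical codes, and summary statistics otherwise. A rough sketch (the threshold of 20 is an arbitrary assumption, as are the column names):

```python
import pandas as pd

df = pd.read_csv("participants.tsv", sep="\t", na_values=["n/a"])

def peek(column: str, max_levels: int = 20) -> None:
    """Show unique values for plausibly-categorical numeric columns,
    otherwise fall back to summary statistics."""
    values = df[column].dropna()
    if values.nunique() <= max_levels:
        print(values.value_counts().sort_index())
    else:
        print(values.describe())

peek("saps7")  # a few unique codes -> likely categorical, worth inspecting
peek("age")    # many unique values -> a summary is enough
```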

Data quality issues

Tool name not recognizable. Some datasets annotate their data with the measured concept rather than the name of the tool, for example "Handedness" instead of "Edinburgh Handedness Inventory". I'm not sure how to annotate this. Cogatlas does have "concepts" like that (that's the whole purpose of the project), but our data model currently expects the range of an assessment edge to be a specific controlled term for a tool. This is probably more of an issue for "bulk annotation", where the user annotating is usually not the data owner, who would have more insight.

Wrong or conflicting description. Low-quality data dictionaries are an issue because now I don't know who to believe. Example: ds000144 has a column GADDx_p. Looks like generalized anxiety disorder, yes? But the description says: "Separation Anxiety Disorder at Preschool". Most likely the description is incorrect. Not a huge deal for the annotation tool; we could just have a workflow to change the description. But it's tricky for the bulk-annotator, because I am not the data owner and I don't know what is correct.

Duplicate columns or leftover stuff. Some participants.tsv files are pretty low quality. For example ds000201 has column SRH5_byScanner.y which is likely just a messy pandas merge leftover duplicate. Nothing to do really, just good to know that we're dealing with this kind of quality (and this is probably not a particularly bad dataset).

Multi-session info in the participants.tsv file. This is related to the multiple conflicting diagnoses issue described above. The problem arises from people putting repeated measures in the participants.tsv file by just adding more wide columns. We need to decide what we want to do with this. But it will probably be a lot easier to handle when we are at least aware of the multiple sessions.
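Awareness could start with a cheap heuristic: group columns that share a stem but differ only in a short suffix, and flag multi-member families as possible repeated measures. A sketch (the suffix pattern is an assumption and will produce false positives):

```python
import re
from collections import defaultdict

# Heuristic sketch: group columns by a shared stem so families like
# SADDx_p / SADDx_f or BMI1 / BMI2 get flagged as possible repeated
# measures. The suffix pattern is an assumption, not a BIDS rule.
def candidate_session_columns(columns: list[str]) -> dict[str, list[str]]:
    families = defaultdict(list)
    for col in columns:
        stem = re.sub(r"(_[a-z])$|(\d+)$", "", col)
        families[stem].append(col)
    return {stem: cols for stem, cols in families.items() if len(cols) > 1}

print(candidate_session_columns(["SADDx_p", "SADDx_f", "BMI1", "BMI2", "age"]))
# {'SADDx': ['SADDx_p', 'SADDx_f'], 'BMI': ['BMI1', 'BMI2']}
```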

surchs commented 1 year ago

OK, I think this thing is done. Overall summary:

jarmoza commented 1 year ago

measurement that isn't a tool. For example: ds000201 has BMI1 for Body Mass Index on session 1. Cool. But cogatlas doesn't have that. Makes sense, it's not a tool. More like "heart rate" or "height". Still, something we'd probably like to annotate. So what should we do with these things? SNOMED has them, but then we mix vocabularies...

missing terms. For assessment tools, there are things that exist in the real world, are assessments, but don't exist in cognitive atlas. Example: ds000201 has HADS_Anxiety which is the "Hospital Anxiety and Depression scale, Anxiety subscale". I can find that in SNOMED: snomed:273524006, even down to the subscale. Hmm...

I had assumed we were going to be mixing vocabularies anyway...

clashing abbreviations. Not sure if this is a real problem. But I was looking for the "Beck Depression Inventory" in cognitive-atlas. They don't have it. They do have a "BDI" (the common abbreviation), but it refers to "battelle developmental inventory" - no idea what that is. I guess the main observation here is: we need to pay attention to how we let our users search for terms, because they might have a hard time finding the right term if the vocabulary isn't doing a great job with explicit names / abbreviations.

This might point to the need to have controlled vocabulary term metadata on display to help guide the user

surchs commented 1 year ago

I had assumed we were going to be mixing vocabularies anyway...

No. We have two principles so far:

  1. Make it easy to use
  2. Tell people what vocabulary to pick values from

There may be some cases now where we are not being consistent about 2 (e.g., healthy control currently comes from NCIT rather than SNOMED). But that's one of the "lessons learned" from the OMOP folks that we should really stick with: one vocabulary per variable.

surchs commented 1 year ago

This might point to the need to have controlled vocabulary term metadata on display to help guide the user

Yeah. It might not be enough to just show the name in a dropdown. Maybe we need, as you say, other metadata. Let's see. I could imagine that this can get quite complex quickly.

alyssadai commented 1 year ago

Moving into Review - Active, I'll take a look at the tsv as well 🙂

alyssadai commented 1 year ago

(sorry, pressed comment before I was ready) Thanks @surchs for your first stab at dataset annotation! Think all your points are very important.

Below are my comments on some of the issues.

There are levels in there that we cannot model, but we want to model the other levels [e.g., of a diagnosis/subject group column]

I think some options for us (barring major changes to the current data model) are: (a) pick the closest available term from the vocab, even if it's not 100% accurate, or (b) consider creating an nb:Other term URL to annotate levels that don't have any close controlled term, as a way to "flag" diagnosis/group values assigned to a participant that cannot be modeled, without annotating them as missing in the graph.
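For option (b), the annotation could look something like the sketch below, written as a BIDS-style sidecar entry built in Python. The shape of the "Annotations" block and the nb:Other URL are illustrative assumptions, not an existing Neurobagel schema; the SNOMED code is the standard term for schizophrenia.

```python
import json

# Sketch of option (b): one level maps to a real controlled term, another
# to the proposed nb:Other placeholder. The annotation shape is assumed
# for illustration, not the final Neurobagel data-dictionary schema.
annotation = {
    "group": {
        "Description": "clinical condition",
        "Levels": {"SCZ": "schizophrenia", "OTH": "unmodellable level"},
        "Annotations": {
            "Levels": {
                "SCZ": {"TermURL": "snomed:58214004", "Label": "Schizophrenia"},
                "OTH": {"TermURL": "nb:Other", "Label": "Other (no close term)"},
            }
        },
    }
}
print(json.dumps(annotation, indent=2))
```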

Right now we assume that if you have several columns linked to the same assessment tool, if any of these has a missing value, then the participant doesn't "have" the tool.

Agreed, I think in practice this doesn't work well. Especially because missing values in assessment tool subscales are so common and there are many ways to impute them during statistical analysis, I don't think it'd be very useful to impose an "all or none" approach at the cohort definition stage. I think for now we can loosen our constraint and annotate a subject as "having" an assessment if they have non-missing values for any (up to all) of the columns for a tool.

A participant with [session-level diagnoses] is currently hard to model.

Since diagnosis is currently handled at the subject level and not the session level, I think for bulk annotations the best we can do is store the diagnosis at baseline, and potentially flag longitudinal data using another "Decision" option in the spreadsheet (maybe "revisit"?). We probably want to create another issue to discuss how/if we want to start modelling phenotypic info at the session level. I imagine this will also be important for age soon.

measurement that isn't a tool (e.g., BMI) / missing terms for assessment tools

I would be strongly in favor of subdividing our current "Assessment" class (which I feel is too broad for one vocab) into, at minimum, "Cognitive Assessment" and "Clinical Assessment". Given the conceptual focus of the Cognitive Atlas, it makes sense that it would have pretty limited coverage of terms for instruments that measure the severity of specific illnesses, and I think the number of tools we would not be able to model (especially if we want to be able to support clinical/patient annotations) could quickly outnumber those we can if we stick to just this vocab for every assessment. One idea then could be to have a ClinicalAssessment class employing SNOMED terms and a CognitiveAssessment class employing the cogatlas. I think it would also be reasonable to fit physiological measurements (BMI, heart rate, BP, etc.) under the ClinicalAssessment category.
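The "one vocabulary per variable" principle from earlier in the thread could then be enforced per subclass, e.g. with a simple mapping. The class names and namespace prefixes below are illustrative only:

```python
# Sketch of the proposed split, keeping one vocabulary per subclass.
# Class names and prefixes are assumptions for illustration.
ASSESSMENT_VOCABULARIES = {
    "CognitiveAssessment": "cogatlas",  # e.g., cognitive tasks and tests
    "ClinicalAssessment": "snomed",     # e.g., HADS, BMI, heart rate
}

def allowed_vocabulary(term_url: str, assessment_class: str) -> bool:
    """Check that an annotated term comes from the single vocabulary
    designated for its assessment subclass."""
    prefix = term_url.split(":", 1)[0]
    return prefix == ASSESSMENT_VOCABULARIES[assessment_class]

assert allowed_vocabulary("snomed:273524006", "ClinicalAssessment")
```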

need to look inside numeric column to categorize

What do you mean by "look inside" (e.g., saps7)? If we know that saps7 is a numeric column, has 6 unique values, and is part of the SAPS tool, is that not sufficient for annotation? Or are you referring to needing a description for what the column is recording?

Data quality issues

I would agree that for these types of issues, we would just have to say that we can't model the column due to "poor data quality". On the bright side, I think this process is revealing the importance of prospective rather than retroactive bulk annotation, because these types of errors are very challenging to resolve by a third party/after the fact. +1 for annotation tool route.

surchs commented 1 year ago

Thanks for your comments @alyssadai! I agree, we should discuss each of these. I'll link this conversation on the internal wiki to keep a record and then move some of these points over there: https://github.com/neurobagel/documentation/wiki/Neurobagel-Data-Model-limitations