neurobagel / bulk_annotations

Retroactively annotate a large number of BIDS datasets at once

Annotate four representative OpenNeuro datasets #10

Closed surchs closed 1 year ago

surchs commented 1 year ago

Good datasets are:

Here is the main GDrive spreadsheet with all datasets.

To complete:

For reviewer: please take a look at the document and see if you have any notes on how we could make this easier to parse.

surchs commented 1 year ago

Data Model related issues

Missing conditions. ds000115 has a condit column with what looks like clinical conditions. There are levels in there that we cannot model, but we want to model the other levels. Example:

Problem with the way we model assessment tools. Right now we assume that if several columns are linked to the same assessment tool and any of them has a missing value, then the participant doesn't "have" the tool. That's not great for several reasons:

  1. The data owner in their infinite wisdom may have decided to discretize a continuous raw score and spread the information across several columns. Then, by definition, only one column would have a non-missing value.
  2. The fact that the other columns have information still means that there is something to be learned for this participant. The non-missing columns are probably still useful.

Not saying we should add many complex heuristics to determine presence or absence. An Any or All rule might be enough (see the sketch below).
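As a minimal sketch, an Any/All rule could be a one-line switch over the non-missing mask. The tool-to-column mapping and the term ID below are hypothetical placeholders, not real annotations:

```python
import pandas as pd

# Hypothetical mapping from an assessment tool term to its linked columns;
# the term ID and column names are placeholders for illustration only.
TOOL_COLUMNS = {"cogatlas:trm_XXXX": ["saps1", "saps2", "saps7"]}

def has_assessment(row: pd.Series, columns: list[str], rule: str = "any") -> bool:
    """True if the participant "has" the tool under the given rule.

    rule="any": at least one linked column is non-missing (proposed loosening).
    rule="all": every linked column is non-missing (current assumption).
    """
    present = row[columns].notna()
    return present.any() if rule == "any" else present.all()

df = pd.read_csv("participants.tsv", sep="\t", na_values=["n/a"])
for tool, cols in TOOL_COLUMNS.items():
    df[f"has {tool}"] = df.apply(lambda row, cols=cols: has_assessment(row, cols), axis=1)
```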

A participant with two conflicting diagnoses is currently hard to model. Example: ds000144 has SADDx_p for social anxiety at preschool and SADDx_f for social anxiety at follow-up, but only one imaging session with an unrelated name. If one is "Yes" and the other "No", how should we display that in our model? This is probably mainly a problem for pure BIDS datasets that use these wonky hacks to squeeze multi-session information into the phenotypic file even though they are not allowed to. So maybe a good response is to say: can't do it -> use BIDS-pheno or MR_proc.
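For files that hack sessions in as wide columns, one possible direction is to reshape to one row per participant and session before modelling. A sketch, assuming the suffix-to-session mapping can be recovered from the column descriptions:

```python
import pandas as pd

# Sketch: reshape the wide per-session diagnosis columns from ds000144
# into long format. The suffix-to-session mapping is an assumption based
# on the column descriptions ("_p" = preschool, "_f" = follow-up).
df = pd.read_csv("participants.tsv", sep="\t")

long = df.melt(
    id_vars="participant_id",
    value_vars=["SADDx_p", "SADDx_f"],
    var_name="column",
    value_name="diagnosis",
)
long["session"] = long["column"].map(
    {"SADDx_p": "preschool", "SADDx_f": "follow-up"}
)
```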

Controlled Vocabulary related problems

Having to look up the controlled terms by hand is pretty annoying (and probably prone to error). If we turn this into a workflow, it will absolutely have to have the possible values pre-configured.

measurement that isn't a tool. For example: ds000201 has BMI1 for Body Mass Index on session 1. Cool. But cogatlas doesn't have that. Makes sense, it's not a tool. More like "heart rate" or "height". Still, something we'd probably like to annotate. So what should we do with these things? SNOMED has them, but then we mix vocabularies...

missing terms. For assessment tools, there are things that exist in the real world, are assessments, but don't exist in cognitive atlas. Example: ds000201 has HADS_Anxiety which is the "Hospital Anxiety and Depression scale, Anxiety subscale". I can find that in SNOMED: snomed:273524006, even down to the subscale. Hmm...

clashing abbreviations. Not sure if this is a real problem. But I was looking for the "Beck Depression Inventory" in cognitive-atlas. They don't have it. They do have a "BDI" (the common abbreviation), but it refers to "battelle developmental inventory" - no idea what that is. I guess the main observation here is: we need to pay attention to how we let our users search for terms, because they might have a hard time finding the right term if the vocabulary isn't doing a great job with explicit names / abbreviations.
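One mitigation could be alias-aware search in whatever UI we build: match the query against both full names and known abbreviations, so a string like "BDI" surfaces every candidate instead of silently resolving to the wrong one. A sketch, assuming we can pre-fetch terms together with alias metadata (which the vocabulary may not actually provide for every term):

```python
# Sketch of alias-aware term search. The alias metadata is hypothetical;
# the vocabulary may not provide it for every term.
TERMS = [
    {"id": "cogatlas:trm_...", "name": "Battelle Developmental Inventory", "aliases": ["BDI"]},
    # ... more terms, fetched ahead of time
]

def search_terms(query: str) -> list[dict]:
    """Match a query against both full names and abbreviations, so a
    string like "BDI" surfaces every candidate instead of one guess."""
    q = query.lower()
    return [
        term for term in TERMS
        if q in term["name"].lower()
        or any(q == alias.lower() for alias in term["aliases"])
    ]

print(search_terms("BDI"))  # finds the Battelle term via its abbreviation
```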

Continuous values problems

categorical variable encoded with numbers. When data owners describe a categorical variable with numbers, I cannot:

need to look inside numeric columns to categorize. Some continuous columns you actually want to look inside of. If a column has ages, I don't want to see every unique value, of course. But people use continuous values for categories and other things that aren't very obvious. For example: in ds000115 there is a numeric column saps7. Most likely that has something to do with the Scale for the Assessment of Positive Symptoms for Schizophrenia. But what does the 7 mean? It would be good to look inside this column. If we end up turning this process into a bulk tool, there should probably be a way to inspect even numeric columns.
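A bulk tool could offer a simple "peek" on numeric columns: show the unique values when there are few enough to plausibly be categorical codes, and summary statistics otherwise. A rough sketch (the threshold of 20 is an arbitrary assumption, as are the column names):

```python
import pandas as pd

df = pd.read_csv("participants.tsv", sep="\t", na_values=["n/a"])

def peek(column: str, max_levels: int = 20) -> None:
    """Show unique values for plausibly-categorical numeric columns,
    otherwise fall back to summary statistics."""
    values = df[column].dropna()
    if values.nunique() <= max_levels:
        print(values.value_counts().sort_index())
    else:
        print(values.describe())

peek("saps7")  # a few unique codes -> likely categorical, worth inspecting
peek("age")    # many unique values -> a summary is enough
```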

Data quality issues

Tool name not recognizable. Some datasets annotate their data with the measured concept rather than the name of the tool, for example "Handedness" instead of "Edinburgh Handedness Inventory". I'm not sure how to annotate this. Cogatlas does have "concepts" like that (that's the whole purpose of the project), but our data model currently expects the range of an assessment edge to be a specific controlled term for a tool. This is probably more of an issue for "bulk annotation", where the user annotating is usually not the data owner, who would have more insight.

Wrong or conflicting description. Low-quality data dictionaries are an issue because now I don't know who to believe. Example: ds000144 has a column GADDx_p. Looks like generalized anxiety disorder, yes? But the description says: "Separation Anxiety Disorder at Preschool". Most likely the description is incorrect. Not a huge deal for the annotation tool; we could just have a workflow to change the description. But it's tricky for the bulk-annotator, because I am not the data owner and I don't know what is correct.

Duplicate columns or leftover stuff. Some participants.tsv files are pretty low quality. For example ds000201 has column SRH5_byScanner.y which is likely just a messy pandas merge leftover duplicate. Nothing to do really, just good to know that we're dealing with this kind of quality (and this is probably not a particularly bad dataset).

Multi-session info in the participants.tsv file. This is related to the multiple conflicting diagnoses issue described above. The problem arises from people putting repeated measures in the participants.tsv file by just adding more wide columns. We need to decide what we want to do with this. But it will probably be a lot easier to handle when we are at least aware of the multiple sessions.
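Awareness could start with a cheap heuristic: group columns that share a stem but differ only in a short suffix, and flag multi-member families as possible repeated measures. A sketch (the suffix pattern is an assumption and will produce false positives):

```python
import re
from collections import defaultdict

# Heuristic sketch: group columns by a shared stem so families like
# SADDx_p / SADDx_f or BMI1 / BMI2 get flagged as possible repeated
# measures. The suffix pattern is an assumption, not a BIDS rule.
def candidate_session_columns(columns: list[str]) -> dict[str, list[str]]:
    families = defaultdict(list)
    for col in columns:
        stem = re.sub(r"(_[a-z])$|(\d+)$", "", col)
        families[stem].append(col)
    return {stem: cols for stem, cols in families.items() if len(cols) > 1}

print(candidate_session_columns(["SADDx_p", "SADDx_f", "BMI1", "BMI2", "age"]))
# {'SADDx': ['SADDx_p', 'SADDx_f'], 'BMI': ['BMI1', 'BMI2']}
```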

surchs commented 1 year ago

OK, I think this thing is done. Overall summary:

jarmoza commented 1 year ago

measurement that isn't a tool. For example: ds000201 has BMI1 for Body Mass Index on session 1. Cool. But cogatlas doesn't have that. Makes sense, it's not a tool. More like "heart rate" or "height". Still, something we'd probably like to annotate. So what should we do with these things? SNOMED has them, but then we mix vocabularies...

missing terms. For assessment tools, there are things that exist in the real world, are assessments, but don't exist in cognitive atlas. Example: ds000201 has HADS_Anxiety which is the "Hospital Anxiety and Depression scale, Anxiety subscale". I can find that in SNOMED: snomed:273524006, even down to the subscale. Hmm...

I had assumed we were going to be mixing vocabularies anyway...

clashing abbreviations. Not sure if this is a real problem. But I was looking for the "Beck Depression Inventory" in cognitive-atlas. They don't have it. They do have a "BDI" (the common abbreviation), but it refers to "battelle developmental inventory" - no idea what that is. I guess the main observation here is: we need to pay attention to how we let our users search for terms, because they might have a hard time finding the right term if the vocabulary isn't doing a great job with explicit names / abbreviations.

This might point to the need to have controlled vocabulary term metadata on display to help guide the user

surchs commented 1 year ago

I had assumed we were going to be mixing vocabularies anyway...

No. We have two principles so far:

  1. Make it easy to use
  2. Tell people what vocabulary to pick values from

There may be some cases now where we are not being consistent about 2 (e.g., healthy control currently comes from NCIT rather than SNOMED). But that's one of the "lessons learned" from the OMOP folks that we should really stick with: one vocabulary per variable.

surchs commented 1 year ago

This might point to the need to have controlled vocabulary term metadata on display to help guide the user

Yeah. It might not be enough to just show the name in a dropdown. Maybe we need, as you say, other metadata. Let's see. I could imagine that this can get quite complex quickly.

alyssadai commented 1 year ago

Moving into Review - Active, I'll take a look at the tsv as well 🙂

alyssadai commented 1 year ago

(sorry, pressed comment before I was ready) Thanks @surchs for your first stab at dataset annotation! Think all your points are very important.

Below are my comments on some of the issues.

There are levels in there that we cannot model, but we want to model the other levels [e.g., of a diagnosis/subject group column]

I think some options for us (barring major changes to the current data model) are: (a) pick the closest available term from the vocab, even if it's not 100% accurate, or (b) consider creating an nb:Other term URL to annotate levels that don't have any close controlled term, as a way to "flag" diagnosis/group values assigned to a participant that cannot be modeled, without annotating them as missing in the graph.
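For option (b), the annotation could look something like the sketch below, written as a BIDS-style sidecar entry built in Python. The shape of the "Annotations" block and the nb:Other URL are illustrative assumptions, not an existing Neurobagel schema; the SNOMED code is the standard term for schizophrenia.

```python
import json

# Sketch of option (b): one level maps to a real controlled term, another
# to the proposed nb:Other placeholder. The annotation shape is assumed
# for illustration, not the final Neurobagel data-dictionary schema.
annotation = {
    "group": {
        "Description": "clinical condition",
        "Levels": {"SCZ": "schizophrenia", "OTH": "unmodellable level"},
        "Annotations": {
            "Levels": {
                "SCZ": {"TermURL": "snomed:58214004", "Label": "Schizophrenia"},
                "OTH": {"TermURL": "nb:Other", "Label": "Other (no close term)"},
            }
        },
    }
}
print(json.dumps(annotation, indent=2))
```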

Right now we assume that if you have several columns linked to the same assessment tool, if any of these has a missing value, then the participant doesn't "have" the tool.

Agreed, I think in practice this doesn't work well. Especially because missing values in assessment tool subscales are so common and there are many ways to impute them during statistical analysis, I don't think it'd be very useful to impose an "all or none" approach at the cohort definition stage. I think for now we can loosen our constraint and annotate a subject as "having" an assessment if they have non-missing values for any (up to all) of the columns for a tool.

A participant with [session-level diagnoses] is currently hard to model.

Since diagnosis is currently handled at the subject level and not the session level, I think for bulk annotations the best we can do is store the diagnosis at baseline, and potentially flag longitudinal data using another "Decision" option in the spreadsheet (maybe "revisit"?). We probably want to create another issue to discuss how/if we want to start modelling phenotypic info at the session level. I imagine this will also be important for age soon.

measurement that isn't a tool (e.g., BMI) / missing terms for assessment tools

I would be strongly in favor of subdividing our current "Assessment" class (which I feel is too broad for one vocab) into, at minimum, "Cognitive Assessment" and "Clinical Assessment". Given the conceptual focus of the Cognitive Atlas, it makes sense that it would have pretty limited coverage of terms for instruments that measure the severity of specific illnesses, and I think the number of tools we would not be able to model (especially if we want to be able to support clinical/patient annotations) could quickly outnumber those we can if we stick to just this vocab for every assessment. One idea then could be to have a ClinicalAssessment class employing SNOMED terms and a CognitiveAssessment class employing the cogatlas. I think it would also be reasonable to fit physiological measurements (BMI, heart rate, BP, etc.) under the ClinicalAssessment category.
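The "one vocabulary per variable" principle from earlier in the thread could then be enforced per subclass, e.g. with a simple mapping. The class names and namespace prefixes below are illustrative only:

```python
# Sketch of the proposed split, keeping one vocabulary per subclass.
# Class names and prefixes are assumptions for illustration.
ASSESSMENT_VOCABULARIES = {
    "CognitiveAssessment": "cogatlas",  # e.g., cognitive tasks and tests
    "ClinicalAssessment": "snomed",     # e.g., HADS, BMI, heart rate
}

def allowed_vocabulary(term_url: str, assessment_class: str) -> bool:
    """Check that an annotated term comes from the single vocabulary
    designated for its assessment subclass."""
    prefix = term_url.split(":", 1)[0]
    return prefix == ASSESSMENT_VOCABULARIES[assessment_class]

assert allowed_vocabulary("snomed:273524006", "ClinicalAssessment")
```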

need to look inside numeric column to categorize

What do you mean by "look inside" (e.g., saps7)? If we know that saps7 is a numeric column, has 6 unique values, and is part of the SAPS tool, is that not sufficient for annotation? Or are you referring to needing a description for what the column is recording?

Data quality issues

I would agree that for these types of issues, we would just have to say that we can't model the column due to "poor data quality". On the bright side, I think this process is revealing the importance of prospective rather than retroactive bulk annotation, because these types of errors are very challenging to resolve by a third party/after the fact. +1 for annotation tool route.

surchs commented 1 year ago

Thanks for your comments @alyssadai! I agree, we should discuss each of these. I'll link this conversation on the internal wiki to keep a record and then move some of these points over there: https://github.com/neurobagel/documentation/wiki/Neurobagel-Data-Model-limitations