Closed by surchs 6 months ago
any preferences on the name of the repo? we already have https://github.com/neurobagel/openneuro-annotations, so https://github.com/neurobagel/indi-annotations might be obvious?
Notes in no particular order:
The participant ID appears as 50642 in the phenotypic TSV, and in the BIDS dir as sub-0050642
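A minimal sketch of that ID fix (the seven-digit zero padding is an assumption based on the example above; other sites may use a different width):

```python
def bidsify_participant_id(raw_id, width=7):
    """Turn a bare numeric ID like 50642 into a BIDS label like sub-0050642.

    The zero-padding width is an assumption inferred from the example above.
    """
    return f"sub-{int(raw_id):0{width}d}"

print(bidsify_participant_id("50642"))  # -> sub-0050642
```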
🤷 OK, looks like we will get at least another 2752 subjects from these (excluding CORR and a few failures so far). Nice!
😢
In case it helps minimize duplicated effort: see https://github.com/neurobagel/bulk_annotations?tab=readme-ov-file#running-the-bagel-cli-on-bulk-annotated-data for the scripts we used to run the CLI automatically on annotated OpenNeuro datasets!
Also, https://github.com/neurobagel/bulk_annotations/blob/main/datalad_get_single_dataset.sh
For the reviewer: take a look at https://github.com/neurobagel/indi-annotations, particularly the README, and see if it all makes sense. The .jsonld files are in /data.
I think there may be an issue with the new INDI datasets not having a known "origin site". I can't remember how that information gets added, or whether it comes from the federation API. Also, the links and dataset paths are likely not useful.
edit: the flag is --portal "https://datasets.datalad.org/..."
Hi @surchs, thanks for your (very speedy) work on this!
I've taken a look through the indi-annotations repo, and have left some questions + suggestions below for making the process easier to understand/repeat in future (trying to avoid some of the problems we had with openneuro-annotations). Some of the comments might be best addressed in their own issues.
[x] Could you link the datalad webpages where you can explore each of the harmonized superdatasets, e.g. https://datasets.datalad.org/?dir=/abide/RawDataBIDS ? I find these helpful as a quick reference (since it looks like the INDI datasets are not all explorable in the Datalad GitHub repo directly, unlike OpenNeuro), whereas https://datasets.datalad.org/abide for example hides some of the files that we use.
[x] dataset_path_mappings.tsv
- could you briefly document how (and at what points in your process) you created the columns in this file (e.g., was dataset_path populated manually)? It looks like it's used in the step 1 script, but I'm assuming the data dictionaries in the use_dictionary column did not exist yet at that point.
[x] For step 2 reproducibility, would you mind including the exact command (or an example with actual dataset paths) you used to run the 'combine participant TSVs' script for each specific superdataset? For example, I assumed based on your notes in other README sections that you only ran this for ABIDE 1, ABIDE 2, and ADHD 200, since CORR had an aggregated file already, but was confused initially when I read that you only ran the command for 3 datasets.
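For reference, the 'combine participant TSVs' step could look roughly like this (a hedged sketch, not the actual script; the directory layout of one subdirectory per site and the added site column are assumptions):

```python
import csv
import glob
import os

def combine_participant_tsvs(superdataset_dir, out_path):
    """Concatenate per-site participants.tsv files into one table,
    recording which site each row came from.

    The layout <superdataset>/<site>/participants.tsv is an assumption
    for illustration; the real INDI superdatasets may differ.
    """
    rows, fieldnames = [], ["site"]
    for tsv in sorted(glob.glob(os.path.join(superdataset_dir, "*", "participants.tsv"))):
        site = os.path.basename(os.path.dirname(tsv))
        with open(tsv, newline="") as f:
            for row in csv.DictReader(f, delimiter="\t"):
                row["site"] = site
                rows.append(row)
                # Keep the union of columns across sites, in first-seen order
                for col in row:
                    if col not in fieldnames:
                        fieldnames.append(col)
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames, delimiter="\t", restval="")
        writer.writeheader()
        writer.writerows(rows)
```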
[x] For extra clarity, could you include under the step 3 heading the file names of the final Neurobagel data dictionaries generated using the annotation tool? Alternatively, maybe we can move these to their own subdirectory annotated_data_dictionaries/
Any info on:
[x] how long it took for the CLI to run on each dataset-site?
[x] the number/names of INDI (super)datasets left that we did not harmonize? (I always have trouble finding the full list, and they're not grouped on Datalad AFAICT - do you have a resource for this?)
[x] Should we add the 'fixed' phenotypic TSVs for each dataset to indi-annotations for easier CLI rerunning?
[ ] 🍒 : should we add a table to the README that includes the total sample size per superdataset? I think we've often had this question for QPN/PPMI - may be helpful to have this on hand
participants2bids.py

Just for our own knowledge - did you use the latest version of the CLI and API images?

Alright, thanks a lot for this very thorough review @alyssadai :tada: I feel like I have graduated from a degree program now :man_student:
I addressed several of your points in the linked PR, but I'll reply to some inline below.
> how long it took for the CLI to run on each dataset-site?

Unsure, we'd need to rerun this and log. Roughly speaking: not very long, and I didn't use any parallelism. So let's say "fast enough".
> the number/names of INDI (super)datasets left that we did not harmonize? (I always have trouble finding the full list, and they're not grouped on Datalad AFAICT - do you have a resource for this?)

Again, tricky to answer. INDI is an initiative; technically only the FCON1000 datasets are really "theirs" (I think), and we don't have those. The four superdatasets we do have are the ones DataLad has, so we got 4/4. But we did not get every site out of the 4, and now it'd make sense to figure out which ones we're missing. Seems to me that we got
> Should we add the 'fixed' phenotypic TSVs for each dataset to indi-annotations for easier CLI rerunning?

I was considering this, but I'd rather not. This would be us directly and openly resharing the pheno data, and I'd rather get the OK from INDI for that first.
> should we add a table to the README that includes the total sample size per superdataset? I think we've often had this question for QPN/PPMI - may be helpful to have this on hand

Maybe, but I'd say this would be a new issue - beyond the scope of this PR.
> We should start collecting variables in phenotypic TSVs we aren't able to annotate

Yes, we should! I started this in my own notes. We should probably combine it with the OpenNeuro notes and address it in https://github.com/neurobagel/planning/issues/67
> Could you open a CLI issue to add the command order to the help text?

https://github.com/neurobagel/bagel-cli/issues/279
> the only "manual" data cleaning step was to BIDSify the participant ID column?

That, and adding dataset_description.json for datasets that didn't have it (and thus were not BIDS-valid).
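For datasets missing it, a minimal dataset_description.json with only the two fields BIDS requires (Name and BIDSVersion) could be generated like this; the BIDSVersion value below is a placeholder assumption:

```python
import json
from pathlib import Path

def add_dataset_description(dataset_dir, name, bids_version="1.8.0"):
    """Write a minimal dataset_description.json so the dataset can pass
    BIDS validation. Name and BIDSVersion are the only REQUIRED fields;
    the default version string here is an assumption."""
    desc = {"Name": name, "BIDSVersion": bids_version}
    out = Path(dataset_dir) / "dataset_description.json"
    out.write_text(json.dumps(desc, indent=2))
    return out
```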
> Just for our own knowledge - did you use the latest version of the CLI and API images?

Yes, I did.
With the annotation tool!