neurobagel / planning


Annotate INDI datasets #117

Closed surchs closed 6 months ago

surchs commented 7 months ago

With the annotation tool!

surchs commented 7 months ago

Any preferences on the name of the repo? We already have https://github.com/neurobagel/openneuro-annotations, so https://github.com/neurobagel/indi-annotations might be the obvious choice?

surchs commented 7 months ago

Notes in no particular order:

OK, looks like we will get at least another 2752 subjects from these (without CORR and some failures so far). Nice!

Image

😢

alyssadai commented 7 months ago

In case it helps to minimize duplicated effort: see https://github.com/neurobagel/bulk_annotations?tab=readme-ov-file#running-the-bagel-cli-on-bulk-annotated-data for the scripts we used to run the CLI automatically on the annotated OpenNeuro datasets!

Also, https://github.com/neurobagel/bulk_annotations/blob/main/datalad_get_single_dataset.sh

surchs commented 7 months ago

For the reviewer: take a look at https://github.com/neurobagel/indi-annotations, particularly the README, and see if it all makes sense. The .jsonld files are in /data.

surchs commented 7 months ago

I think there may be an issue with the new INDI datasets not having a known "origin site"? I can't remember how that information gets added, or whether it comes from the federation API. Also, the links and dataset paths are likely not useful.

edit: the flag is --portal "https://datasets.datalad.org/..."

alyssadai commented 6 months ago

Hi @surchs, thanks for your (very speedy) work on this!

I've taken a look through the indi-annotations repo, and have left some questions + suggestions below for making the process easier to understand/repeat in future (trying to avoid some of the problems we had with openneuro-annotations). Some of the comments might be best addressed in their own issues.

In the README:

Data

Any info on:

Code:

participants2bids.py

Annotation

General:

surchs commented 6 months ago

Alright, thanks a lot for this very thorough review @alyssadai :tada: I feel like I have graduated from a degree program now :man_student:

I addressed several of your points in the linked PR, but for some I'll reply inline

~ how long it took for CLI to run on each dataset-site?

Unsure; we'd need to rerun this and log the timings. Roughly speaking: not very long, and I didn't use any parallelism. So let's say "fast enough".
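If we do rerun and log, something along these lines could capture the wall-clock time per dataset-site. This is only a sketch: `time_command` and `time_all` are hypothetical helpers, and the placeholder job just runs the Python interpreter, so each command would need to be swapped for the actual CLI call used for that dataset-site.

```python
import subprocess
import sys
import time


def time_command(label, command):
    """Run one command and return (label, return code, elapsed seconds)."""
    start = time.perf_counter()
    completed = subprocess.run(command, capture_output=True, text=True)
    elapsed = time.perf_counter() - start
    return label, completed.returncode, elapsed


def time_all(jobs):
    """Run each (label, command) pair one after another, without parallelism."""
    return [time_command(label, command) for label, command in jobs]


if __name__ == "__main__":
    # Placeholder job (assumption): replace the command with the real CLI
    # invocation for one dataset-site.
    jobs = [("example-site", [sys.executable, "-c", "pass"])]
    for label, returncode, elapsed in time_all(jobs):
        print(f"{label}\texit={returncode}\t{elapsed:.2f}s")
```

Keeping the return code next to the timing also makes it easy to spot which dataset-sites failed during the rerun.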

the number/names of INDI (super)datasets left that we did not harmonize? (I always have trouble finding the full list, and they're not grouped on Datalad AFAICT - do you have a resource for this?)

Again, tricky to answer. INDI is an initiative; technically only the FCON1000 datasets are really "theirs" (I think), and we don't have those. The 4 we do have are the ones DataLad has, so we got 4/4. But we did not get every site out of those 4. Now it'd make sense to figure out which ones we're missing. Seems to me that we got

Should we add the 'fixed' phenotypic TSVs for each dataset to indi-annotations for easier CLI rerunning?

I was considering this, but I'd rather not. This would be us directly and openly resharing the pheno data, and I'd rather get the OK from INDI for that first.

should we add a table to the README that includes the total sample size per superdataset? I think we've often had this question for QPN/PPMI - may be helpful to have this on hand

Maybe, but I'd say this would be a new issue, beyond this PR.

We should start collecting variables in phenotypic TSVs we aren't able to annotate

Yes we should! I started this in my own notes. We should probably combine them with the OpenNeuro notes and address this in https://github.com/neurobagel/planning/issues/67
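One possible way to collect these automatically, rather than from notes, is to compare the TSV header against the data dictionary. A minimal sketch, assuming that an annotated column carries an "Annotations" key in its data dictionary entry; `unannotated_columns` is a hypothetical helper and the example data is made up.

```python
import csv
import io


def unannotated_columns(tsv_text, dictionary):
    """Return TSV column names that have no "Annotations" entry in the dictionary."""
    header = next(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
    return [
        column
        for column in header
        # Assumption: annotated columns have an "Annotations" key in their entry.
        if "Annotations" not in dictionary.get(column, {})
    ]


if __name__ == "__main__":
    # Made-up example data, for illustration only.
    example_tsv = "participant_id\tage\thandedness\nsub-01\t23\tR\n"
    example_dictionary = {
        "participant_id": {"Annotations": {}},
        "age": {"Annotations": {}},
        "handedness": {"Description": "Dominant hand"},
    }
    print(unannotated_columns(example_tsv, example_dictionary))
```

Columns that are missing from the dictionary altogether are reported as well, since they are also not annotated.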

Could you open a CLI issue to add the command order to the help text?

https://github.com/neurobagel/bagel-cli/issues/279

the only "manual" data cleaning step was to BIDSify the participant ID column?

That, and adding dataset_description.json for datasets that didn't have it (and thus were not BIDS valid).
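For future reference, the two manual steps could look roughly like the sketch below. This is not the code from participants2bids.py; `bidsify_id` and `ensure_dataset_description` are hypothetical helpers, and the default `bids_version` is a placeholder. It relies on two BIDS rules: participant labels are `sub-` followed by letters and digits only, and dataset_description.json needs at least `Name` and `BIDSVersion`.

```python
import json
import re
import tempfile
from pathlib import Path


def bidsify_id(raw_id):
    """Turn a raw participant ID into a BIDS-style label such as "sub-0012"."""
    label = str(raw_id).strip()
    if label.startswith("sub-"):
        label = label[len("sub-"):]
    # BIDS labels may only contain letters and digits.
    label = re.sub(r"[^0-9A-Za-z]", "", label)
    if not label:
        raise ValueError(f"No usable characters in participant ID {raw_id!r}")
    return f"sub-{label}"


def ensure_dataset_description(dataset_dir, name, bids_version="1.8.0"):
    """Write a minimal dataset_description.json unless one already exists."""
    path = Path(dataset_dir) / "dataset_description.json"
    if not path.exists():
        # "Name" and "BIDSVersion" are the two required fields.
        content = {"Name": name, "BIDSVersion": bids_version}
        path.write_text(json.dumps(content, indent=2) + "\n")
    return path


if __name__ == "__main__":
    print(bidsify_id("0051_456"))
    with tempfile.TemporaryDirectory() as tmp:
        print(ensure_dataset_description(tmp, "example dataset").read_text())
```

Participant IDs should be read as strings (not integers) before calling `bidsify_id`, otherwise leading zeros are lost.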

Just for our own knowledge - did you use the latest version of the CLI and API images?

Yes, I did.