Closed by surchs 6 months ago
any preferences on the name of the repo? we already have https://github.com/neurobagel/openneuro-annotations, so https://github.com/neurobagel/indi-annotations might be obvious?
Notes in no particular order:
The participant ID appears as 50642 in the phenotypic TSV, and in the BIDS dir as sub-0050642
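A minimal sketch of that ID fix (the seven-digit zero padding is an assumption based on the example above; other sites may use a different width):

```python
def bidsify_participant_id(raw_id, width=7):
    """Turn a bare numeric ID like 50642 into a BIDS label like sub-0050642.

    The zero-padding width is an assumption inferred from the example above.
    """
    return f"sub-{int(raw_id):0{width}d}"

print(bidsify_participant_id("50642"))  # -> sub-0050642
```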
🤷 OK, looks like we will get at least another 2752 subjects from these (excluding CORR and a few failures so far). Nice!
😢
In case it helps minimize duplicated effort: see https://github.com/neurobagel/bulk_annotations?tab=readme-ov-file#running-the-bagel-cli-on-bulk-annotated-data for the scripts we used to run the CLI automatically on annotated OpenNeuro datasets!
Also, https://github.com/neurobagel/bulk_annotations/blob/main/datalad_get_single_dataset.sh
For the reviewer: take a look at https://github.com/neurobagel/indi-annotations, particularly the README, and see if it all makes sense. The .jsonld files are in /data.
I think there may be an issue with the new INDI datasets not having a known "origin site". I can't remember how that information gets added, or whether it comes from the federation API. Also, the links and dataset paths are likely not useful.
edit: the flag is --portal "https://datasets.datalad.org/..."
Hi @surchs, thanks for your (very speedy) work on this!
I've taken a look through the indi-annotations repo, and have left some questions + suggestions below for making the process easier to understand/repeat in future (trying to avoid some of the problems we had with openneuro-annotations). Some of the comments might be best addressed in their own issues.
[x] Could you link the datalad webpages where you can explore each of the harmonized superdatasets, e.g. https://datasets.datalad.org/?dir=/abide/RawDataBIDS ? I find these helpful as a quick reference (since it looks like the INDI datasets are not all explorable in the Datalad GitHub repo directly, unlike OpenNeuro), whereas https://datasets.datalad.org/abide for example hides some of the files that we use.
[x] dataset_path_mappings.tsv
- could you briefly document how (and at what points in your process) you created the columns in this file (e.g., was dataset_path populated manually)? It looks like it's used in the step 1 script, but I'm assuming the data dictionaries in the use_dictionary column did not exist yet at that point.
[x] For step 2 reproducibility, would you mind including the exact command (or an example with actual dataset paths) you used to run the 'combine participant TSVs' script for each specific superdataset? For example, I assumed based on your notes in other README sections that you only ran this for ABIDE 1, ABIDE 2, and ADHD 200, since CORR had an aggregated file already, but was confused initially when I read that you only ran the command for 3 datasets.
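For reference, the 'combine participant TSVs' step could look roughly like this (a hedged sketch, not the actual script; the directory layout of one subdirectory per site and the added site column are assumptions):

```python
import csv
import glob
import os

def combine_participant_tsvs(superdataset_dir, out_path):
    """Concatenate per-site participants.tsv files into one table,
    recording which site each row came from.

    The layout <superdataset>/<site>/participants.tsv is an assumption
    for illustration; the real INDI superdatasets may differ.
    """
    rows, fieldnames = [], ["site"]
    for tsv in sorted(glob.glob(os.path.join(superdataset_dir, "*", "participants.tsv"))):
        site = os.path.basename(os.path.dirname(tsv))
        with open(tsv, newline="") as f:
            for row in csv.DictReader(f, delimiter="\t"):
                row["site"] = site
                rows.append(row)
                # Keep the union of columns across sites, in first-seen order
                for col in row:
                    if col not in fieldnames:
                        fieldnames.append(col)
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames, delimiter="\t", restval="")
        writer.writeheader()
        writer.writerows(rows)
```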
[x] For extra clarity, could you include under the step 3 heading the file names of the final Neurobagel data dictionaries generated using the annotation tool? Alternatively, maybe we can move these to their own subdirectory annotated_data_dictionaries/
Any info on:
[x] how long it took for the CLI to run on each dataset-site?
[x] the number/names of INDI (super)datasets left that we did not harmonize? (I always have trouble finding the full list, and they're not grouped on Datalad AFAICT - do you have a resource for this?)
[x] Should we add the 'fixed' phenotypic TSVs for each dataset to indi-annotations for easier CLI rerunning?
[ ] 🍒 : should we add a table to the README that includes the total sample size per superdataset? I think we've often had this question for QPN/PPMI - may be helpful to have this on hand
participants2bids.py

Just for our own knowledge - did you use the latest version of the CLI and API images?

Alright, thanks a lot for this very thorough review @alyssadai :tada: I feel like I have graduated from a degree program now :man_student:
I addressed several of your points in the linked PR, but I'll reply to some inline below.
> how long it took for the CLI to run on each dataset-site?

Unsure, we'd need to rerun this and log. Roughly speaking: not very long, and I didn't use any parallelism. So let's say "fast enough".
> the number/names of INDI (super)datasets left that we did not harmonize? (I always have trouble finding the full list, and they're not grouped on Datalad AFAICT - do you have a resource for this?)

Again, tricky to answer. INDI is an initiative; technically only the FCON1000 datasets are really "theirs" (I think), and we don't have those. The four superdatasets we do have are the ones DataLad has, so we got 4/4. But we did not get every site out of the 4, and now it'd make sense to figure out which ones we're missing. Seems to me that we got
> Should we add the 'fixed' phenotypic TSVs for each dataset to indi-annotations for easier CLI rerunning?

I was considering this, but I'd rather not. This would be us directly and openly resharing the pheno data, and I'd rather get the OK from INDI for that first.
> should we add a table to the README that includes the total sample size per superdataset? I think we've often had this question for QPN/PPMI - may be helpful to have this on hand

Maybe, but I'd say this would be a new issue - beyond the scope of this PR.
> We should start collecting variables in phenotypic TSVs we aren't able to annotate

Yes, we should! I started this in my own notes. We should probably combine it with the OpenNeuro notes and address it in https://github.com/neurobagel/planning/issues/67
> Could you open a CLI issue to add the command order to the help text?

https://github.com/neurobagel/bagel-cli/issues/279
> the only "manual" data cleaning step was to BIDSify the participant ID column?

That, and adding dataset_description.json for datasets that didn't have it (and thus were not BIDS-valid).
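For datasets missing it, a minimal dataset_description.json with only the two fields BIDS requires (Name and BIDSVersion) could be generated like this; the BIDSVersion value below is a placeholder assumption:

```python
import json
from pathlib import Path

def add_dataset_description(dataset_dir, name, bids_version="1.8.0"):
    """Write a minimal dataset_description.json so the dataset can pass
    BIDS validation. Name and BIDSVersion are the only REQUIRED fields;
    the default version string here is an assumption."""
    desc = {"Name": name, "BIDSVersion": bids_version}
    out = Path(dataset_dir) / "dataset_description.json"
    out.write_text(json.dumps(desc, indent=2))
    return out
```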
> Just for our own knowledge - did you use the latest version of the CLI and API images?

Yes, I did.
With the annotation tool!