neurobagel / planning

MIT License

Update bulk OpenNeuro annotations following Neurobagel vocab changes #50

Closed · alyssadai closed this issue 1 year ago

alyssadai commented 1 year ago

Relevant vocab-related changes

Namespace used for healthy controls/subject groups (JSON)

Age heuristics (JSON)

Incorporation of nb term vocabulary (graph database)

Steps to perform update

alyssadai commented 1 year ago

For reviewer:

You can check the changes to the open_neuro:

Note that while 341/441 total ON datasets succeeded at the CLI step (with manual help in the case of 1 dataset) and so had JSONLD files regenerated, only 340 were successfully uploaded, because one JSONLD (ds002000) had NaN values for age, which cannot be parsed. This happened because the JSON for that dataset did not include "NaN" among the missing values for age, so the CLI converted the string to the float value nan.
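To see why a float nan breaks the upload, here is a minimal standalone illustration (not Neurobagel code): Python's default JSON serializer emits a bare `NaN` token, which is not valid JSON, and strict serializers/parsers reject it outright.

```python
import json
import math

# When "NaN" is not declared as a missing value, the string gets coerced
# to a float during parsing:
age = float("NaN")
print(math.isnan(age))  # True

# Python's json module will happily emit the token NaN by default...
print(json.dumps({"age": age}))  # {"age": NaN} -- NOT valid JSON

# ...but any strict serializer or downstream parser rejects it:
try:
    json.dumps({"age": age}, allow_nan=False)
except ValueError as err:
    print("rejected:", err)
```

This is the same failure mode the graph upload hit for ds002000.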

surchs commented 1 year ago

This is due to the JSON for this dataset not including "NaN" in the missing values for age, and the CLI then converting that to a float value -> nan.

Would it make sense to just add this value to the hard-coded default missing values here: https://github.com/neurobagel/bulk_annotations/blob/313447e642324684a3e2e39d2f35a7661206653d/process_annotation_to_dict.py#L105
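A minimal sketch of the suggested fix; the names below are assumptions for illustration, not the actual `bulk_annotations` code. The idea is to treat "NaN" as a known missing marker up front, so it never reaches the float-conversion step:

```python
# Hypothetical names -- the real list lives in process_annotation_to_dict.py
DEFAULT_MISSING_VALUES = ["", "n/a", "N/A", "NaN"]  # "NaN" newly added

def parse_age(raw: str, missing_values=DEFAULT_MISSING_VALUES):
    """Return a float age, or None if the raw value is a known missing marker."""
    if raw.strip() in missing_values:
        return None  # never produces float("NaN")
    return float(raw)

print(parse_age("23.5"))  # 23.5
print(parse_age("NaN"))   # None, instead of an unserializable nan
```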

surchs commented 1 year ago

One more question @alyssadai: is the new graph already live for the OpenNeuro API? If I run an empty query through the query tool I get 334 datasets / 16,206 subjects, which might be the old numbers, but I cannot recall exactly.

alyssadai commented 1 year ago

Hey @surchs,

Would it make sense to just add this value to the hard-coded default missing values here: https://github.com/neurobagel/bulk_annotations/blob/313447e642324684a3e2e39d2f35a7661206653d/process_annotation_to_dict.py#L105

I think that looks right. If I understand/remember the code correctly, the other function with hardcoded missing values in that script, get_missing, handles the discrete columns, right? So I think the function you indicated would be the right spot to address the current issue.

is the new graph already live for the openneuro API? If I'm running an empty query through the query tool I get 334 datasets, 16206 subjects

Yes, the live graph contains the updated OpenNeuro data. You can also confirm this by querying for healthy controls: previously this would have returned nothing, since the in-graph TermURLs for healthy control were out of date with our other tools.

I think we ran into this issue before, where not all datasets/subjects in the graph were returned by an empty query. I'll investigate this again tomorrow, but our hypothesis last time was that the query currently sent by the API assumes there is at least one session per subject in each dataset, whereas subjects with only phenotypic data don't have sessions under our current data model. As a result, it's possible that datasets whose subjects lack sessions simply never match any API query to the graph (?).
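The hypothesized mismatch can be sketched as two SPARQL patterns (the property names here are illustrative, not the real Neurobagel data model): a query that *requires* a session triple silently drops subjects that have none, while wrapping the triple in OPTIONAL keeps them.

```python
# Illustrative SPARQL fragments, held as Python strings for comparison.
REQUIRED_SESSION = """
SELECT ?dataset ?subject WHERE {
    ?dataset nb:hasSamples ?subject .
    ?subject nb:hasSession ?session .   # phenotype-only subjects never match
}
"""

WITH_OPTIONAL_SESSION = """
SELECT ?dataset ?subject WHERE {
    ?dataset nb:hasSamples ?subject .
    OPTIONAL { ?subject nb:hasSession ?session . }  # subjects kept either way
}
"""
```

If the API's template looks like the first form, an empty query would undercount exactly as observed.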

TL;DR: I think this is a problem with our query template in the API rather than those "missing" datasets not actually being in the graph. You can also confirm this by running a simple query in Stardog Studio to count the datasets in the graph, which should return 340.
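For reference, the count check could look something like this (the prefix and class name are assumptions about the Neurobagel vocab; adjust to the real terms before running):

```python
# A count query you could paste into Stardog Studio; vocab terms are assumed.
COUNT_DATASETS = """
PREFIX nb: <http://neurobagel.org/vocab/>
SELECT (COUNT(DISTINCT ?dataset) AS ?n)
WHERE { ?dataset a nb:Dataset . }
"""

# Sending it programmatically to a hypothetical local Stardog endpoint:
# import requests
# r = requests.post(
#     "http://localhost:5820/my_db/query",       # assumed endpoint/db name
#     data={"query": COUNT_DATASETS},
#     headers={"Accept": "application/sparql-results+json"},
# )
# print(r.json())  # ?n should be 340 if all uploads landed
```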

surchs commented 1 year ago

That sounds reasonable as an explanation. I would say we close this issue here, and then:

I think it makes more sense to address the missing values like this instead of adding more workarounds as I initially suggested.

Could you create the new tracking issue for the #datasets mismatch and then close this one?