neurobagel / planning

MIT License

Update bulk OpenNeuro annotations following Neurobagel vocab changes #50

Closed · alyssadai closed this issue 1 year ago

alyssadai commented 1 year ago

Relevant vocab-related changes

Namespace used for healthy controls/subject groups (JSON)

Age heuristics (JSON)

Incorporation of nb term vocabulary (graph database)

Steps to perform update

alyssadai commented 1 year ago

For reviewer:

You can check the changes to the open_neuro:

Note that while 341/441 total ON datasets succeeded at the CLI step (with manual help in the case of 1 dataset) and so had JSONLD files regenerated, only 340 were successfully uploaded, because one JSONLD (ds002000) had NaN values for age, which cannot be parsed. This happened because the JSON for that dataset did not include "NaN" among the missing values for age, so the CLI converted the string to the float value nan.
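To see why a float nan breaks the upload, here is a minimal standalone illustration (not Neurobagel code): Python's default JSON serializer emits a bare `NaN` token, which is not valid JSON, and strict serializers/parsers reject it outright.

```python
import json
import math

# When "NaN" is not declared as a missing value, the string gets coerced
# to a float during parsing:
age = float("NaN")
print(math.isnan(age))  # True

# Python's json module will happily emit the token NaN by default...
print(json.dumps({"age": age}))  # {"age": NaN} -- NOT valid JSON

# ...but any strict serializer or downstream parser rejects it:
try:
    json.dumps({"age": age}, allow_nan=False)
except ValueError as err:
    print("rejected:", err)
```

This is the same failure mode the graph upload hit for ds002000.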

surchs commented 1 year ago

This is due to the JSON for this dataset not including "NaN" in the missing values for age, and the CLI then converting that to a float value -> nan.

Would it make sense to just add this value to the hard-coded default missing values here: https://github.com/neurobagel/bulk_annotations/blob/313447e642324684a3e2e39d2f35a7661206653d/process_annotation_to_dict.py#L105
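A minimal sketch of the suggested fix; the names below are assumptions for illustration, not the actual `bulk_annotations` code. The idea is to treat "NaN" as a known missing marker up front, so it never reaches the float-conversion step:

```python
# Hypothetical names -- the real list lives in process_annotation_to_dict.py
DEFAULT_MISSING_VALUES = ["", "n/a", "N/A", "NaN"]  # "NaN" newly added

def parse_age(raw: str, missing_values=DEFAULT_MISSING_VALUES):
    """Return a float age, or None if the raw value is a known missing marker."""
    if raw.strip() in missing_values:
        return None  # never produces float("NaN")
    return float(raw)

print(parse_age("23.5"))  # 23.5
print(parse_age("NaN"))   # None, instead of an unserializable nan
```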

surchs commented 1 year ago

One more question @alyssadai: is the new graph already live for the OpenNeuro API? If I run an empty query through the query tool I get 334 datasets / 16,206 subjects, which might be the old numbers, but I cannot recall exactly.

alyssadai commented 1 year ago

Hey @surchs,

Would it make sense to just add this value to the hard-coded default missing values here: https://github.com/neurobagel/bulk_annotations/blob/313447e642324684a3e2e39d2f35a7661206653d/process_annotation_to_dict.py#L105

I think that looks right. If I understand/remember the code correctly, the other function with hardcoded missing values in that script, get_missing, handles the discrete columns, right? So I think the function you indicated would be the right spot to address the current issue.

is the new graph already live for the openneuro API? If I'm running an empty query through the query tool I get 334 datasets, 16206 subjects

Yes, the live graph contains the updated OpenNeuro data. You can also confirm this by querying for healthy controls: previously this would have returned nothing, since the in-graph TermURLs for healthy control were out of date with our other tools.

I think we ran into this issue before, where not all datasets/subjects in the graph were returned by an empty query. I'll investigate this again tomorrow, but our hypothesis last time was that the query currently sent by the API assumes there is at least one session per subject in each dataset, whereas subjects with only phenotypic data don't have sessions under our current data model. As a result, it's possible that datasets whose subjects lack sessions simply never match any API query to the graph (?).
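The hypothesized mismatch can be sketched as two SPARQL patterns (the property names here are illustrative, not the real Neurobagel data model): a query that *requires* a session triple silently drops subjects that have none, while wrapping the triple in OPTIONAL keeps them.

```python
# Illustrative SPARQL fragments, held as Python strings for comparison.
REQUIRED_SESSION = """
SELECT ?dataset ?subject WHERE {
    ?dataset nb:hasSamples ?subject .
    ?subject nb:hasSession ?session .   # phenotype-only subjects never match
}
"""

WITH_OPTIONAL_SESSION = """
SELECT ?dataset ?subject WHERE {
    ?dataset nb:hasSamples ?subject .
    OPTIONAL { ?subject nb:hasSession ?session . }  # subjects kept either way
}
"""
```

If the API's template looks like the first form, an empty query would undercount exactly as observed.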

TL;DR: I think this is a problem with our query template in the API rather than those "missing" datasets not actually being in the graph. You can also confirm this by running a simple query in Stardog Studio to count the datasets in the graph, which should return 340.
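For reference, the count check could look something like this (the prefix and class name are assumptions about the Neurobagel vocab; adjust to the real terms before running):

```python
# A count query you could paste into Stardog Studio; vocab terms are assumed.
COUNT_DATASETS = """
PREFIX nb: <http://neurobagel.org/vocab/>
SELECT (COUNT(DISTINCT ?dataset) AS ?n)
WHERE { ?dataset a nb:Dataset . }
"""

# Sending it programmatically to a hypothetical local Stardog endpoint:
# import requests
# r = requests.post(
#     "http://localhost:5820/my_db/query",       # assumed endpoint/db name
#     data={"query": COUNT_DATASETS},
#     headers={"Accept": "application/sparql-results+json"},
# )
# print(r.json())  # ?n should be 340 if all uploads landed
```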

surchs commented 1 year ago

That sounds reasonable as an explanation. I would say we close this issue here, and then:

I think it makes more sense to address the missing values like this instead of adding more workarounds as I initially suggested.

Could you create the new tracking issue for the #datasets mismatch and then close this one?