Closed: alyssadai closed this issue 1 year ago
For reviewer:
You can check the changes to the OpenNeuro graph as follows:
- The `/attributes/{data_element_URI}` endpoint should still return a list of the instances used in the graph (since the vocabulary file has been added back to the graph after clearing it for the data update).
- Counting the datasets in the graph should return 340:
```sparql
SELECT (COUNT(*) as ?datasets)
WHERE {
    SELECT DISTINCT ?dataset
    WHERE {
        ?dataset a nb:Dataset;
            nb:hasLabel ?dataset_name
    }
}
```
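If you would rather check the count programmatically than in Stardog Studio, a minimal sketch of parsing the endpoint's reply follows. The response body here is a hand-written stand-in (the actual endpoint URL and auth are omitted), but the shape is the standard `application/sparql-results+json` format that Stardog and other SPARQL stores return:

```python
import json

# Hypothetical response body for the count query above, in the standard
# SPARQL JSON results format (head.vars + results.bindings).
sample_response = json.dumps({
    "head": {"vars": ["datasets"]},
    "results": {"bindings": [
        {"datasets": {"type": "literal",
                      "datatype": "http://www.w3.org/2001/XMLSchema#integer",
                      "value": "340"}}
    ]},
})

def parse_count(response_text: str, var: str = "datasets") -> int:
    """Extract a single COUNT binding from a SPARQL JSON results document."""
    bindings = json.loads(response_text)["results"]["bindings"]
    return int(bindings[0][var]["value"])

print(parse_count(sample_response))  # → 340
```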
Note that while 341/441 total ON datasets succeeded at the CLI step (with manual help in the case of 1 dataset) and so had JSONLD files regenerated, only 340 were successfully uploaded, because one JSONLD (ds002000) had NaN values for age, which cannot be parsed. This is due to the JSON for this dataset not including "NaN" in the missing values for age, so the CLI converted it to the float value nan.
Would it make sense to just add this value to the hard-coded default missing values here: https://github.com/neurobagel/bulk_annotations/blob/313447e642324684a3e2e39d2f35a7661206653d/process_annotation_to_dict.py#L105
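A sketch of what that fix would look like, assuming the script filters missing-value markers before float conversion (the names `DEFAULT_MISSING_VALUES` and `clean_age_values` are hypothetical, not the actual identifiers in `process_annotation_to_dict.py`):

```python
import math

# Hypothetical defaults with "NaN" added, as suggested above, so the marker
# is filtered out instead of being coerced to float('nan').
DEFAULT_MISSING_VALUES = ["", "N/A", "NaN"]

def clean_age_values(raw_values):
    """Drop missing-value markers before converting ages to floats."""
    return [float(v) for v in raw_values if v not in DEFAULT_MISSING_VALUES]

# Without the filter, "NaN" would be coerced to float('nan'), which cannot
# be represented in strictly valid JSON and breaks the graph upload.
assert math.isnan(float("NaN"))
print(clean_age_values(["23", "NaN", "31.5"]))  # → [23.0, 31.5]
```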
One more question @alyssadai: is the new graph already live for the OpenNeuro API? If I run an empty query through the query tool I get 334 datasets / 16206 subjects, which might be the old numbers, but I cannot recall exactly.
Hey @surchs,
> Would it make sense to just add this value to the hard-coded default missing values here: https://github.com/neurobagel/bulk_annotations/blob/313447e642324684a3e2e39d2f35a7661206653d/process_annotation_to_dict.py#L105
I think that looks right. If I understand/remember the code correctly, the other function with hardcoded missing values in that script, `get_missing`, is for handling the discrete columns, right? So I think the function you indicated would be the right spot to address the current issue.
> is the new graph already live for the openneuro API? If I'm running an empty query through the query tool I get 334 datasets, 16206 subjects
Yes, the live graph contains the updated OpenNeuro data. You can also confirm this by querying for healthy controls - previously this would have returned nothing since the in-graph TermURLs for healthy control were out of date with our other tools.
I think we ran into this issue previously of not all datasets/subjects in the graph being returned in an empty query. I’ll investigate this again tomorrow, but I think our hypothesis last time was that the current query sent by the API assumes there's at least one session per subject in each dataset, while subjects with only phenotypic data currently don't have sessions due to our data model. As a result, it’s possible that datasets whose subjects have no imaging data (and thus no sessions) just never match any API query to the graph (?).
TLDR, I think this is a problem with our query template in the API rather than those “missing” datasets not actually being in the graph. You can confirm this also by running a simple query to count the datasets in the graph in Stardog Studio, which should return 340.
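The hypothesis above can be illustrated with a toy model (purely illustrative; this is not the actual API query logic, and the dataset names are made up). Requiring a session per subject silently drops phenotypic-only datasets, whereas treating sessions as optional returns everything:

```python
# Toy model: two datasets, one of which has only phenotypic subjects
# (no sessions), mirroring the data-model limitation described above.
datasets = {
    "ds_imaging": [{"id": "sub-01", "sessions": ["ses-01"]}],
    "ds_pheno_only": [{"id": "sub-01", "sessions": []}],  # phenotypic-only
}

def match_requiring_session(data):
    """Mimics a query that assumes at least one session per subject."""
    return [name for name, subs in data.items()
            if any(sub["sessions"] for sub in subs)]

def match_optional_session(data):
    """Mimics a query where sessions are OPTIONAL."""
    return list(data)

print(match_requiring_session(datasets))  # → ['ds_imaging']
print(match_optional_session(datasets))   # → ['ds_imaging', 'ds_pheno_only']
```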
That sounds reasonable as an explanation. I would say we close this issue here then. I think it makes more sense to address the missing values like this instead of adding more workarounds, as I initially suggested.
Could you create a new tracking issue for the dataset-count mismatch and then close?
Relevant vocab-related changes:
- Namespace used for healthy controls/subject groups (JSON): `purl:NCIT_C94342` -> `ncit:C94342`
- Age heuristics (JSON): `nb:{float,int,euro,bounded,iso8061}` -> `nb:{FromFloat,FromInt,FromEuro,FromBounded,FromISO8061}`
- Incorporation of the `nb` term vocabulary (graph database)

Steps to perform update:
- Regenerate the JSONLDs for the OpenNeuroDatasets-JSONLD forks, and push the updated JSONs to the forks
- Add the `nb_vocab.ttl` vocabulary file back to the graph