Closed alyssadai closed 1 year ago
@nikhil153 @michellewang here's the issue spec for the bagel schema updates we discussed today!
Let me know if this sounds good and if you have any feedback, particularly on the questions. 🙂
Had one other thought as I was updating the schema. Currently in bagel_schema.json, the key for a given column is the exact expected column name, e.g.
"PHASE__": {
"Description": ...,
"dtype": "str",
"IsRequired": false,
"Range": ["SUCCESS", "FAIL", "INCOMPLETE", "UNAVAILABLE"],
"IsPrefixedColumn": true
(where "PHASE__"
is the expected column prefix). But I noticed that on mr_proc's end, this means you currently need to manually update your pipeline trackers (e.g., fmriprep_tracker.py) when there's a change to the column names in the schema, due to how these names are 'hard-coded' in your tracker_configs
variables, e.g.
https://github.com/neurodatascience/mr_proc/blob/43b3874af8e74167fba03e9abb096603264e0494/trackers/fmriprep_tracker.py#L104-L116
To avoid you needing to update all the pipeline trackers if the column names change, I'm wondering if it would be helpful to just encode the actual column name as another field, and treating the keys essentially 'variable names' instead (which won't change). So, for example, the snippet above would become something like:
"phase": {
"ColumnLabel": "PHASE__",
"Description": ...,
"dtype": "str",
"IsRequired": false,
"Range": ["SUCCESS", "FAIL", "INCOMPLETE", "UNAVAILABLE"],
"IsPrefixedColumn": true
(For most of the non-prefixed columns, the key and ColumnLabel
would probably be the same). This way, to fetch the appropriate prefix of "phase" columns for example when constructing the bagel.csv
, you could just access the value of something like schema_group["phase"]["ColumnLabel"]
(if schema_group
is your dict storing the "PIPELINE_STATUS_COLUMNS"
group from the schema). This way, even if the ColumnLabel
value changes in the schema, when you generate a new bagel.csv, the columns should be updated automatically.
Without having tested this out, let me know if this sounds like it would make your lives easier @nikhil153 @michellewang. From a quick scan of your tracker scripts, it looks like with this change your function tracker.get_pipe_tasks might also need to be updated. Obviously if it seems like too much of a hassle, we can keep things as they are now :)
I think this would work (though @nikhil153 should probably confirm). What we have now already works (and it's not too much trouble to update all the trackers if a key changes) and the new logic would be a little more complex but it could help if we want to scale things up in the future.
Okay in that case, for simplicity's sake I'll leave the creation of a specific 'column name' attribute to a separate schema enhancement, see #48. Let's continue the discussion there.
Current schema and README describing column attributes in schema can be found here: https://github.com/neurobagel/proc_dash/tree/main/schemas
Proposed changes
has_mri_data
column will now be optional ("IsRequired"=false
)HAS_CONTRAST__
(or something similar) corresponding to thehasContrastType
property in Neurobagel data modelPHASE_
andSTAGE_
columns with the pipeline name + version to group these according to pipeline, and change prefix separator from_
to__
(double underscore)PHASE_
becomesPHASE__<pipeline_name>-<pipeline_version>__
, e.g.PHASE__fmriprep-20.2.7__anat
"IsRequired"
,"dtype"
, etc.) called"TrackerSource"
(?) that describes what kind of data (bagel) the column appears in/comes from"TrackerSource"
:"TrackerSource": "imaging"
(covers both derivatives and raw/bids bagels, but we could also name these separately if helpful)"TrackerSource": "phenotypic"
"TrackerSource": "all"
(meaning it appears in all bagels - I guess this would be onlyparticipant_id
)Questions
visit
column (likesession
, but for phenotypic data?) was mentioned briefly. Do we want to add it in this iteration?"IsPrefixedColumn"=true
) to a double underscore__
for consistency? Affected columns would be:"DATATYPE_"
,"PHASE_"
,"STAGE_"
,"HAS_CONTRAST__"
(or whatever we want to call the optional imaging contrast column)