Update raw data + pipeline-specific columns in bagel schema

alyssadai commented 1 year ago

Current schema and README describing column attributes in schema can be found here: https://github.com/neurobagel/proc_dash/tree/main/schemas

Proposed changes

[x] has_mri_data column will now be optional ("IsRequired"=false)
[x] Create new optional prefixed column HAS_CONTRAST__ (or something similar) corresponding to the hasContrastType property in Neurobagel data model
[x] Add second prefix to PHASE_ and STAGE_ columns with the pipeline name + version to group these according to pipeline, and change prefix separator from _ to __ (double underscore)
- so PHASE_ becomes PHASE__<pipeline_name>-<pipeline_version>__, e.g. PHASE__fmriprep-20.2.7__anat
[ ] (NOT PLANNED FOR NOW) add another attribute to each column in the schema (in addition to "IsRequired", "dtype", etc.) called "TrackerSource" (?) that describes what kind of data (bagel) the column appears in/comes from
- this is a preemptive organizational feature so that in future, if we want to differentiate types of columns (e.g., pheno vs imaging) visually in the dashboard, I can hopefully do that more easily
- for now, a given column will have one of the following values for "TrackerSource":
- "TrackerSource": "imaging" (covers both derivatives and raw/bids bagels, but we could also name these separately if helpful)
- "TrackerSource": "phenotypic"
- "TrackerSource": "all" (meaning it appears in all bagels - I guess this would be only participant_id)

Questions

[ ] (NOT URGENT) Adding a visit column (like session, but for phenotypic data?) was mentioned briefly. Do we want to add it in this iteration?
[x] Should we update all separators for prefixed columns ("IsPrefixedColumn"=true) to a double underscore __ for consistency? Affected columns would be:
- "DATATYPE_", "PHASE_", "STAGE_", "HAS_CONTRAST__" (or whatever we want to call the optional imaging contrast column)

alyssadai commented 1 year ago

@nikhil153 @michellewang here's the issue spec for the bagel schema updates we discussed today!

Let me know if this sounds good and if you have any feedback, particularly on the questions. 🙂

alyssadai commented 1 year ago

Had one other thought as I was updating the schema. Currently in bagel_schema.json, the key for a given column is the exact expected column name, e.g.

        "PHASE__": {
            "Description": ...,
            "dtype": "str",
            "IsRequired": false,
            "Range": ["SUCCESS", "FAIL", "INCOMPLETE", "UNAVAILABLE"],
            "IsPrefixedColumn": true

(where "PHASE__" is the expected column prefix). But I noticed that on mr_proc's end, this means you currently need to manually update your pipeline trackers (e.g., fmriprep_tracker.py) when there's a change to the column names in the schema, due to how these names are 'hard-coded' in your tracker_configs variables, e.g. https://github.com/neurodatascience/mr_proc/blob/43b3874af8e74167fba03e9abb096603264e0494/trackers/fmriprep_tracker.py#L104-L116

To avoid you needing to update all the pipeline trackers if the column names change, I'm wondering if it would be helpful to just encode the actual column name as another field, and treating the keys essentially 'variable names' instead (which won't change). So, for example, the snippet above would become something like:

        "phase": {
            "ColumnLabel": "PHASE__",
            "Description": ...,
            "dtype": "str",
            "IsRequired": false,
            "Range": ["SUCCESS", "FAIL", "INCOMPLETE", "UNAVAILABLE"],
            "IsPrefixedColumn": true

(For most of the non-prefixed columns, the key and ColumnLabel would probably be the same). This way, to fetch the appropriate prefix of "phase" columns for example when constructing the bagel.csv, you could just access the value of something like schema_group["phase"]["ColumnLabel"] (if schema_group is your dict storing the "PIPELINE_STATUS_COLUMNS" group from the schema). This way, even if the ColumnLabel value changes in the schema, when you generate a new bagel.csv, the columns should be updated automatically.

Without having tested this out, let me know if this sounds like it would make your lives easier @nikhil153 @michellewang. From a quick scan of your tracker scripts, it looks like with this change your function tracker.get_pipe_tasks might also need to be updated. Obviously if it seems like too much of a hassle, we can keep things as they are now :)

michellewang commented 1 year ago

I think this would work (though @nikhil153 should probably confirm). What we have now already works (and it's not too much trouble to update all the trackers if a key changes) and the new logic would be a little more complex but it could help if we want to scale things up in the future.

alyssadai commented 1 year ago

Okay in that case, for simplicity's sake I'll leave the creation of a specific 'column name' attribute to a separate schema enhancement, see #48. Let's continue the discussion there.

neurobagel / digest

Update raw data + pipeline-specific columns in bagel schema #44

Proposed changes

Questions