neurobagel / digest

Web dashboard for neuroimaging-derived and phenotypic datasets
https://digest.neurobagel.org
MIT License
2 stars 2 forks source link

Update raw data + pipeline-specific columns in bagel schema #44

Closed alyssadai closed 1 year ago

alyssadai commented 1 year ago

Current schema and README describing column attributes in schema can be found here: https://github.com/neurobagel/proc_dash/tree/main/schemas

Proposed changes

Questions

alyssadai commented 1 year ago

@nikhil153 @michellewang here's the issue spec for the bagel schema updates we discussed today!

Let me know if this sounds good and if you have any feedback, particularly on the questions. 🙂

alyssadai commented 1 year ago

Had one other thought as I was updating the schema. Currently in bagel_schema.json, the key for a given column is the exact expected column name, e.g.

        "PHASE__": {
            "Description": ...,
            "dtype": "str",
            "IsRequired": false,
            "Range": ["SUCCESS", "FAIL", "INCOMPLETE", "UNAVAILABLE"],
            "IsPrefixedColumn": true

(where "PHASE__" is the expected column prefix). But I noticed that on mr_proc's end, this means you currently need to manually update your pipeline trackers (e.g., fmriprep_tracker.py) when there's a change to the column names in the schema, due to how these names are 'hard-coded' in your tracker_configs variables, e.g. https://github.com/neurodatascience/mr_proc/blob/43b3874af8e74167fba03e9abb096603264e0494/trackers/fmriprep_tracker.py#L104-L116

To avoid you needing to update all the pipeline trackers if the column names change, I'm wondering if it would be helpful to just encode the actual column name as another field, and treating the keys essentially 'variable names' instead (which won't change). So, for example, the snippet above would become something like:

        "phase": {
            "ColumnLabel": "PHASE__",
            "Description": ...,
            "dtype": "str",
            "IsRequired": false,
            "Range": ["SUCCESS", "FAIL", "INCOMPLETE", "UNAVAILABLE"],
            "IsPrefixedColumn": true

(For most of the non-prefixed columns, the key and ColumnLabel would probably be the same). This way, to fetch the appropriate prefix of "phase" columns for example when constructing the bagel.csv, you could just access the value of something like schema_group["phase"]["ColumnLabel"] (if schema_group is your dict storing the "PIPELINE_STATUS_COLUMNS" group from the schema). This way, even if the ColumnLabel value changes in the schema, when you generate a new bagel.csv, the columns should be updated automatically.

Without having tested this out, let me know if this sounds like it would make your lives easier @nikhil153 @michellewang. From a quick scan of your tracker scripts, it looks like with this change your function tracker.get_pipe_tasks might also need to be updated. Obviously if it seems like too much of a hassle, we can keep things as they are now :)

michellewang commented 1 year ago

I think this would work (though @nikhil153 should probably confirm). What we have now already works (and it's not too much trouble to update all the trackers if a key changes) and the new logic would be a little more complex but it could help if we want to scale things up in the future.

alyssadai commented 1 year ago

Okay in that case, for simplicity's sake I'll leave the creation of a specific 'column name' attribute to a separate schema enhancement, see #48. Let's continue the discussion there.