healthri-imaging / xnat-data-structure

XNAT Data Structure specification

Thoughts on linkage with Arcana "data columns" #1

Open tclose opened 1 year ago

tclose commented 1 year ago

Following on from the discussion I had with Hakim last night, there seems to be a strong convergence between what you are going for with this data structure and what I have been working on for "dataset" definitions in Arcana v2.

Since I wasn't going for anything general, the YAML definition file I use has some Arcana-specific classes/concepts in it, but I think it covers quite a few of your requirements. It would be cool to make the definition Arcana uses more general and human-understandable, as long as it doesn't complicate things too much downstream.

Here is an example Arcana dataset definition generated by Arcana's unittests. It gets stored with the data in a project-level resource.

class: <arcana.core.data.set:Dataset>
columns:
  a_sink:
    class: <arcana.core.data.column:DataSink>
    datatype: <arcana.data.types.common:Text>
    name: a_sink
    path: a_sink
    pipeline_name: null
    row_frequency: <arcana.core.utils.testing.data.sets:TestDataSpace>[abcd]
    salience: <arcana.core.analysis.salience:ColumnSalience>[supplementary]
  a_source:
    class: <arcana.core.data.column:DataSource>
    datatype: <arcana.data.types.common:Text>
    header_vals: null
    is_regex: false
    name: file1
    order: null
    path: file1
    quality_threshold: null
    row_frequency: <arcana.core.utils.testing.data.sets:TestDataSpace>[abcd]
exclude: []
hierarchy:
- <arcana.core.utils.testing.data.sets:TestDataSpace>[a]
- <arcana.core.utils.testing.data.sets:TestDataSpace>[b]
- <arcana.core.utils.testing.data.sets:TestDataSpace>[c]
- <arcana.core.utils.testing.data.sets:TestDataSpace>[d]
id: /var/folders/mz/yn83q2fd3s758w1j75d2nnw80000gn/T/tmpjes1f2y_/full
id_inference: []
include: []
pipelines: {}
pkg_versions:
  arcana: v0.2b1+84.gc5c455a
space: <arcana.core.utils.testing.data.sets:TestDataSpace>

The columns part is probably the most relevant to what you have described. Note that I have separate "source" and "sink" columns to distinguish between acquired data (e.g. a T1w scan) and derivatives that are "sunk" into session resources (although I/we have plans to develop a custom "derivatives" datatype to store the outputs more nicely).

To address the requirements directly and how they would map onto Arcana features:

As a data user I want:

  • To find the T1w scan because I want to extract information from it

In Arcana, this is covered by "path", "quality-threshold", "header-values" and "order" criteria (see Rows and Columns), which are used to match each "row" in the "data frame". Note that the dataset definition has "include" and "exclude" attributes to allow the user to specify only the rows (e.g. imaging sessions) in which all columns have a matching scan/resource.

NB: "Finding" is only relevant for "source" columns, as for sink columns you specify where the data should be stored and therefore know how to find it later.
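
For illustration, a source column for finding T1w scans might look roughly like the following. This is only a sketch in the same format as the definition above; the path regex, header values, quality threshold and row frequency are made up for the example rather than taken from a real project.

a_t1w:
  class: <arcana.core.data.column:DataSource>
  datatype: <arcana.data.types.common:Text>  # placeholder; an imaging datatype would be used in practice
  name: a_t1w
  path: .*t1w.*                              # regex matched against scan names/types
  is_regex: true
  header_vals:
    SeriesDescription: .*mprage.*            # illustrative DICOM header criterion
  quality_threshold: questionable            # illustrative; skip scans marked below this quality
  order: null
  row_frequency: session                     # illustrative; the example above uses a test data space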

  • To know where and how to store the output of my image analysis workflow so I and others can find it later

See the previous comment re "sink" columns.
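
As a rough sketch (again reusing the format above; the datatype, path and pipeline name are placeholders), a sink column for a derived brain mask could look something like this:

a_brain_mask:
  class: <arcana.core.data.column:DataSink>
  datatype: <arcana.data.types.common:Text>  # placeholder for an imaging datatype
  name: a_brain_mask
  path: derivs/brain_mask                    # where the output is stored within each session
  pipeline_name: brain_extraction            # hypothetical pipeline that produces the output
  row_frequency: session                     # illustrative, as in the source-column sketch
  salience: supplementary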

  • To know what information is stored in this file “ambigousfile.extension” because it might contain relevant information for my research

This is handled by the "datatype" attribute. I spoke with Hakim about trying to merge our parallel efforts to define "data types" (such as different file formats) into a common upstream package.

  • To know how many patients have a T1w scan with a fully annotated brain segmentation by observer Radiologist 1 to extract information from it

I spoke with Hakim about this use case, and there are a couple of ways you could look to handle it within Arcana using extended data spaces, although it would not be trivial and is well outside the scope of this discussion. Creating an extensible "AnnotatedT1w" data type would be the most straightforward way to store/access the data, but picking out Radiologist 1's annotation across the project would require an extended data space with an additional "annotator" dimension.
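
To give a flavour of the idea, an extended data space with an extra "annotator" dimension might be declared along these lines (purely hypothetical; the class path and dimension names are invented):

space: <myproject.data.spaces:AnnotatedClinical>        # hypothetical extended data space
hierarchy:
- <myproject.data.spaces:AnnotatedClinical>[subject]
- <myproject.data.spaces:AnnotatedClinical>[session]
- <myproject.data.spaces:AnnotatedClinical>[annotator]  # e.g. radiologist1, radiologist2

A column defined at the combined session/annotator frequency could then, in principle, pick out Radiologist 1's annotations across the whole project.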

  • To know by who/when/how a “file.extension” was uploaded

How do you envisage this being handled? For "sink" columns I was planning to store full provenance information alongside the data but I note that this is out of scope. It would seem that most of these definitions would sit at the XNAT project level, but this would have to be per-resource.

  • To explore what kind of segmentation files are available for a certain scan/patient

If I understand correctly, this would just be a set of sink columns.

  • To know whether a certain scan session is pre-operative or post-operative

If this is part of the session label (or identifiable by some other simple criteria), these sessions could be sorted into different "timepoints" in the "Clinical" data space.
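
For example, if pre/post-operative status is encoded in the session label (e.g. "SUBJ01_PREOP"), something like the "id_inference" mechanism in the definition above could in principle derive a timepoint ID from it. The syntax below is only indicative of the idea, not the actual format:

id_inference:
- source: session                               # take the session label...
  pattern: '\w+_(?P<timepoint>PREOP|POSTOP)'    # ...and extract a "timepoint" ID from it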

  • To easily be able to upload a folder structure with derived files to my project

As a data controller I want:

  • Define the structure of my data collection, so it is easy to share it with others
  • Define the structure of my data collection, so I can collect summary statistics or aggregates to export to a data catalog in robust, automatic fashion

Arcana has a CLI that enables you to define and save the dataset definitions above.

  • Others to be able to understand the data in my project without me having to explain everything

It would be good to make the dataset definitions more human-readable/transportable.

  • To know if the data collections adhere to the defined data structures and as such are valid, because I need my repository to be adhering to the FAIR principles

I am halfway through writing a CLI command that would scan a project to check whether there are matching scans in each session.

  • To semantically annotate my scans (e.g. DICOM series) and expose this to the users or a catalogue to inform them what a scan contains

Technical requirements :

  • The structure will be defined in a YAML or JSON file
  • Optionally: a JSON schema definition needs to be implemented to do validity checking of the data structure YAML/JSON file
  • The structure file needs to be human readable whilst also being machine parsable, so that it can be used for data structure validity checking, data catalogue exporting, etc.

This sounds good to me. I like YAML although JSON seems more popular in general.
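
On the optional JSON schema point, a fragment of such a schema could look something like the following (written in YAML for readability; the field names are only a sketch of one possible layout, not an agreed format):

$schema: https://json-schema.org/draft/2020-12/schema
title: XNAT data structure definition          # hypothetical title
type: object
required: [columns]
properties:
  columns:
    type: object
    additionalProperties:
      type: object
      required: [datatype, path]
      properties:
        datatype: {type: string}
        path: {type: string}
        is_regex: {type: boolean}

Any standard JSON Schema validator could then check a definition file before it is attached to a project.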

Non-goals / out-of-scope

  • Provenance tracking (although provenance potentially benefits from documenting your data structure)

See the point above re provenance.

In terms of things we need for Arcana that your requirements don't cover, the main one I can think of is the following.

One additional feature we are looking to implement (outside of Arcana), which occurs to me as related, is the ability to check the acquisition parameters of newly added sessions to ensure they match the study protocol. This is a heavily requested feature and would dovetail nicely with a description that could then be used for analysis/export to BIDS, etc. It could probably be done using the "header values" criterion, now that I come to think about it.
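
As a very rough idea of what a protocol description reusing the "header values" mechanism could look like (all field names and values below are invented for illustration):

protocol_checks:
  t1w:
    header_vals:
      SeriesDescription: .*mprage.*
      SliceThickness: "1.0"              # expected acquisition parameters
      MagneticFieldStrength: "3"
    on_mismatch: flag_session            # hypothetical action: flag the new session for review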

hachterberg commented 1 year ago

Thank you @tclose for writing your thoughts down. I think you have some interesting additions (like the quality threshold) that might be useful.

I would be more than happy to work together on a generic datatypes package that we can both use. In fastr, datatypes are very bare-bones by default and almost anything is optional. Maybe we can distill a good model that is generic enough to fit both our needs.
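
For what it's worth, a minimal shared datatype description might not need much more than the following (the field names are just a suggestion, not an existing fastr or Arcana format):

name: nifti_gz
description: Gzipped NIfTI-1 image
extensions: [.nii.gz]
mime_type: application/x-nifti           # illustrative; NIfTI has no officially registered MIME type
optional_sidecars: [.json]               # e.g. a BIDS-style JSON sidecar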

Quoting the point above: "One additional feature we are looking to implement (outside of Arcana), which occurs to me as related, is the ability to check the acquisition parameters of newly added sessions to ensure they match the study protocol."

I once wrote a piece of software that can check XNAT sessions against a protocol by checking the scans and their DICOM headers and matching them to a schema. If you want, I can show it to you some day to see if it is worth dusting off.