healthri-imaging / xnat-data-structure

XNAT Data Structure specification

Thoughts on linkage with Arcana "data columns" #1

Open tclose opened 1 year ago

tclose commented 1 year ago

Following on from the discussion I had with Hakim last night, there seems to be a strong convergence between what you are going for with this data structure and what I have been working on for "dataset" definitions in Arcana v2.

Since I wasn't going for anything general, the YAML definition file I use has some Arcana-specific classes/concepts in it, but I think it covers quite a few of your requirements. It would be cool to make the definition Arcana uses more general and human-understandable, as long as it doesn't complicate things too much downstream.

Here is an example Arcana dataset definition generated by Arcana's unittests. It gets stored with the data in a project-level resource.

class: <arcana.core.data.set:Dataset>
columns:
  a_sink:
    class: <arcana.core.data.column:DataSink>
    datatype: <arcana.data.types.common:Text>
    name: a_sink
    path: a_sink
    pipeline_name: null
    row_frequency: <arcana.core.utils.testing.data.sets:TestDataSpace>[abcd]
    salience: <arcana.core.analysis.salience:ColumnSalience>[supplementary]
  a_source:
    class: <arcana.core.data.column:DataSource>
    datatype: <arcana.data.types.common:Text>
    header_vals: null
    is_regex: false
    name: file1
    order: null
    path: file1
    quality_threshold: null
    row_frequency: <arcana.core.utils.testing.data.sets:TestDataSpace>[abcd]
exclude: []
hierarchy:
- <arcana.core.utils.testing.data.sets:TestDataSpace>[a]
- <arcana.core.utils.testing.data.sets:TestDataSpace>[b]
- <arcana.core.utils.testing.data.sets:TestDataSpace>[c]
- <arcana.core.utils.testing.data.sets:TestDataSpace>[d]
id: /var/folders/mz/yn83q2fd3s758w1j75d2nnw80000gn/T/tmpjes1f2y_/full
id_inference: []
include: []
pipelines: {}
pkg_versions:
  arcana: v0.2b1+84.gc5c455a
space: <arcana.core.utils.testing.data.sets:TestDataSpace>

The columns part is probably the most relevant to what you have described. Note that I have separate "source" and "sink" columns to distinguish between acquired data (e.g. a T1w scan) and derivatives that are "sunk" into session resources (although I/we have plans to develop a custom "derivatives" datatype to store the outputs more nicely).

To address the requirements directly and how they would map onto Arcana features:

As a data user I want:

  • To find the T1w scan because I want to extract information from it

In Arcana, this is covered by "path", "quality-threshold", "header-values" and "order" criteria (see Rows and Columns), which are used to match each "row" in the "data frame". Note that the dataset definition has "include" and "exclude" attributes to allow the user to specify only the rows (e.g. imaging sessions) in which all columns have a matching scan/resource.

NB: "Finding" is only relevant for "source" columns, as for sink columns you specify where the data should be stored and therefore know how to find it later.
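
For illustration, a source column for finding T1w scans might look roughly like the following. This is only a sketch in the same format as the definition above; the path regex, header values, quality threshold and row frequency are made up for the example rather than taken from a real project.

a_t1w:
  class: <arcana.core.data.column:DataSource>
  datatype: <arcana.data.types.common:Text>  # placeholder; an imaging datatype would be used in practice
  name: a_t1w
  path: .*t1w.*                              # regex matched against scan names/types
  is_regex: true
  header_vals:
    SeriesDescription: .*mprage.*            # illustrative DICOM header criterion
  quality_threshold: questionable            # illustrative; skip scans marked below this quality
  order: null
  row_frequency: session                     # illustrative; the example above uses a test data space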

  • To know where and how to store the output of my image analysis workflow so I and others can find it later

See the previous comment re "sink" columns.
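
As a rough sketch (again reusing the format above; the datatype, path and pipeline name are placeholders), a sink column for a derived brain mask could look something like this:

a_brain_mask:
  class: <arcana.core.data.column:DataSink>
  datatype: <arcana.data.types.common:Text>  # placeholder for an imaging datatype
  name: a_brain_mask
  path: derivs/brain_mask                    # where the output is stored within each session
  pipeline_name: brain_extraction            # hypothetical pipeline that produces the output
  row_frequency: session                     # illustrative, as in the source-column sketch
  salience: supplementary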

  • To know what information is stored in this file “ambigousfile.extension” because it might contain relevant information for my research

This is handled by the "datatype" attribute. I spoke with Hakim about trying to merge our parallel efforts to define "data types" (such as different file formats) into a common upstream package.

  • To know how many patients have a T1w scan with a fully annotated brain segmentation by observer Radiologist 1 to extract information from it

I spoke with Hakim about this use case, and there are a couple of ways you could look to handle it within Arcana using extended data spaces, although it would not be trivial and is well outside the scope of this discussion. Creating an extensible "AnnotatedT1w" data type would be the most straightforward way to store/access the data, but picking out Radiologist 1's annotation across the project would require an extended data space with an additional "annotator" dimension.
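
To give a flavour of the idea, an extended data space with an extra "annotator" dimension might be declared along these lines (purely hypothetical; the class path and dimension names are invented):

space: <myproject.data.spaces:AnnotatedClinical>        # hypothetical extended data space
hierarchy:
- <myproject.data.spaces:AnnotatedClinical>[subject]
- <myproject.data.spaces:AnnotatedClinical>[session]
- <myproject.data.spaces:AnnotatedClinical>[annotator]  # e.g. radiologist1, radiologist2

A column defined at the combined session/annotator frequency could then, in principle, pick out Radiologist 1's annotations across the whole project.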

  • To know by who/when/how a “file.extension” was uploaded

How do you envisage this being handled? For "sink" columns I was planning to store full provenance information alongside the data but I note that this is out of scope. It would seem that most of these definitions would sit at the XNAT project level, but this would have to be per-resource.

  • To explore what kind of segmentation files are available for a certain scan/patient

If I understand correctly, this would just be a set of sink columns.

  • To know whether a certain scan session is pre-operative or post-operative

If this is part of the session label (or identifiable by some other simple criteria), these sessions could be sorted into different "timepoints" in the "Clinical" data space.
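
For example, if pre/post-operative status is encoded in the session label (e.g. "SUBJ01_PREOP"), something like the "id_inference" mechanism in the definition above could in principle derive a timepoint ID from it. The syntax below is only indicative of the idea, not the actual format:

id_inference:
- source: session                               # take the session label...
  pattern: '\w+_(?P<timepoint>PREOP|POSTOP)'    # ...and extract a "timepoint" ID from it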

  • To easily be able to upload a folder structure with derived files to my project

As a data controller I want:

  • Define the structure of my data collection, so it is easy to share it with others
  • Define the structure of my data collection, so I can collect summary statistics or aggregates to export to a data catalog in robust, automatic fashion

Arcana has a CLI that enables you to define and save the dataset definitions above.

  • Others to be able to understand the data in my project without me having to explain everything

It would be good to make the dataset definitions more human-readable/transportable.

  • To know if the data collections adhere to the defined data structures and as such are valid, because I need my repository to be adhering to the FAIR principles

I am halfway through writing a CLI command that would scan a project to check whether there are matching scans in each session.

  • To semantically annotate my scans (e.g. DICOM series) and expose this to the users or a catalogue to inform them what a scan contains

Technical requirements :

  • The structure will be defined in a YAML or JSON file
  • Optionally: a JSON schema definition needs to be implemented to do validity checking of the data structure YAML/JSON file
  • The structure file needs to be human readable whilst also being machine parsable, so that it can be used for data structure validity checking, data catalogue exporting, etc.

This sounds good to me. I like YAML although JSON seems more popular in general.
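
On the optional JSON schema point, a fragment of such a schema could look something like the following (written in YAML for readability; the field names are only a sketch of one possible layout, not an agreed format):

$schema: https://json-schema.org/draft/2020-12/schema
title: XNAT data structure definition          # hypothetical title
type: object
required: [columns]
properties:
  columns:
    type: object
    additionalProperties:
      type: object
      required: [datatype, path]
      properties:
        datatype: {type: string}
        path: {type: string}
        is_regex: {type: boolean}

Any standard JSON Schema validator could then check a definition file before it is attached to a project.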

Non-goals / out-of-scope

  • Provenance tracking (although provenance potentially benefits from documenting your data structure)

See the point above re provenance.

In terms of things we need for Arcana that your requirements don't cover, the main one I can think of is the following.

One additional feature we are looking to implement (outside of Arcana), which occurs to me as related, is the ability to check the acquisition parameters of newly added sessions to ensure they match the study protocol. This is a heavily requested feature and would dovetail nicely with a description that could then be used for analysis/export to BIDS, etc. It could probably be done using the "header values" criterion, now that I come to think about it.
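
As a very rough idea of what a protocol description reusing the "header values" mechanism could look like (all field names and values below are invented for illustration):

protocol_checks:
  t1w:
    header_vals:
      SeriesDescription: .*mprage.*
      SliceThickness: "1.0"              # expected acquisition parameters
      MagneticFieldStrength: "3"
    on_mismatch: flag_session            # hypothetical action: flag the new session for review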

hachterberg commented 1 year ago

Thank you @tclose for writing your thoughts down. I think you have some interesting additions (like the quality threshold) that might be useful.

I would be more than happy to work together on a generic datatypes package that we can both use. In fastr, datatypes are very bare-bones by default and almost anything is optional. Maybe we can distill a good model that is generic enough to fit both our needs.
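
For what it's worth, a minimal shared datatype description might not need much more than the following (the field names are just a suggestion, not an existing fastr or Arcana format):

name: nifti_gz
description: Gzipped NIfTI-1 image
extensions: [.nii.gz]
mime_type: application/x-nifti           # illustrative; NIfTI has no officially registered MIME type
optional_sidecars: [.json]               # e.g. a BIDS-style JSON sidecar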

Quoting the point above: "One additional feature we are looking to implement (outside of Arcana), which occurs to me as related, is the ability to check the acquisition parameters of newly added sessions to ensure they match the study protocol."

I once wrote a piece of software that can check XNAT sessions against a protocol by checking the scans and their DICOM headers and matching them to a schema. If you want, I can show it to you some day to see if it is worth dusting off.