Open tclose opened 1 year ago
Thank you @tclose for writing your thoughts down. I think you have some interesting additions (like the quality threshold) that might be useful.
I would be more than happy to work together on a generic datatypes package that we can both use. In fastr, datatypes are very bare-bones by default and almost everything is optional. Maybe we can distill a good model that is generic enough to fit both our needs.
> One additional feature we are looking to implement (outside of Arcana), which occurred to me is related is the ability to check the acquisition parameters of newly added sessions to ensure they match the study protocol.
I once wrote a piece of software that can check XNAT sessions against a protocol by checking the scans and their DICOM headers and matching them against a schema. If you want, I can show it to you some day to see whether it is worth dusting off.
Following on from the discussion I had with Hakim last night, there seems to be a strong convergence between what you are going for with this data structure and what I have been working on for "dataset" definitions in Arcana v2.
Since I wasn't going for anything general, the YAML definition file I use has some Arcana-specific classes/concepts in it, but I think it covers quite a few of your requirements. It would be cool to make the definition Arcana uses more general and human-understandable, just as long as it doesn't complicate things downstream too much.
Here is an example Arcana dataset definition generated by Arcana's unit tests. It gets stored with the data in a project-level resource.
The "columns" part is probably the most relevant to what you have described. Note that I have separate "source" and "sink" columns to distinguish between acquired data (e.g. a T1w scan) and derivatives that are "sunk" into session resources (although I/we have plans to develop a custom "derivatives" datatype to store the outputs more nicely).
To address the requirements directly and how they would map onto Arcana features:
In Arcana, this is covered by "path", "quality-threshold", "header-values" and "order" criteria (see Rows and Columns), which are used to match each "row" in the "data frame". Note that the dataset definition has "include" and "exclude" attributes to allow the user to specify only the rows (e.g. imaging sessions) in which all columns have a matching scan/resource.
NB: "Finding" is only relevant for "source" columns, as for sink columns you specify where the data should be stored and therefore know how to find it later.
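To make the matching idea concrete, here is a minimal Python sketch of how a "source" column could pick one scan per row using the four criteria mentioned above. It is only an illustration and not Arcana's actual API: the `Scan` and `SourceColumn` classes, the `QUALITY_RANK` table and the scan names are all hypothetical.

```python
import re
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical quality ranking (an assumption, not taken from Arcana or XNAT)
QUALITY_RANK = {"unusable": 0, "questionable": 1, "usable": 2}


@dataclass
class Scan:
    path: str
    quality: str = "usable"
    headers: dict = field(default_factory=dict)


@dataclass
class SourceColumn:
    name: str
    path: str                                # regex matched against the whole scan path
    quality_threshold: Optional[str] = None  # minimum acceptable quality label
    header_values: Optional[dict] = None     # header fields that must match exactly
    order: Optional[int] = None              # index used to pick among several matches

    def match(self, scans):
        """Return the one scan in a row (e.g. a session) that meets every criterion."""
        candidates = [s for s in scans if re.fullmatch(self.path, s.path)]
        if self.quality_threshold is not None:
            minimum = QUALITY_RANK[self.quality_threshold]
            candidates = [s for s in candidates if QUALITY_RANK[s.quality] >= minimum]
        if self.header_values:
            candidates = [
                s for s in candidates
                if all(s.headers.get(k) == v for k, v in self.header_values.items())
            ]
        if self.order is not None:
            if self.order >= len(candidates):
                raise LookupError(f"{self.name}: fewer than {self.order + 1} matches")
            return candidates[self.order]
        if len(candidates) != 1:
            raise LookupError(f"{self.name}: expected 1 match, found {len(candidates)}")
        return candidates[0]


# One row of the data frame: the scans of a single imaging session
session = [
    Scan("t1_mprage", "usable", {"EchoTime": 2.98}),
    Scan("t1_mprage_repeat", "questionable", {"EchoTime": 2.98}),
    Scan("dwi", "usable"),
]
t1w = SourceColumn("T1w", path=r"t1_.*", quality_threshold="usable")
matched = t1w.match(session)
```

In this sketch a row with zero or several surviving candidates raises `LookupError`, which is the kind of signal the "include"/"exclude" attributes could use to drop rows that lack a match in every column.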
See previous comment re "sink" columns.
This is handled by the "datatype" attribute. I spoke with Hakim about trying to merge our parallel efforts to define "data types" (such as different file formats) into a common upstream package.
I spoke with Hakim about this use case, and there are a couple of ways you could look to handle this within Arcana using extended data spaces, although it would not be trivial and well outside the scope of this discussion. Creating an "AnnotatedT1w" data type (with extensible) would be the most straightforward way to store/access the data, but picking out Radiologist 1's annotation across the project would require an extended data space with an additional "annotator" dimension.
How do you envisage this being handled? For "sink" columns I was planning to store full provenance information alongside the data but I note that this is out of scope. It would seem that most of these definitions would sit at the XNAT project level, but this would have to be per-resource.
If I understand correctly, this would just be a set of sink columns.
If this is part of the session label (or identifiable by some other simple criteria), these sessions could be sorted into different "timepoints" in the "Clinical" data space.
Arcana has a CLI that enables you to define and save the dataset definitions above.
It would be good to make the dataset definitions more human-readable/transportable.
I am halfway through writing a CLI command that would scan a project to see whether there are matching scans in each session within the project.
This sounds good to me. I like YAML although JSON seems more popular in general.
See point above re provenance.
In terms of things we need for Arcana that your requirements don't cover, the main things I can think of are
One additional feature we are looking to implement (outside of Arcana), which it occurred to me is related, is the ability to check the acquisition parameters of newly added sessions to ensure they match the study protocol. This is a heavily requested feature and could dovetail nicely with a description that could then be used for analysis/export to BIDS, etc. It could probably be done by using the "header values" criterion, now I come to think about it.
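As a rough illustration of that protocol check, the Python sketch below compares the header values of a new session against expected values and lists every discrepancy. The `check_session` function, the layout of `PROTOCOL` and the parameter values in it are assumptions made up for the example; only the header names (`RepetitionTime`, `EchoTime`, `FlipAngle`) are standard DICOM keywords.

```python
import math

# Hypothetical protocol: scan name -> expected header values (illustrative numbers)
PROTOCOL = {
    "t1_mprage": {"RepetitionTime": 2300.0, "EchoTime": 2.98, "FlipAngle": 9.0},
    "dwi": {"RepetitionTime": 8800.0, "EchoTime": 95.0},
}


def check_session(scans, protocol, rel_tol=1e-3):
    """Compare a session against a protocol.

    `scans` maps scan names to header dicts. Returns a list of human-readable
    discrepancies; an empty list means the session matches the protocol.
    Numeric headers are assumed to be stored as numbers, not strings.
    """
    problems = []
    for scan_name, expected in protocol.items():
        headers = scans.get(scan_name)
        if headers is None:
            problems.append(f"{scan_name}: scan missing from session")
            continue
        for key, want in expected.items():
            got = headers.get(key)
            if got is None:
                problems.append(f"{scan_name}: header {key} missing")
            elif isinstance(want, float):
                # Compare numeric parameters with a small relative tolerance
                if not math.isclose(float(got), want, rel_tol=rel_tol):
                    problems.append(f"{scan_name}: {key} is {got}, expected {want}")
            elif got != want:
                problems.append(f"{scan_name}: {key} is {got!r}, expected {want!r}")
    return problems


new_session = {
    "t1_mprage": {"RepetitionTime": 2300, "EchoTime": 2.98, "FlipAngle": 12},
}
report = check_session(new_session, PROTOCOL)
```

Returning the full list of discrepancies rather than stopping at the first one would let a newly added session be reviewed in a single pass.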