Open JoeZiminski opened 4 months ago
Some requests I've had (and some thoughts of my own) include:
I'm not sure if this is a neuroblueprint issue. Perhaps it's a datashuttle one? NeuroBlueprint could be a particular specification. datashuttle will support custom versions for folder creation/validation/transfer, but we won't support them in analysis software etc.
Hmm yes I think it's a bit of both. For #73 this is definitely a datashuttle thing; once we have a metadata standard implemented, I think this would be a natural extension to metadata validation.
For the datatypes, maybe this is a discussion for NeuroBlueprint first (maybe a new issue), as it is a tricky problem, but it should be possible to support existing needs with a combination of datatypes and BIDS-like suffixes. If we can handle these concerns within the spec it would be great. A BIDS-like solution would be to have `fmri` in an `fmri` datatype and introduce `_fusi` and `_2p` as suffixes in `funcimg`. Maybe we change `funcimg` to `funcmicr`, but these would be discussions to have.
In general I would rather put forward a workable standard solution based on BIDS and ask people to try it. If they try it for a project and after 2 weeks say 'this is making my life more difficult and is not workable' then that's a problem. Or, if we have problems with adoption because people are immediately put off, despite us offering tooling for data management and analysis, then we should think about customisation. But for me customisations are a last resort if there are no other workable alternatives and our carrots are not working, as it immediately dilutes standardisation. For example, if we allow customised datatypes, within a year we have "2p", "2photon", "2-PH0T0n", "tP" knocking around and we've not really solved the problem.
I agree with all the above. I think the one thing we will need to do somehow is validate the existence of specific files within the existing NB structure (e.g. metadata).
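A check like this could be sketched roughly as follows. This is only an illustration, not datashuttle's API: the function name `missing_required_files`, the file name `metadata.json` and the rule that every session folder must contain it are all assumptions made for the example.

```python
from pathlib import Path

# Hypothetical rule: every session folder must contain these files.
REQUIRED_SESSION_FILES = ("metadata.json",)


def missing_required_files(rawdata_dir: Path) -> list:
    """Return the paths of required files that are absent from session folders."""
    missing = []
    # Walk sub-*/ses-* folders in a stable order.
    for session_dir in sorted(rawdata_dir.glob("sub-*/ses-*")):
        if not session_dir.is_dir():
            continue
        for filename in REQUIRED_SESSION_FILES:
            if not (session_dir / filename).is_file():
                missing.append(session_dir / filename)
    return missing
```

An empty list would mean every session folder passed the check.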
The more I think about the datatype issue, the more I wonder if the best approach is just to have a different datatype for every conceivable technique out there. This is a large divergence from BIDS, but it would be easy to write converters. The suffix approach makes a lot of sense in MRI, where you have lots of different sequences that are slightly different but are basically `fmri` or `anat`, but I'm not sure it works so well in general.
In systems neuroscience lots of quite different techniques come under `anat` or `confocal`. At the moment I'm working with @viktorpm on the `capsid` project with `cscope`, `ephys`, `2p`, `confocal`.
With the current setup we have something like:
.
└── sub-001/
    ├── ses-001_dtype-cscope/
    │   └── funcimg/
    │       └── (cscope data)
    ├── ses-002_dtype-2p/
    │   └── anat/
    │       └── 2p/
    │           └── (2p data)
    └── ses-003_dtype-confocal/
        └── anat/
            └── confocal/
                └── (confocal data)
So all of the dtype information needs to be promoted to the session name to work around the meaningless anat / funcimg datatype. I think the suffix approach is nice in theory, but in practice I don't think many researchers would want to mix datatypes in the same folder, where tracking them is entirely dependent on adding a suffix to all filenames.
All of these problems would be solved by having more granular datatype names. The only downside I can think of is that sometimes these might be weird (e.g. `ephyse` vs. `ephysi`), but generally these clashes are quite rare and are superficial rather than structural like the current problem. But still definitely a concern. I am also worried about diverging from BIDS in such a big way, although automating conversion between the two will be easy (a converter to change the granular datatype name, e.g. `2P` to `anat`, and append `2P` to all folder / variable names).
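A converter of this kind might look roughly like the sketch below. It is not datashuttle code: the mapping `GRANULAR_TO_BROAD`, the functions `to_bids_like` and `_add_suffix`, and the choice of broad datatype for each granular name are all assumptions for illustration.

```python
from pathlib import PurePosixPath

# Hypothetical mapping from granular datatype names to a broad, BIDS-like datatype.
# The specific assignments are assumptions for illustration only.
GRANULAR_TO_BROAD = {
    "2p": "anat",
    "confocal": "anat",
    "cscope": "funcimg",
}


def _add_suffix(name: str, suffix: str) -> str:
    """Insert `_suffix` before the file extension (or at the end for folders)."""
    ext = "".join(PurePosixPath(name).suffixes)
    stem = name[: len(name) - len(ext)] if ext else name
    return f"{stem}_{suffix}{ext}"


def to_bids_like(path: str) -> str:
    """Rename the granular datatype folder to its broad datatype and append
    the granular name as a suffix to every path component beneath it."""
    parts = PurePosixPath(path).parts
    for i, part in enumerate(parts):
        if part in GRANULAR_TO_BROAD:
            head = parts[:i] + (GRANULAR_TO_BROAD[part],)
            tail = tuple(_add_suffix(p, part) for p in parts[i + 1:])
            return str(PurePosixPath(*head, *tail))
    return path  # no granular datatype found; leave the path unchanged
```

Because the granular name survives as a suffix, the reverse conversion could in principle be automated in the same way.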
I like this approach. I think diverting from BIDS is fine, as long as there is a strong reason for it. NeuroBlueprint was never meant to be exactly the same as BIDS, otherwise it would be BIDS!
Hey @niksirbi what do you think of this? If in agreement, next steps could be to propose a list of datatypes and get feedback on this idea from SWC users. My guess is most would approve, but it is probably worth getting wider feedback before making any changes to the spec.
I've been pondering this for a while and I'm internally torn. I'll try to summarise my thoughts so far:
`anat`, which is an existing BIDS datatype, but we use it to mean something else. We also have `funcimg`, which strictly speaking should be `micr` (for functional microscopy) and `func` (for fMRI). We have `behav` instead of `beh`, and they don't really mean the same thing exactly. My point is, since we are already playing fast and loose with datatype names, there is not much further harm in increasing the number of datatype options.
So to summarise, I'm fine with increasing the number of datatypes, instead of introducing modalities.
Now I come to my biggest concern, which is establishing a list of NeuroBlueprint "datatypes". I'd have no idea how to do that, no two people would agree about what warrants being put in the same "datatype" folder vs in different ones. Any decision we make on that will be largely arbitrary (similarly to distinguishing between datatypes and modalities). The example Joe gave above, already showcases that (I would have probably "split" the data differently).
Perhaps we should take the radically flexible approach of allowing users/labs to pre-specify the list of desired datatypes per project. This could be in the form of a `json` or `yaml` file, where people could specify the datatype short name (the name of the folder, e.g. `behav`), long name (`behaviour`) and a full description of what's supposed to be in there ("Videos of behaving animals"). Later we could even add more fields to this, such as `allowed_file_types` etc. NeuroBlueprint could provide some preset datatypes, but people will be able to define custom ones (including completely overriding NeuroBlueprint's presets).
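To make the idea concrete, here is a rough sketch of how such a per-project file could be read. Nothing here is an existing NeuroBlueprint or datashuttle format: the top-level `datatypes` key, the field names `short_name`, `long_name` and `description`, the `PRESETS` table and the `ecephys` entry are all assumptions; only `allowed_file_types` and the `behav` example come from the text above.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class DatatypeSpec:
    short_name: str                 # folder name, e.g. "behav"
    long_name: str                  # human-readable name, e.g. "behaviour"
    description: str                # what is supposed to go in the folder
    allowed_file_types: tuple = ()  # optional, e.g. (".mp4",)


# Hypothetical presets; a real list would come from the specification.
PRESETS = {
    "behav": DatatypeSpec("behav", "behaviour", "Videos of behaving animals"),
}


def load_project_datatypes(json_text: str) -> dict:
    """Merge project-defined datatypes over the presets (custom entries win)."""
    datatypes = dict(PRESETS)
    for entry in json.loads(json_text)["datatypes"]:
        spec = DatatypeSpec(
            short_name=entry["short_name"],
            long_name=entry["long_name"],
            description=entry["description"],
            allowed_file_types=tuple(entry.get("allowed_file_types", ())),
        )
        datatypes[spec.short_name] = spec
    return datatypes


# Example project file content (illustrative values only).
example = """
{
  "datatypes": [
    {"short_name": "ecephys", "long_name": "extracellular electrophysiology",
     "description": "Extracellular recordings", "allowed_file_types": [".bin"]},
    {"short_name": "behav", "long_name": "behaviour",
     "description": "Videos and response logs"}
  ]
}
"""
project_datatypes = load_project_datatypes(example)
```

In this sketch a project entry with the same short name as a preset replaces it, which corresponds to the "completely overriding" case.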
On the other hand, radical flexibility is at odds with standardisation, which is the whole point of specifications, so as I said, I'm torn.
Ultimately, I trust @JoeZiminski to make the final call on this, since he is the one who best knows what is feasible to implement in `datashuttle`.
Another customisation-related issue, which was not mentioned above, is allowing projects to skip the `session` and/or `datatype` levels if they only have one. BIDS already allows the skipping of `session`. I'd personally want that feature, to avoid over-nesting.
Thanks @niksirbi for that summary, I agree on all points. It is not straightforward, and even small annoyances (e.g. having to call ephys ecephys) could block uptake. One way around having to make many decisions on this could be to merge the concepts of BIDS datatype and modality. We can have 'high-level' datatypes that can be used in most cases, and 'low-level' datatypes that can be used if a user a) wants a more specific datatype name or b) has two different modalities that fit into the same 'high-level' datatype.
The high-level datatypes are what we have already (`anat`, `funcimg`, `behav`, `ephys`). The 'low-level' datatypes are taken from BIDS modalities where possible. This gets around the problem of having to ask people to call their datatype folders something weird if they don't want to, but provides an alternative if necessary. It is also backwards compatible. I think we can still commandeer `anat` for anything anatomy-related and just drop `micr`, shifting all `micr` modalities to `anat`. For example, the section in the spec on datatypes could be:
An example
The datatype folder is where data from different acquisition modalities is put. We define a number of high-level datatypes that should suffice for most use cases:
`ephys`: electrophysiology (e.g. Neuropixel probes, tetrodes)
`behav`: behavioural (e.g. video and audio files, response logs)
`funcimg`: functional imaging (e.g. calcium and voltage imaging)
`anat`: anatomical (e.g. histology, using confocal or lightsheet)
In some cases (`ephys`, `funcimg`, `anat`) these datatype names might be too broad. If you want a more specific folder name for your datatype, or are using multiple techniques that fall under the same high-level datatype (for example, intracellular and extracellular ephys), you can use one of the refined datatype names below.
For reference a list of BIDS datatypes / modalities can be found here. As far as I can tell, only MRI and microscopy really make use of them.
A downside of the above is that data for a given technique could be in one of, at most, two places. From the datashuttle side this is not a problem, as a low-level datatype is just another datatype. For data discovery, I think it should not add too much complexity (if there is no "2pe" folder, check for an "anat" folder).
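The data-discovery fallback could be sketched like this. It is a minimal illustration under stated assumptions: the function `find_datatype_folder` and the `NARROW_TO_BROAD` table are hypothetical, and only the "2pe" to "anat" pairing is taken from the comment above.

```python
from pathlib import Path
from typing import Optional

# Hypothetical mapping from low-level datatype names to their high-level parent.
NARROW_TO_BROAD = {"2pe": "anat", "icephys": "ephys", "ecephys": "ephys"}


def find_datatype_folder(session_dir: Path, narrow: str) -> Optional[Path]:
    """Return the low-level folder if it exists, otherwise the matching
    high-level folder, otherwise None."""
    candidates = [narrow]
    if narrow in NARROW_TO_BROAD:
        candidates.append(NARROW_TO_BROAD[narrow])
    for name in candidates:
        folder = session_dir / name
        if folder.is_dir():
            return folder
    return None
```

The lookup checks at most two folders per session, which matches the "at most two places" point above.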
EDIT:
This does also (possibly?) create a problem if someone is using the same technique for two different purposes. For example, they may be using 2-photon for both functional imaging and anatomy. Then they would have two `2pe` folders but for different use-cases. However, they can just use `anat` or `funcimg` (or both). This would only be a problem if they had two functional imaging and two anat techniques (in which case they could not use the high-level datatype, but would have a clashing low-level datatype, `2pe`). This is an extreme edge case, and actually not a problem, as they could use different session names.
Hey Joe, I like your idea of 'high-level' vs 'low-level' datatypes, because it doesn't break with the current schema and it allows for considerable flexibility. That said, I have some thoughts to share.
Let's not call them high-level vs low-level datatypes, because that implies a nested hierarchical structure, and that's not what we want (we don't want a low-level datatype sub-folder within a high-level datatype folder). I suggest using broad vs narrow datatypes instead, because it nicely captures that their main difference is the breadth of scope.
ephys
I like using `icephys` vs `ecephys` as a good starting point for the narrow `ephys` datatypes (I think NWB also uses the same terms?). That should cover most use-cases, unless someone uses multiple different types of `ecephys`, for example (but I think that's exceedingly rare).
anat
I also like using the `micr` modalities as the narrow datatypes for `anat`, with the possible addition of `mri` for the edge case you mentioned (but I think we should add a warning that if you have lots of MRI data, consider making a separate BIDS-compliant dataset for it). I think the existing `micr` modalities should cover most use-cases. For example, `2PE` and `SPIM` alone would cover most whole-brain data acquired at SWC. We can add more narrow types in future if specific needs arise (or new techniques become popular).
behav
I'm unsure how to define narrow datatypes for `behav`. I could invent some now, anticipating the researchers' needs, but they would be quite arbitrary. Instead, I propose not having narrow `behav` datatypes for now, and introducing them later if and when the need arises (based on "real-world" use-cases).
funcimg
This is a tricky case; defining narrow types for it won't be easy. For example, I think `cscope` is maybe too narrow to be generally useful. The basic trouble is that the various techniques employed for functional imaging differ along at least two axes: which proxy of neural activity they measure, and which type of equipment they use to measure it.
For example, if you split them by what they measure, you'd get something like: fUSI, BOLD-fMRI, fNIRS, ISOI (intrinsic-signal optical imaging).
If you split them by how they are measured, you'd get completely different categories. For example, you'd probably have to distinguish between 2-photon microscopy, miniscopes, fiber photometry, MRI, NIRS, widefield microscopy etc. (essentially recreating some of the narrow datatypes in `anat`).
In general, `funcimg` is a bit of uncharted territory, because BIDS doesn't include it, so there is not much previous work to rely on.
This does also (possibly?) create a problem if someone is using the same technique for two different purposes. For example, they may be using 2-photon for both functional imaging and anatomy. Then they would have two 2pe folders but for different use-cases. However, they can just use anat or funcimg (or both).
I think this can be easily circumvented. If narrow datatypes end up sharing the same name for `anat` and `funcimg`, we could add the prefix `f` or `func` to the latter. For example, we could have `f2PE` instead of `2PE`.
Thanks @niksirbi, I agree 'broad' vs. 'narrow' datatypes are much better names and we can use those going forward. I agree on all datatypes, although for the `cscope` issue I think it is okay to have very specific datatypes. Even if they are infrequently used, if they capture a use case well and do not overlap with any other datatypes, I can't see the harm in including them. Apart from that, I agree it is really not clear how best to name these. Indeed, do we have 'calcium imaging' vs. '2pe' vs. 'gcamp'? How about we proceed by including all datatypes that seem natural, and anywhere it is not clear we will discuss / poll researchers and settle on what feels most natural. The capsid project we are working on with @viktorpm seems a good place to start this, as it contains a few different imaging datatypes. @adamltyson @niksirbi @viktorpm maybe we can meet sometime to discuss this.
I guess the two main aims for the datatype names are to be:
In terms of implementing this, I think it only requires: 1) updating NeuroBlueprint as described above, 2) extending the backend of datashuttle to handle any datatype name (hopefully not too painful), and 3) exposing this in a neat way in the TUI.
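As a rough idea of what handling any datatype name might involve on the validation side, here is a minimal sketch. It does not reflect datashuttle's actual backend: `check_datatype_names` is a hypothetical helper, and the narrow names listed (written lowercase here) are just the examples raised in this thread, not an agreed list.

```python
# Broad datatypes currently in the spec.
BROAD_DATATYPES = {"ephys", "behav", "funcimg", "anat"}
# Hypothetical narrow datatypes, taken from examples discussed in this thread.
NARROW_DATATYPES = {"icephys", "ecephys", "2pe", "spim"}


def check_datatype_names(folder_names, custom=()):
    """Split folder names into (valid, invalid) lists, where valid names are
    broad, narrow, or explicitly allowed custom datatypes."""
    allowed = BROAD_DATATYPES | NARROW_DATATYPES | set(custom)
    valid = [name for name in folder_names if name in allowed]
    invalid = [name for name in folder_names if name not in allowed]
    return valid, invalid
```

Keeping the allowed names in plain sets means that adding a narrow datatype later would be a one-line change in this sketch.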
On the datashuttle/NWB roadmap we have this deliverable:
The level of customisability is a general consideration for specifications (e.g. BIDS 2.0). A benefit of allowing customisations (and automated conversion between them) is that it facilitates adoption and makes researchers' lives easier. A downside is that it can be complex / error-prone to implement and may dilute some of the benefits of standardisation.
It will be useful to discuss the specifics of the kind of customisations that people want and which of them we could support. @adamltyson what kind of requests have you had?