Discussion on customisation

neuroinformatics-unit / NeuroBlueprint

Lightweight data specification for systems neuroscience, inspired by BIDS.

http://neuroblueprint.neuroinformatics.dev/

Creative Commons Attribution 4.0 International

17 stars 1 forks source link

Discussion on customisation #57

Open JoeZiminski opened 4 months ago

JoeZiminski commented 4 months ago

On the datashuttle/NWB roadmap we have this deliverable:

Support for customising data structure (e.g. specific mandatory elements or new file types)

The level of customisability is a general consideration for specifications (e.g. BIDS 2.0). A benefit of allowing customisations (and automated conversion between them) facilitates adoption and makes researchers lives easier. A downside is that it can be complex / error prone to implement and may dilute some of the benefits of standardisation.

It will be useful to discuss the specifics of the kind of customisations that people want can which we could support. @adamltyson what kind of requests have you had?

adamltyson commented 4 months ago

Some requests I've had (and some thoughts of my own) include:

Specify new datatypes - if a lab did fMRI, fUSI and 2p, they may not want to lump them all in together
Requirements for specific files https://github.com/neuroinformatics-unit/datashuttle/issues/73

I'm not sure if this is a neuroblueprint issue. Perhaps it's a datashuttle one? NeuroBlueprint could be a particular specification. datashuttle will support custom versions for folder creation/validation/transfer, but we won't support them in analysis software etc.

JoeZiminski commented 4 months ago

Hmm yes I think it's a bit of both. For #73 this is definitely a datashuttle thing, once we have a metadata standard implemented I think this would be a natural extension to metadata validation.

For the datatypes, maybe this is a discussion for NeuroBlueprint first (maybe a new issue) as it is tricky problem but should be possible to support existing needs with a combination of datatypes and employing BIDS-like suffixes. If we can handle these concerns within the spec it would be great, a BIDS-like solution would be to have fmri in an fmri datatype and introduce _fusi and _2p as suffixes in funcimg. Maybe we change funcimg to funcmicr, but these would be discussions to have.

In general I would rather put forward a workable standard solution based on BIDS and ask people to try it. If they try it for a project and after 2 weeks say 'this is making my life more difficult and is not workable' then that's a problem. Or, if we have problems with adoption because people are immediately put off, despite us offering tooling for data management and analysis, then we should think about customisation. But for me customisations are a last resort if there are no other workable alternatives and our carrots are not working, as it immediately dilutes standardisation. For example if we allow customised datatypes, within a year we have "2p", "2photon", "2-PH0T0n", tP", knocking around and we've not really solved the problem.

adamltyson commented 4 months ago

I agree with all the above. I think the one thing we will need to do somehow is validate the existance of specific files within the existing NB structure (e.g. metadata).

JoeZiminski commented 4 months ago

The more I think about the datatype issue, I wonder if the best approach is just to have a different datatype for every conceivable technique out there. This is a large divergence from BIDS, but it would be easy to write converters. The suffix approach makes a lot of sense in MRI where you have lots of different sequences that are slightly different but are basically fmri or anat but I'm not sure it works so well in general.

In systems neuroscience lots of quite different techniques come under anat or confocal. At the moment I'm working with @viktorpm on the capsid project with cscope, ephys, 2p, confocal.

With the current setup we have somethings like:

.
└── sub-001/
    ├── ses-001_dtype-cscope/
    │   └── funcimg/
    │       └── (cscope data)
    ├── ses-002_dtype-2p/
    │   └── anat/
    │       └── 2p/
    │           └── (2p data)
    └── ses-003_dtype-confocal/
        └── anat/
            └── confocal/
                └── (confocal data)

So all of the dtype information needs to be promoted to the session name to work around the meaningless anat / funcimg datatype. I think the suffix approach is nice in theory but in practice, I don't think many reserachers would want to mix datatypes in the same folder, where tracking them is entirely dependent on adding a suffix to all filenames.

All of these problems will be solved by having more granular datatype names. The only downside I can think of is sometimes these might be weird (e.g. ephyse vs. ephysi) but generally these clashes are quite rare and are superficial vs. structural like the current problem. But still definately a concern. I am also worried about diverging with BIDS in such a big way, although automating between these will be easy (converter to change the granular datatype name e.g. 2P to anat and append 2P to all folder / variable names).

adamltyson commented 4 months ago

I like this approach. I think diverting from BIDS is fine, as long as there is a strong reason for it. NeuroBlueprint was never meant to be exactly the same as BIDS, otherwise it would be BIDS!

JoeZiminski commented 4 months ago

Hey @niksirbi what do you think of this? If in agreement next steps could be to propose a list of datatypes and get feedback on this idea from SWC users. My guess is most would approve but probably worth getting wider feedback before making any chances to the spec.

niksirbi commented 4 months ago

I've been pondering this for a while and I'm internally torn. I'll try to summarise my thoughts so far:

NeuroBlueprint already diverges significantly from BIDS at the datatype level. For example we have anat, which is an existing BIDS datatype, but we use it to mean something else. We also have funcimg which strictly speaking should be micr (for funcitonal microscopy) and func (for fMRI). We have behav instead of beh, and they don't really mean the same thing exactly. My point is, since we are already playing fast and loose with datatype names, there is not much further harm by increasing the number of datatype options.
The BIDS distinction between datatypes (subfolder below session) and modalities (specified via the suffix), makes sense for MRI (where you have sequences), but not so much for techniques used in systems neuroscience. We could keep this structure and use modality as a "sub-datatype", a way to increase the granularity of a datatype via the suffix. However I agree with @JoeZiminski that this may end up confusing people even more. So I see the sense of ditching modalities and just implementing more datatypes to begin with. Anyway, the distinction between what warrants being a fully-fledged datatype vs a mere modality is largely a subjective one, a bit like deciding whether something is a "language" or a "dialect".
If people are doing fMRI/EEG/MEG, i.e. techniques for which BIDS is long-established and widely used, we should just advise them to use BIDS, not NeuroBlueprint. The difficulty arises when the above techniques are combined with some "systems neuro" techniques in the same project. Doesn't happen often, but may become more common. In that case, there may be benefit to putting the fMRI/EEG/MEG data into the NeuroBlueprint project, via extra datatypes. Personally, that's not what I would do though. I would create two "projects": 1 BIDS for BIDS-y datatypes, and 1 NeuroBlueprint for the NeuroBlueprint-y datatypes. I would just make sure to use matching subject/session names to establish correspondence between the two projects. All that to say, I don't see much value in us worrying about methods like fMRI, because there are already established standards for that. We should cover fUSI though.

So to summarise, I'm fine with increasing the number of datatypes, instead of introducing modalities.

Now I come to my biggest concern, which is establishing a list of NeuroBlueprint "datatypes". I'd have no idea how to do that, no two people would agree about what warrants being put in the same "datatype" folder vs in different ones. Any decision we make on that will be largely arbitrary (similarly to distinguishing between datatypes and modalities). The example Joe gave above, already showcases that (I would have probably "split" the data differently).

Perhaps we should take the radically flexible approach to allow users/labs to pre-specify the list of desired datatypes per project. This could be in the form of a json or yaml file, where poeple could specify the datatype short name (the name of the folder, e.g. behav), long name (behaviour) and a full description of what's supposed to be in there ("Videos of behaving animals"). Later we could even add more fields to this, such as allowed_file_types etc. NeuroBlueprint could provide some preset datatypes, but people will be able to define custom ones (including completely overriding NeuroBlueprint's presets).

On the other hand, radical flexibility is at odds with standardisation, which is the whole point of specifications, so as I said, I'm torn. Ultimately, I trust @JoeZiminski to make the final call on this, since he is the one who best knows what is feasible to implement in datashuttle.

niksirbi commented 4 months ago

Another customisation-related issue which was not mentioned above, is allowing projects to skip the session and/or datatype levels, if they only have one. BIDS already allows the skipping of session. I'd personally want that feature, to avoid over-nesting.

JoeZiminski commented 3 months ago

Thanks @niksirbi for that summary I agree on all points. It is not straightforward and even small annoyances (e.g. having to call ephys ecephys) could block uptake. One way around having to make many decisions on this could be to merge the concept of BIDS datatype and modality. We can have 'high level' datatypes that can be used use in most cases, and 'low level datatypes' that can be used if a user a) wants a more specific datatype name b) have two different modalities that fit into the same 'high level datatype'.

The high-level datatypes are what we have already (anat, funcimg, behav, ephys). The 'low-level' datatypes are taken from BIDS modalities where possible. This gets around the problem of having to ask people to call their data-type folders something weird if they dont want to, but provdes an alternative if necessary. It is also backwards compatible. I think we can still commandeer anat for anything anatomical related and just drop the micr and shift all micr modalities to anat. For example, the section in the spec on datatypes could be:

An example

The datatype folder is where data from different acquisition modalities is put. We define a number of high-level datatypes that should suffice for most use cases:

ephys: electrophysiology (e.g. Neuropixel probes, tetrodes) behav: behavioural (e.g. video and audio files, response logs) funcimg: functional imaging (e.g. calcium and voltage imaging) anat: anatomical (e.g. histology, using confocal or lightsheet)

In some cases (ephys, funcimg, anat) these datatype names might be too broad. If you want a more spefic folder name for your datatype, or are using multiple techniques that fall under the same high-level datatype (for example, intracellular and extracellular ephys), you can use one of the refined datatype names below.

Refined datatype names

You can replace the high-level datatype name with one of the refined datatype names below. If a refined datatype name is used, the corresponding high-level datatype must not be used. **ephys** `icephys`: intracellular electrophysiology `ecephys`: extracellular electrophysiology **anat** Maybe we can introduce a `mri` low-level datatype, just in case somone is collecting `mri` anatomy + some other imaging data modality. We probably don't need all of these that BIDS defines, but if there is overlap we can use their abbreviation? ![image](https://github.com/neuroinformatics-unit/NeuroBlueprint/assets/55797454/aa74c82f-b7ad-4284-acad-e0af7f1cd2d7) + fusi **funcimg** `cscope`

JoeZiminski commented 3 months ago

For reference a list of BIDS datatypes / modalities can be found here. As far as I can tell, only MRI and microscopy really make use of them.

A downside of the above is it does mean data could possibly be, at maximum, in two places. From the datashuttle aspect it is not a problem, low-level datatypes is just another datatype. For data-discovery, I think it should not add too much complexity (if there is no "2pe" folder check for an "anat" folder).

EDIT: This does also (possibly?) create a problem if somone is using the same technqiue for for two difference purposes. For example they may be using 2-photon for both functional imaging and anatomy. Then they would have two 2pe folders but for different use-cases. However, they can just use anat or funcimg (or both). This would only be a problem if they had 2 functional imaging and two anat techniques (in which cause they'd have to not use the high-level datatype, but would have a clashing low level datatype, 2pe). This is an extreme edge case, and actually not a problem as could use different sessions names.

niksirbi commented 3 months ago

Hey Joe, I like your idea of 'high-level' vs 'low-level' datatypes, because it doesn't break with the current schema and it allows for considerable flexibility. That said, I have some thoughts to share.

Broad vs Narrow datatypes

Let's not call them high-leve vs low-level datatypes, because that implies a nested hierarchical structure, and that's not what we want (we don't want a low-level datatype sub-folder within a high-level datatype folder). I suggest using broad vs narrow datatypes instead, because it nicely captures that their main difference is the breadth of scope.

Potential narrow datatypes for each broad datatype

`ephys`

I like using icephys vs ecephys as a good starting point for the narrow ephys datatypes (I think NWB also uses the same terms?). That should cover most use-cases, unless someone uses multiple different types of ecephys for example (but I think that's exceedingly rare).

`anat`

I also like using the micr modalities as the narrow datatypes for anat, with the possible addition of mri for the edge case you mentioned (but I think we should add a warning that if you have lots of MRI data, consider making a separate BIDS-compliant dataset for it). I think the existing micr modalities should cover most use-cases. For example, 2PE and SPIM alone would cover most whole-brain data acquired at SWC. We can add more narrow types in future if specific needs arise (or new techniques become popular).

`behav`

I'm unsure how to define narrow datatypes for behav. I could invent some now, anticipating the researchers' needs, but they would be quite arbitrary. Instead, I propose not having narrow behav datatypes for now, and introduce them later if and when the need arises (based on "real-world" use-cases).

`funcimg`

This is a tricky case, defining narrow types for it won't be easy. For example, I think cscope is maybe too narrow to be generally useful. The basic trouble is that the various techniques employed for functional imaging differ along at least two axes: which proxy of neural activity they measure, and which type of equipment they use to measure it.

For example, if you split them by what they measure, you'd get something like:

hemodynamic (based on BOLD / blood flow / blood volume): fUSI, BOLD-fMRI, fNIRS, ISOI (intrinsic-signal optical imaging).
calcium
voltage

If you split them by how they are measured, you'd get completely different categories. For example you'd probably have to distinguish between 2-photon microscopy, miniscopes, fiber photometry, MRI, NIRS, widefield microscopy etc (essentially recreating some of the narrow datatypes in anat.

In general, funcimg is a bit of uncharted territory, because BIDS doesn't include it, so there is not much previous work to rely on.

Name collisions between narrow datatypes

This does also (possibly?) create a problem if somone is using the same technqiue for for two difference purposes. For example they may be using 2-photon for both functional imaging and anatomy. Then they would have two 2pe folders but for different use-cases. However, they can just use anat or funcimg (or both).

I think this can be easily circumvented. If narrow datatypes end up sharing the same name for anat and funcimg, we could add the prefix f or func to the latter. For example, we could have f2PE instead ot 2PE.

JoeZiminski commented 3 months ago

Thanks @niksirbi I agree 'Broad' vs. 'Narrow' datatypes is much better names and we can use those going forward. I agree on all datatypes, although for the cscope issue I think it is okay to have very specific datatypes. Even if they are infrequently used, if they well-capture a use case and do not overlap with any other datatypes I can't see the harm in including them. Apart from that I agree it is really not clear how best to name these. Indeed do we have 'calcium imaging' vs. '2pe' vs. 'gcamp'? How about we proceed by including all datatypes that seem natural, and anywhere it is not clear we will discuss / poll researchers and settle on what feels most natural. The capsid project we are working on with @viktorpm seems a good place to start this as it contains a few different imaging datatypes. @adamltyson @niksirbi @viktorpm maybe we can meet sometime to discuss this.

I guess the two main aims for the datatype names is to be:

specific enough for researchers to distinguish datatypes with little overlap
general enough to ease discoverability. Although, I think this is less of a problem because if we maintain a canonical list of datatypes this can be used for all discoverability issues.

In terms of implementing this I think it only requires: 1) updating the NeuroBlueprint as described above 2) extending the backend of datashuttle to handle any datatype name (hopefully not too painful) 3) expose this in a neat way in the TUI.