neuroinformatics-unit / NeuroBlueprint

Lightweight data specification for systems neuroscience, inspired by BIDS.
http://neuroblueprint.neuroinformatics.dev/
Creative Commons Attribution 4.0 International

Add high-level metadata to promote data re-use #30

Open adamltyson opened 1 year ago

adamltyson commented 1 year ago

We are agnostic about what form experimental metadata takes. This is to help adoption as labs can save metadata however they like. However, would it be useful to come up with a standard way of recording high level metadata (e.g. species, age, behavioural paradigm etc) in such a way that it can be searched and help promote data re-use?

My idea is something along these lines:

I think it's a nice idea in theory, but harder to implement. Something would be better than nothing though, and it would really need to be part of the data acquisition (very few people will go back and tag historical datasets). It should be part of DataShuttle and any other tools we develop in this space (i.e. could the compression/analysis tools add tags to the metadata?).

adamltyson commented 1 year ago

The BIDS metadata is along the same lines, but I think we need something more flexible and broad in scope.

adamltyson commented 7 months ago

One approach that I've been thinking of is:

In the future some tool could scrape NeuroBlueprint directories for this metadata and put it into a database of some sort. Researchers could then:

adamltyson commented 3 months ago

The more I think about this, the more I think we should promote (but not require) a single low-level metadata spec, like the AIND one. Another option is openminds.

I've been asked about metadata a lot, and it will be useful for us when building analysis pipelines.

adamltyson commented 3 months ago

openminds also has a brain atlas standard which could be used when linking with BrainGlobe tools.

JoeZiminski commented 3 months ago

Below is a write-up of the three options for metadata format (BIDS, Allen, openMINDS) with some thoughts on how to proceed. I had a look for other standards but couldn't find many, though of course we should consider anything available, so let me know if you know of others. An alternative approach would be to use the NWB schema's .yaml files; they are really strong on the time-syncing side, but I think we'd have to extend them to our use case ourselves, so I will ignore this option for now.

Format

BIDS has two metadata formats: .tsv and .json. The .tsv format is used for tables with multiple entries, for example lists of participants, samples, or trial information. Otherwise, .json is used for all other metadata, which are key-value pairs, for example the dataset description and much of the modality-specific metadata (e.g. behaviour or microscopy). In general the metadata key-value pairs are very well documented, and split by modality.
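To make the two-format split concrete, here is a minimal sketch in Python of what each looks like; the column and field names are illustrative, not a complete BIDS participants.tsv or dataset_description.json.

```python
import csv
import io
import json

# BIDS-style tabular metadata (.tsv): one row per entry (here, participants).
tsv = io.StringIO()
writer = csv.writer(tsv, delimiter="\t")
writer.writerow(["participant_id", "species", "age", "sex"])
writer.writerow(["sub-001", "Mus musculus", "P56", "F"])

# BIDS-style key-value metadata (.json), e.g. a dataset description.
description = json.dumps(
    {"Name": "Example dataset", "BIDSVersion": "1.9.0"}, indent=2
)

print(tsv.getvalue())
print(description)
```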

The Allen uses .json for formatting all metadata, and has a really extensive set of metadata entries specified. They provide a very nice dashboard for manual creation of metadata files. The key-value pairs in the schema itself are not particularly well documented; I am not sure what many of them refer to, for example in the 'session' metadata file. Reading their docs, it seems many of these fields are filled automatically during data acquisition / transfer by Allen infrastructure.

openMINDS uses a JSON-LD format (.jsonld). This allows linking across the metadata, which seems powerful but may introduce a lot of dependencies that could be hard to manage in our use case. The introduction guide shows three possible syntaxes you can use; this allows flexibility, but I found it difficult to follow. The metadata documentation is split by modality, and the individual required key-value pairs are pretty well documented.
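As a rough illustration of what JSON-LD linking buys you (this is generic JSON-LD, not the actual openMINDS schema, and the vocabulary URLs and field names are made up): records reference each other by "@id" instead of duplicating content, which is both the power and the dependency problem mentioned above.

```python
import json

# A subject record, identified by its "@id".
subject = {
    "@context": {"@vocab": "https://example.org/vocab/"},
    "@id": "https://example.org/subjects/sub-001",
    "@type": "Subject",
    "species": "Mus musculus",
}

# A session record that links to the subject by "@id" rather than
# embedding or duplicating the subject metadata.
session = {
    "@context": {"@vocab": "https://example.org/vocab/"},
    "@id": "https://example.org/sessions/ses-001",
    "@type": "Session",
    "performedOn": {"@id": "https://example.org/subjects/sub-001"},
}

print(json.dumps(session, indent=2))
```

Resolving the link means looking up the referenced record by its "@id", which is where the cross-file dependencies come from.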

High-level organisation

BIDS

There is a high-level participants.tsv and dataset_description.json. Then, each data file can have a sidecar .json with associated metadata. For different modalities, the metadata to include in these sidecar .json files can be found here. I'm not sure if the information in these sidecar .jsons has to be stored one-per-datafile, or if you can put it at a higher level (e.g. subject, session or experiment) when it is the same for all subjects, sessions, etc. Possibly you can, as they have a nice 'inheritance' principle: if, for example, some metadata is the same for 90/100 animals, you can put the common metadata at a high level and then overwrite it for specific fields lower down (in that example, 10 subjects would have subject-level metadata overwriting some fields of the project-wide metadata). In general, BIDS metadata in a project would look like:

├─ raw//
│  ├─ CHANGES 
│  ├─ README 
│  ├─ channels.tsv 
│  ├─ dataset_description.json 
│  ├─ participants.tsv 
│  └─ sub-001/
│     └─ eeg/
│        ├─ sub-001_task-listening_events.tsv 
│        ├─ sub-001_task-listening_events.json 
│        ├─ sub-001_task-listening_eeg.edf 
│        └─ sub-001_task-listening_eeg.json 
└─ derivatives//
   ├─ descriptions.tsv 
   └─ sub-001/
      └─ eeg/
         ├─ sub-001_task-listening_desc-Filt_eeg.edf 
         ├─ sub-001_task-listening_desc-Filt_eeg.json 
         ├─ sub-001_task-listening_desc-FiltDs_eeg.edf 
         ├─ sub-001_task-listening_desc-FiltDs_eeg.json 
         ├─ sub-001_task-listening_desc-preproc_eeg.edf 
         └─ sub-001_task-listening_desc-preproc_eeg.json 
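The 'inheritance' principle described above can be sketched as a simple merge, where more specific levels win on key conflicts (field names here are illustrative):

```python
# Sketch of BIDS-style metadata "inheritance": sidecar JSONs lower in the
# hierarchy override fields inherited from higher levels.
def resolve(*levels: dict) -> dict:
    """Merge metadata dicts from project level down to file level;
    later (more specific) levels win on key conflicts."""
    merged = {}
    for level in levels:
        merged.update(level)
    return merged

project = {"Species": "Mus musculus", "SamplingFrequency": 30000}
subject = {"SamplingFrequency": 25000}  # overrides the project-wide default

print(resolve(project, subject))
# {'Species': 'Mus musculus', 'SamplingFrequency': 25000}
```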

Allen

The Allen has 6 different metadata files:

The way the metadata is organised was not very intuitive for me; it is not split by modality. The 'Acquisition' file only covers the imaging modality, and I think the other modalities go in 'session', but I'm not entirely sure. A lot is stored in the 'Rig / Instrument' file, but if I say 'I am collecting behavioural or ephys data and I want to look up what fields I should include', it is harder to find; most likely it is in the session metadata file, and implicit based on your acquisition system? For me this is a drawback, and I think it reflects how the Allen is collecting data (on a large scale, with fairly standardised rigs).

They do not have an 'inheritance' principle, but the way they get around this is to put as much static information about the data collection as possible in the 'Instrument' file (see "What's the difference between Rig and Session"). I think this will become a problem for many labs with less standardised setups, with a lot of duplicated metadata.

For the Allen, I'm not sure exactly where these files are supposed to go in terms of a folder structure; maybe this is something we will have to mandate. Most are fairly self-explanatory (e.g. subject in subjects, session in sessions).

openMINDS

I think, but am not 100% sure, that metadata is completely separated from the data. For example, see the folder structure here and the note on linking to data here. I think there are a lot of dependencies between the metadata files through linking; I can see the advantages of this approach, but I'm not sure how well this will translate to NeuroBlueprint.

Supported modalities

All have some form of 'subject' and 'dataset description'. There are divergences in what modalities they support.

BIDS currently supports: MRI, PET, NIRS, EEG, behaviour, microscopy and motion, and there is an ephys BEP. Their behavioural metadata for trials is pretty straightforward, as are 'task events' (but this is more for fMRI). I'm not sure it has anything for timing and sync pulses; this may be awaiting the animal ephys BEP.

Allen currently supports: for sure some microscopy modalities, plus events and timings. I think their events and timings data is stored in NWB and they just mandate the associated metadata; more on this in the section below. They must have support for ephys and behaviour, but it is folded into other sections and harder to find.

openMINDS has ephys and brain atlases, as well as experimental stimulus (e.g. for ephys) and some others. I'm not sure about behaviour data (e.g. trial information); I couldn't see anything obvious. For the stimulus, they have the metadata fields, but it's not clear to me how the timing and sync data are to be stored.

Tooling

I had a good look and couldn't find a comprehensive BIDS management tool for reading and writing metadata across all modalities. There is pybids, for example, but I believe it is more for neuroimaging. They have a lot of BIDS validators, but as far as I could tell that was it, nothing on the writing side; I'm not 100% sure though.

Allen has a lot of really nice tooling: the dashboard for manually creating metadata, as well as a very promising-looking Python writer / reader. I don't think the entire spec (e.g. acquisition, session) is supported yet.

openMINDS does have some tooling (MATLAB and Python), but it is quite young; the example on the repo looks nice. I think this is promising. At the moment their example seems a little verbose, but I think a lot of that is just getting the data from a random format into openMINDS, and once you are in the openMINDS format it looks much simpler. I'm not sure how much of the spec is supported in the tooling; it is under active development.

Pros and Cons of each


BIDS

Pros

Cons

Allen

Pros

Cons

openMINDS

Pros

Cons

What is our use case? Can we support multiple formats?

For me, there is no clear 'winner' among the three metadata formats; each has its own area where it excels, but nothing provides exactly what we want across the board. I'm not sure any of the above are currently in a place where I'd feel happy to strongly recommend that a researcher start using them, i.e. that they could go off and use one without too much confusion and it would cover 90% of their use case.

Therefore I think it is worth looking at how we would need to interact with metadata, and whether we can avoid making a strong recommendation on a standard to use. In that case, researchers can use the one best suited to their needs. We could recommend the most appropriate for use in the SWC (though I'm not sure what this would be) and write a blog covering the pros and cons of these existing metadata initiatives.

The downside of this is we would probably need to interoperate with all recommended specs in our analysis packages / datashuttle. The below explores what our 'points of contact' with metadata will be.

Reading metadata: This is something we may have to do in various modality-specific analysis tools, in particular for timings and sync pulses (I will have a section on that key case below). Otherwise, for ephys, there is not a lot of acquisition-related metadata to read; anything pertinent (e.g. sampling rate) is already handled by SpikeInterface or can easily be passed by the user. I'm not sure for behaviour? For microscopy, Adam suggested (orientation, voxel size, species, organ, imaging type). In general, I think the parameter sets we will need to read will be fairly minimal and not too painful to handle from three separate metadata standards (it will basically be loading some .jsons and mapping the names which refer to the same metadata across schemas).
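The "mapping the names across schemas" step could look something like the following sketch; the schema-specific key names below are illustrative placeholders, not the real BIDS/AIND/openMINDS field names.

```python
# Map equivalent keys from each schema onto one internal name, so that
# analysis tools can read whichever format a lab has chosen.
# NOTE: the per-schema key names are hypothetical examples.
KEY_MAP = {
    "sampling_rate": {
        "bids": "SamplingFrequency",
        "aind": "sampling_rate_hz",
        "openminds": "samplingFrequency",
    },
}

def read_field(metadata: dict, internal_name: str, schema: str):
    """Look up one internal field in a metadata dict loaded from a .json
    written against the given schema."""
    return metadata.get(KEY_MAP[internal_name][schema])

bids_meta = {"SamplingFrequency": 30000}
print(read_field(bids_meta, "sampling_rate", "bids"))  # 30000
```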

Writing metadata: This is a bit trickier, though I'm not sure how much we will need to do it. For datashuttle, we may want to write subject metadata and dataset descriptions. This would be a bit of a pain supporting three schemas, but again not awful, as we would basically collate all the information we want to write, then map to the keys of the metadata standards and write as .json / .jsonld. This would only have to be done once. It would be more of a problem if we decided we wanted to manage metadata more extensively, e.g. writing metadata from raw data files (microscopy, ephys, behaviour). I'm not sure we want to go down the route of full metadata management though; if we did, it would be a big initiative.
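The write direction is the same mapping in reverse: collate what we hold internally, then rename the keys per target schema before serialising. Again, the per-schema key names here are hypothetical (participant_id is a real BIDS column, the rest are placeholders).

```python
import json

# Internal record collated by a tool such as datashuttle.
internal = {"subject_id": "sub-001", "species": "Mus musculus"}

# Internal name -> schema-specific key name (illustrative placeholders).
WRITE_MAPS = {
    "bids": {"subject_id": "participant_id", "species": "species"},
    "aind": {"subject_id": "subject_id", "species": "species"},
}

def to_schema(record: dict, schema: str) -> str:
    """Serialise an internal record as .json under one target schema."""
    mapped = {WRITE_MAPS[schema][key]: value for key, value in record.items()}
    return json.dumps(mapped, indent=2)

print(to_schema(internal, "bids"))
```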

Events, timing and syncing data

This is one area where we will have to interact in detail with whatever metadata standard we pick. I've tried to summarise as best as possible the requirements of each schema as I understand them. All three are of course metadata standards and, as far as I can tell, don't specify how to write the data itself. I think in general we can be quite flexible here (e.g. binary, numpy, csv); they are all 1D timeseries, so it is not going to be too complex.
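Since the data itself is just a 1D series of timestamps, storage really can be as simple as the following (a plain .csv sketch with illustrative values; .npy or raw binary would work equally well):

```python
import csv
import io

# Sync pulse times in seconds: a plain 1-D timeseries (illustrative values).
sync_pulse_times_s = [0.0, 0.0333, 0.0667, 0.1000]

# Write as a one-column .csv; the metadata standard only needs to describe
# this file, not define its format.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["timestamp_s"])
for t in sync_pulse_times_s:
    writer.writerow([t])

print(buf.getvalue())
```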

BIDS

As far as I can tell, this has no schema for systems-neuroscience-specific time data (e.g. sync pulses); it is due for inclusion in the animal ephys BEP. Closely related concepts are the task events and behaviour. I think this covers behavioural stimuli well, but doesn't really address the time-syncing issue for systems neuroscience (?).

Allen

Allen has what seems to be a well-developed method for handling events data, described here and in the spec here. It looks good, but honestly I would not really know how to go about using it. I guess you write the various time data to disk and have each represented in the session.json file under one of the pydantic classes. This highlights an issue with the Allen scheme: if your acquisition system does not fall under one of their pre-defined classes, you are kind of stuck.
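For illustration, the general pattern (typed records in the session metadata referencing time data written to disk) might look like the dataclass sketch below. These are NOT the real aind-data-schema pydantic classes; all names and fields are hypothetical.

```python
from dataclasses import asdict, dataclass

@dataclass
class Stream:
    """One acquisition stream represented in the session metadata
    (hypothetical class, stands in for a schema-defined pydantic model)."""
    stream_type: str      # e.g. "ephys", "behaviour_video"
    timestamps_file: str  # path to the 1-D time data written to disk

@dataclass
class Session:
    session_id: str
    streams: list

session = Session("ses-001", [Stream("ephys", "sync/ephys_times.npy")])
print(asdict(session))
```

The limitation noted above shows up here: if your acquisition system has no matching class, there is nowhere obvious to put its stream.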

openMINDS

openMINDS has a stimulation metadata section. I'm not sure how the data itself should actually be stored. It looks nice for ephys stimulus timings, but I couldn't find where behavioural trial data is supported.

Our own schema for this?

I think we need to survey what people are doing in the building and possibly introduce our own schema for this, based on what people are already doing and as lightweight as possible. We should support the other metadata schemas where relevant, but I think all we really need are event timestamps mapped to each modality, plus a table of stimulus or behavioural-event trial information.
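To make "all we really need" concrete, a minimal in-house schema along these lines could be as small as the following (illustrative only, not an agreed spec; all field names are up for discussion):

```python
import json

# Hypothetical minimal events schema: per-modality event timestamps plus a
# trial table, and nothing more.
events = {
    "modality": "ephys",
    "clock": "acquisition_pc",  # which clock the timestamps are on
    "sync_pulse_times_s": [0.0, 1.0, 2.0],
    "trials": [
        {"trial": 1, "stimulus": "grating", "onset_s": 0.5, "offset_s": 1.5},
        {"trial": 2, "stimulus": "blank", "onset_s": 2.5, "offset_s": 3.5},
    ],
}

print(json.dumps(events, indent=2))
```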

My thoughts

Unfortunately I don't think any of the schemas fully meet all our (extensive, unrealistic?) requirements: a flexible, intuitive and very well documented metadata standard for systems neuroscience that incorporates all our modalities of interest.

So I guess my mild preference is to write a blog on these metadata standards, their pros and cons, and how you could get started with them in NeuroBlueprint. Then we say any of them are allowed, as long as you use them consistently.

This means we would also need to support all three in our tools. Because we are only ever planning on actually mandating a small subset of the most relevant metadata, this might be OK, as discussed above. Maybe in future, as these specifications progress and there is a clear 'winner', we can mandate a particular approach. But at present, I'm not sure any are 100% sufficient for all the use cases we will come across.

Before making any decision, we should definitely survey the building for: 1) what metadata people are currently collecting; 2) if they are, in what format; 3) how people are collecting timings and sync data; 4) what kind of metadata people would like to see standardised.