File naming conventions

JoeZiminski commented 1 year ago

There has been some discussion here on the best file naming convention. Should the sub/ses info be at the file level? e.g.

└── project_name/
    └── raw_data/
        └── sub-001/
            └── ses-001_date-01012022/
                └── sub-001_ses-001_date-01012022_video.mp4
or not
.
└── project_name/
    └── raw_data/
        └── sub-001/
            └── ses-001_date-01012022/
                └── video.mp4

My preference is for the longer version. Even though it is ugly, it is completely unambiguous and protects against some possible worst-case-scenario bugs (which should absolutely never happen, but still, e.g. copying session data to the wrong session / subject). I think this is also what neuroimaging BIDS prefers, but I might be wrong.

The problem is that filenames will be created by users, so this will be harder to enforce. We could at least provide some functionality to copy the correct prefix to the clipboard, based on the cwd or something.
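As a rough illustration of the prefix idea, here is a minimal sketch that derives the prefix from the folder names in a path. The function name `filename_prefix` is hypothetical and not part of any existing tool; copying to the clipboard is left out because it would need a third-party package.

```python
from pathlib import Path


def filename_prefix(directory: Path) -> str:
    """Build a 'sub-..._ses-...' prefix from the folder names in a path.

    Hypothetical helper for illustration only.
    """
    parts = directory.parts
    # Use the deepest folder starting with "sub-" and the deepest starting with "ses-".
    sub = next((p for p in reversed(parts) if p.startswith("sub-")), None)
    ses = next((p for p in reversed(parts) if p.startswith("ses-")), None)
    if sub is None or ses is None:
        raise ValueError(f"No sub-/ses- folders found in {directory}")
    return f"{sub}_{ses}"


session_dir = Path("project_name/raw_data/sub-001/ses-001_date-01012022")
print(filename_prefix(session_dir) + "_video.mp4")
# sub-001_ses-001_date-01012022_video.mp4
```

Passing `Path.cwd()` instead of a fixed path would give the prefix for the folder the user is currently working in.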

adamltyson commented 1 year ago

The problem is that filenames will be created by users, so this will be harder to enforce

Can we enforce this at all? What if the data generated is 10k files from some third party software?

There's a discussion to be had as to where we fall on this spectrum:

standard data formats <-> any data format with some standardisation (filename, metadata etc) <-> generic "buckets" (directories) to save data into

JoeZiminski commented 1 year ago

Yes, this is true, we can't enforce it at all. Maybe it is best left up to the user, and we can provide a recommendation.

Yes, this is another key point: how 'far down' the tree we manage and how much we leave to the user. The current preferred folder structure is nice for this, as we can just provide 'behav', 'ephys' at the data type level (i.e. one level below the session level) and leave it at that (rather than the ses-001/ephys/behav/camera etc. that there was before).

For now, we could leave things agnostic below the data-type level? i.e. below that level, it's possible to copy everything in ephys, or behav, or imaging etc. for a selection of subjects and sessions, but there is no finer-grained control (although there is already a function for specifying a full path to a file to transfer).

.
└── project_name/
    └── raw_data/
        └── sub-001/
            ├── ses-001/
            │   ├── behav/
            │   │   ├── video.mp4
            │   │   └── responses.csv
            │   ├── ephys/
            │   │   └── recording.bin
            │   └── imaging/
            │       └── some_filetype.whatever
            └── histology/
                └── brain.tiff

In future either provide options for additional structure if it is very widely used, or alternatively support for creating / searching with custom strings.

Related, is everyone happy with a single folder for histology at the subject level? I think this makes sense

@adamltyson @niksirbi @lauraporta

adamltyson commented 1 year ago

Yes, this is true, we can't enforce it at all. Maybe it is best left up to the user, and we can provide a recommendation.

I agree. Maybe some docs on good practice, and we could even print out a recommended filename string.

For now, we could leave things agnostic below the data-type level?

I think for now this is the best idea.

In future either provide options for additional structure if it is very widely used, or alternatively support for creating / searching with custom strings.

Yep, as the tool is adopted, we could provide support for a limited set of acquisition setups. Bonsai etc.

Related, is everyone happy with a single folder for histology at the subject level? I think this makes sense

Agree. In future we could potentially provide support for whole-brain, sections, spatial-transcriptomics etc.

niksirbi commented 1 year ago

I agree with Adam.

My personal preference would be including sub/ses in the filename, just how BIDS does it, but it will be a headache to enforce for everything. Even BIDS started by supporting a few acquisition types (e.g. T1w, BOLD EPI) before expanding to others.

Context on requirements vs recommendations

When BIDS validates datasets (see bids-validator), it differentiates between REQUIRED, RECOMMENDED, and OPTIONAL. So if a REQUIRED feature is violated, you get an error, whereas if a RECOMMENDED feature is violated you get a warning. We could have a similar stratified system, and as different versions of datashuttle roll out we could promote some RECOMMENDED features to REQUIRED (while ensuring backwards compatibility).

For now I propose the following:

We enforce only the folder structure up to the data type level
We make some recommendations regarding file naming and filetypes
If a specific acquisition type is used very often (e.g. videos saved by bonsai), we can think about specific filenaming schemes for that and ultimately offer functionalities to 'BIDSify' (aka rename) the files.
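
To make the stratified idea concrete, here is a minimal sketch of a check that treats the folder structure as REQUIRED (error) and the sub/ses filename prefix as RECOMMENDED (warning). Everything in it is an assumption for illustration: the `check_file` function, the `DATA_TYPES` set (taken from the example tree earlier in this thread), and the message wording. It is not how bids-validator or datashuttle work, and it ignores the subject-level histology folder for brevity.

```python
from pathlib import PurePosixPath

# Assumed session-level data types, taken from the example tree in this thread.
DATA_TYPES = {"behav", "ephys", "imaging"}


def check_file(relative_path: str) -> list[tuple[str, str]]:
    """Return (level, message) pairs for a file path relative to the raw data folder."""
    parts = PurePosixPath(relative_path).parts
    if len(parts) < 4:
        return [("ERROR", "expected sub-*/ses-*/<data type>/<file>")]

    sub, ses, data_type, filename = parts[0], parts[1], parts[2], parts[-1]
    issues = []

    # REQUIRED: folder structure up to the data type level.
    if not sub.startswith("sub-"):
        issues.append(("ERROR", f"'{sub}' is not a sub- folder"))
    if not ses.startswith("ses-"):
        issues.append(("ERROR", f"'{ses}' is not a ses- folder"))
    if data_type not in DATA_TYPES:
        issues.append(("ERROR", f"'{data_type}' is not a known data type"))

    # RECOMMENDED: filename carries the sub/ses prefix.
    if not filename.startswith(f"{sub}_{ses}_"):
        issues.append(("WARNING", f"'{filename}' lacks the '{sub}_{ses}_' prefix"))

    return issues


for path in [
    "sub-001/ses-001/behav/sub-001_ses-001_video.mp4",
    "sub-001/ses-001/behav/video.mp4",
    "sub-001/ses-001/camera/video.mp4",
]:
    print(path, check_file(path))
```

Promoting a RECOMMENDED feature to REQUIRED in a later version would then only mean changing the level attached to that check.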

Conclusion

The standard itself should be versioned, improved upon, and expanded by trial and error. Let's start small with minimal requirements and see where we go from there.

niksirbi commented 1 year ago

Additionally, we can offer some (non-enforced) guidelines on how to store metadata, e.g. use .csv or .tsv for tables/dataframes and .json files for key-value pairs. If a specific metadata file or table pertains to a specific subject/session/acquisition, its name should reflect that.
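
A minimal sketch of that guideline, using only the Python standard library. The file names, keys, and values are made up for illustration; only the general pattern (key-value pairs in .json, tables in .tsv, subject/session in the file name) comes from the suggestion above.

```python
import csv
import json
import tempfile
from pathlib import Path

prefix = "sub-001_ses-001"  # hypothetical subject/session this metadata belongs to
out_dir = Path(tempfile.mkdtemp())  # temporary folder so the sketch is safe to run

# Key-value metadata goes into a .json file.
json_path = out_dir / f"{prefix}_metadata.json"
json_path.write_text(json.dumps({"experimenter": "example", "frame_rate_hz": 30}, indent=2))

# Tabular metadata goes into a .tsv file.
rows = [{"trial": 1, "response": "left"}, {"trial": 2, "response": "right"}]
tsv_path = out_dir / f"{prefix}_responses.tsv"
with tsv_path.open("w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["trial", "response"], delimiter="\t")
    writer.writeheader()
    writer.writerows(rows)

print(json_path.name, tsv_path.name)
```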

niksirbi commented 1 year ago

Also @JoeZiminski, nitpicking some things I noticed in your example directory trees above:

I would use rawdata, instead of raw_data (to follow BIDS, there is no reason to differentiate ourselves here)
The dates I would write in YYYYMMDD format, because it's the least ambiguous considering international standards, it naturally sorts in chronological order, and it's the BIDS and ISO recommended format. So e.g. ses-002_date-20221110.
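
For example, a session folder name in this format could be built like this (a small sketch; the `session_name` function is hypothetical):

```python
from datetime import date


def session_name(session_number: int, session_date: date) -> str:
    """Format a session folder name with a zero-padded number and a YYYYMMDD date."""
    return f"ses-{session_number:03d}_date-{session_date:%Y%m%d}"


print(session_name(2, date(2022, 11, 10)))
# ses-002_date-20221110

# YYYYMMDD strings sort chronologically when sorted as plain text.
print(sorted(["20221110", "20220101", "20211231"]))
# ['20211231', '20220101', '20221110']
```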

JoeZiminski commented 1 year ago

That's really nice. How do you think it is best to manage the documentation for the standard vs. the datashuttle implementation? Shall we have a single (versioned) help page which introduces BIDS and makes recommendations, and use the current ephys BEP for the formal standard?

Cheers for those points, I will open / amend issues.

niksirbi commented 1 year ago

You are right in the sense that datashuttle (the tool) is not the same thing as the standard. It's probably more a tool that helps you implement the standard.

The standard itself (let's tentatively call it BIDS-SWC) is probably best hosted as a separate repo, which will be solely documentation, similar to bids-specification.

In the future we might also implement a tool like bids-validator to check whether a dataset is BIDS-SWC compliant.

We should of course monitor (and contribute to) BEP029 and BEP032, and strive to converge with them over time. That said, the in-house needs already go beyond these BEPs.

Those were my first thoughts on this, so fully open to counter-points.

JoeZiminski commented 1 year ago

I think that's a good approach, completely agree.

adamltyson commented 1 year ago

In terms of docs, we could have multiple repos containing docs, or directories containing docs. These could all be rendered with Sphinx and hosted using GitHub Pages. Something like:

github.com/neuroinformatics-unit/datashuttle/docs -> neuroinformatics-unit.github.io/datashuttle
github.com/BIDS-SWC -> neuroinformatics-unit.github.io/BIDS-SWC

and other tools e.g.

github.com/neuroinformatics-unit/behaviour-pipeline/docs -> neuroinformatics-unit.github.io/behaviour-pipeline

As adoption increases, the repos and docs could be moved to either SWC, or their own organisation.
