neuroinformatics-unit / datashuttle

Tool for the creation, validation and transfer of neuroscience project folders.
https://datashuttle.neuroinformatics.dev/
BSD 3-Clause "New" or "Revised" License

[Validation] Move to a new package the logic behind folder naming and validation #198

Open lauraporta opened 11 months ago

lauraporta commented 11 months ago

Is your feature request related to a problem? Please describe. Multiple projects need to infer folder names according to the NeuroBlueprint specification for automated data processing and reading. Datashuttle currently owns part of the implementation required for generating and validating these names. However, other projects are trying to handle the same technical necessities:

Describe the solution you'd like As discussed in person, we could export this functionality as a standalone package. The package should enable the user to:

Describe alternatives you've considered An alternative could be importing datashuttle and using a dedicated module.

JoeZiminski commented 11 months ago

Thanks Laura! I would propose the below plan:

1) Continue to list here and in #199 the required functionality, as you have done.
2) Factor out / create all of this functionality into datashuttle modules that implement it in functions.
3) Try to use these from within datashuttle and see what the best approach is to structure it as a standalone module. There are a few possible options here, as you note, and it's hard to tell at this stage what will be the best pattern going forward, but this is a good place to discuss.

(a bit of a tangent, looking forward)

For example, to perform validation (I think) it will always be necessary to hold the concept of the project root folder. We will also want a nice API that avoids making lots of calls to disparate functions, each of which would need the project root passed to it. We will probably end up building a class that takes a project root and then exposes convenience functions like project.make_sub_folder. In this case, project is an instance of a lightweight class that takes only a project root as input. But then this becomes quite close to what datashuttle already is, only with restricted functionality, so duplication and divergence may become an issue.
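To make the idea concrete, here is a minimal sketch of what such a lightweight class could look like. The class name, the make_sub_folder signature and the folder layout are assumptions for illustration, not existing datashuttle API:

```python
from pathlib import Path


class LocalProject:
    """Hypothetical lightweight class holding only the project root."""

    def __init__(self, project_root):
        self.project_root = Path(project_root)

    def make_sub_folder(self, sub, ses=None, datatype=None):
        """Create e.g. rawdata/sub-001/ses-001/behav under the project root."""
        parts = [p for p in (sub, ses, datatype) if p is not None]
        folder = self.project_root.joinpath("rawdata", *parts)
        folder.mkdir(parents=True, exist_ok=True)
        return folder


project = LocalProject("/data/my_project")
project.make_sub_folder("sub-001", "ses-001", "behav")
```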

In this case it might be easier to make a common base class and have two datashuttle child classes: one that takes a project path and provides basic functionality (make folders, validation), and another that takes a project name and extends this to the full functionality. rclone is the only heavy dependency (i.e. it requires a conda install) and is required only for data transfer, so you could perform folder creation and validation without it. However, this pattern will probably be extremely confusing for users!
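As a rough sketch of that inheritance pattern (class and method names are invented for illustration; this is not the current datashuttle class layout):

```python
from pathlib import Path


class ProjectBase:
    """Hypothetical common base: folder creation and validation only."""

    def __init__(self, project_path):
        self.project_path = Path(project_path)

    def make_folders(self, sub, ses=None):
        """Create canonical folders under the project path (omitted here)."""

    def validate_project(self):
        """Check folder names against the NeuroBlueprint specification (omitted)."""


class LocalOnlyProject(ProjectBase):
    """Takes only a project path; works without configs or rclone."""


class DataShuttle(ProjectBase):
    """Takes a project name; adds configs and rclone-based transfer."""

    def __init__(self, project_name):
        # Placeholder: the real class would resolve the path from stored configs.
        super().__init__(Path.home() / project_name)

    def upload_data(self):
        """Transfer data via rclone (omitted in this sketch)."""
```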

adamltyson commented 11 months ago

> However, this pattern will probably be extremely confusing for users!

I think this could be key. It may be that having everything within DataShuttle makes the most sense for us, but a separate package (e.g. neuroblueprint-api) may be clearer for other users who want to adopt the spec.

JoeZiminski commented 11 months ago

As discussed in #197, look into repeated arguments shared between functions used in validation / file discovery, e.g. verbose. These are passed around a lot and their docstrings are repeated. If we factor this into a class during the restructuring, it will be easier to hold such variables as class attributes.
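For example (a sketch only; the class and method names are placeholders, not existing datashuttle code), holding verbose as an attribute means it no longer needs to be passed to, and documented on, every function:

```python
from pathlib import Path


class ProjectValidator:
    """Sketch: shared options such as `verbose` live on the class rather than
    being threaded through every validation / file-discovery function."""

    def __init__(self, project_root, verbose=False):
        self.project_root = Path(project_root)
        self.verbose = verbose

    def discover_subject_folders(self):
        if self.verbose:
            print(f"Searching {self.project_root / 'rawdata'} ...")
        return sorted((self.project_root / "rawdata").glob("sub-*"))

    def validate_names(self):
        for path in self.discover_subject_folders():
            if self.verbose:
                print(f"Checking {path.name}")
            # name-checking logic would go here
```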

JoeZiminski commented 9 months ago

Thinking more about this after a recent refactoring of validation (#261), it would be great to get a few more @neuroinformatics-unit/neuroinformatics-all thoughts on what this package might / might not implement from datashuttle.

Currently, datashuttle does some annoying stuff if you want to use it for quickly making and validating a local project, e.g.:

I think something super-quick and easy to use would be preferred, where all you need to feed it is the project name (if that) and the path to rawdata. It would not log, nor require any config creation. It would provide make-folders, some validation functions and some convenience functions. Would this work?
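Something along these lines, sketched with an invented QuickProject stub so the intended workflow is concrete (none of the names below are existing datashuttle API):

```python
from pathlib import Path


class QuickProject:
    """Stub for the proposed lightweight interface: no configs, no logging."""

    def __init__(self, rawdata_path, project_name=None):
        self.rawdata_path = Path(rawdata_path)
        self.project_name = project_name

    def make_folders(self, sub, ses=None, datatype=None):
        parts = [p for p in (sub, ses, datatype) if p is not None]
        self.rawdata_path.joinpath(*parts).mkdir(parents=True, exist_ok=True)

    def validate(self):
        """Would check names against NeuroBlueprint; returns a list of issues."""
        return []


# Point it at rawdata and go: no config creation, no logging.
project = QuickProject("/data/my_project/rawdata", project_name="my_project")
project.make_folders("sub-001", "ses-001", "behav")
issues = project.validate()
```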

Also, datashuttle does not currently hold much information about the specific files / folders in a project. The closest is building a list of paths to transfer based on user input in data_transfer.py (but this code is not very nice). There is no persistent tree-like object tracking all the files / folders in the project. Is this something people require or are already doing?
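For reference, such a persistent tree could be as simple as a nested dict built from the filesystem; this is a hypothetical sketch, not something datashuttle currently provides:

```python
from pathlib import Path


def build_project_tree(root):
    """Build a nested dict mirroring the on-disk project, so files and
    folders can be tracked in memory (files become empty-dict leaves)."""
    root = Path(root)
    tree = {}
    for path in sorted(root.rglob("*")):
        node = tree
        for part in path.relative_to(root).parts:
            node = node.setdefault(part, {})
    return tree


# e.g. {"rawdata": {"sub-001": {"ses-001": {"behav": {}}}}}
tree = build_project_tree("/data/my_project")
```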