Directory structure
A top-level structure matching dataset names is good, but under that we should probably have dedicated directories for the inputs to pre-processing and for the outputs of pre-processing scripts, so that the two are segregated either way.
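As a minimal sketch of the idea, the following creates one such layout. All directory names below ("dataset_a", "raw", "preprocessed") are illustrative assumptions, not settled conventions:

```python
from pathlib import Path

# Hypothetical layout: dataset name at the top level, with separate
# subdirectories segregating pre-processing inputs from outputs.
LAYOUT = [
    "dataset_a/raw",           # inputs to pre-processing
    "dataset_a/preprocessed",  # outputs of pre-processing scripts
    "dataset_b/raw",
    "dataset_b/preprocessed",
]

def create_layout(root: str) -> list[Path]:
    """Create the directory tree under root and return the paths made."""
    made = []
    for rel in LAYOUT:
        p = Path(root) / rel
        p.mkdir(parents=True, exist_ok=True)
        made.append(p)
    return made
```

Whatever names we settle on, the point is that input and output trees never mix under a dataset.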
Pre-processing
Establish criteria for when we use pre-processing so we don't overuse it, e.g. only for joining multiple files into one file to work around our limited pipelining configuration system. Maybe this can be limited to zipping the two files together so that they can be fetched and handled by a custom pipeline step.
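The "zip the two files together" option could be as small as the sketch below, using the standard-library zipfile module; the function and file names are illustrative, not an existing interface:

```python
import zipfile
from pathlib import Path

def bundle_for_pipeline(file_a: str, file_b: str, out_zip: str) -> str:
    """Zip exactly two input files into one archive so a single
    custom pipeline step can fetch and unpack them together.
    (Name and signature are hypothetical.)"""
    with zipfile.ZipFile(out_zip, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for f in (file_a, file_b):
            # Store by basename so the archive layout is flat.
            zf.write(f, arcname=Path(f).name)
    return out_zip
```

Keeping the helper this narrow (exactly two files, no transformation) is itself a way of enforcing the "don't overuse pre-processing" criterion.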
Do we care?
We hope, in our next grant, to migrate away from our current processing paradigm to doing all processing in cloud services. Does this matter enough to tackle in the short/medium term?