Open asogaard opened 2 years ago
Removing this from the v1.0.0 roadmap, as analyses have been moved to graphnet-team/analyses. While this issue is still very important, it is not a necessary condition for reaching v1.0.0.
Just giving some input: we have been organizing our directories as follows:
analyses/
|-- my-analysis/
|   |-- data/                                   # data staging
|   |   |-- 01-pipeline_<data1>.sh              # creates a db based on data1
|   |   |-- 02-pipeline_<data2>.sh              # creates a db based on data2
|   |   |-- 03-convert_i3_files.py              # converts data based on parameters in bash script
|   |-- deployment/                             # deploying a trained model
|   |   |-- 01-pipeline_inference_<data1>.sh    # performs inference on data1
|   |   |-- 02-run_inference_on_sqlite.py       # runs inference based on parameters in bash script
|   |-- modelling/                              # training a model
|   |   |-- 01-pipeline_training_<data1>.sh                       # performs training based on data1
|   |   |-- 02-train_reconstruction_individual_azimuth_zenith.py  # specific task
|   |   |-- 03-train_reconstruction_joined_azimuth_zenith.py      # specific task
|   |-- plotting/                               # plotting based on all of the above
|       |-- plot_all.sh
|       |-- distribution/
|       |   |-- 01-plot_all_in_folder.sh
|       |   |-- 02-pulse_count_and_duration_distribution.py
|       |   |-- 03-more_plotting_scripts.py
|       |-- reconstruction/
|       |   |-- 01-bash.sh
|       |   |-- 02-plotting_script.py
|       |   |-- 03-more_plotting_scripts.py
|       |-- other_plot_categories/
`-- another-analysis/
Optimally, a folder contains one script for a given task, e.g. converting I3 files to the database. Bash files are then created per action, i.e. if I have multiple datasets I would use multiple shell scripts. Right now we use argparse for the input of variables, although I have actually made a branch with a Typer-based class that works.
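For illustration only, a minimal sketch of what a Typer-based entry point for the conversion step could look like; the function and option names below are hypothetical and not the actual branch implementation:

# Hypothetical Typer CLI for the I3-file conversion step (illustrative only).
from typing import List

import typer

app = typer.Typer()


@app.command()
def convert(
    db: str = typer.Option(..., help="Directory containing I3-files, incl. GCD file."),
    output: str = typer.Option(..., help="Output directory for the created database."),
    pulsemaps: List[str] = typer.Option(..., help="Pulsemap names to extract."),
) -> None:
    """Convert I3-files to an intermediate (e.g. SQLite) database."""
    typer.echo(f"Converting {db} -> {output} using pulsemaps {pulsemaps}")
    # The actual graphnet conversion call would go here.


if __name__ == "__main__":
    app()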
Below is a pseudo-code example template I made in our analysis folder for I3-file conversion:
#!/bin/bash
# Description: <this is a template, add relevant description here>
# to recreate the results, follow the steps below by designating;
# (1) directory containing I3-files, including gcd file.
database_directory=/groups/icecube/petersen/GraphNetDatabaseRepository/...
# (2) output directory for created database.
output_directory=/groups/icecube/petersen/GraphNetDatabaseRepository/...
# (3) report output location and name
report_directory=/groups/icecube/${USER}/storage/nohup_reports/
report_name=<name of report>
# (4) pulsemaps to extract using the feature extractor; found by inspecting the I3-files with dataio-shovel in IceTray.
pulsemaps=(pulse1 pulse2 ... pulsen)  # space-separated bash array (no commas)
# (5) run shell in terminal using "bash <path_to_file>.sh"
## do not alter beyond this point ##
# date for report name
TIMESTAMP=$(date "+%H%M%S")
# if the directories for reporting and output do not exist, create them.
mkdir -p ${output_directory};
mkdir -p ${report_directory};
# save the report file to:
report_location=${report_directory}${report_name}${TIMESTAMP}.out
# Start IceTray and run the conversion script inside its environment.
eval `/cvmfs/icecube.opensciencegrid.org/py3-v4.1.0/setup.sh`
# env-shell.sh wraps the python command so that it actually runs inside IceTray;
# the script can be run from anywhere, as long as the bash file sits next to convert_i3_files.py.
nohup /cvmfs/icecube.opensciencegrid.org/py3-v4.1.0/RHEL_7_x86_64/metaprojects/combo/stable/env-shell.sh \
    python "$(dirname -- "$(readlink -f "${BASH_SOURCE}")")/convert_i3_files.py" \
    -db ${database_directory} \
    --output ${output_directory} \
    --pulsemaps "${pulsemaps[@]}" \
    > ${report_location} 2>&1 &
# exit once the job has been launched in the background
exit
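For context, here is a hypothetical stub of what convert_i3_files.py could look like on the Python side, matching the -db/--output/--pulsemaps flags used in the template above; the actual script in our analysis folder may differ:

# Hypothetical argparse stub for convert_i3_files.py (flags taken from the bash template).
import argparse


def main() -> None:
    parser = argparse.ArgumentParser(description="Convert I3-files to a database.")
    parser.add_argument(
        "-db", "--database-directory",
        required=True,
        help="Directory containing I3-files, including the GCD file.",
    )
    parser.add_argument(
        "--output",
        required=True,
        help="Output directory for the created database.",
    )
    parser.add_argument(
        "--pulsemaps",
        nargs="+",
        required=True,
        help="Pulsemap names to extract with the feature extractor.",
    )
    args = parser.parse_args()

    # The actual graphnet conversion call would go here, using
    # args.database_directory, args.output, and args.pulsemaps.


if __name__ == "__main__":
    main()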
Would it be an idea to divide the scripts in ../examples/ into categories matching an analysis folder-structure template?
The repo should be able to support entire analysis workflows, from data staging (from Cobalt) through data conversion, model training, deployment, inference, plotting, etc. Furthermore, it would be beneficial if entire analysis workflows (e.g. corresponding to physics analyses, training of optimised models, etc.) were unambiguously defined and fully reproducible. This may be as simple as having a directory structure with numbered bash scripts that can be run in sequence. Alternatively, we could look into tools like Apache Airflow for configuring and managing workflows as directed acyclic graphs (DAGs), or even use GitHub Actions directly with self-hosted runners (with GPU support).
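To make the Airflow option concrete, below is a minimal, purely illustrative DAG sketch that chains numbered bash scripts from a directory layout like the one discussed above; all paths, task names, and the use of BashOperator are assumptions, not a proposed implementation:

# Hypothetical Airflow DAG chaining the steps of one analysis (illustrative only).
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="my_analysis_workflow",
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,  # triggered manually
    catchup=False,
) as dag:
    convert = BashOperator(
        task_id="convert_i3_files",
        bash_command="bash analyses/my-analysis/data/01-pipeline_data1.sh",
    )
    train = BashOperator(
        task_id="train_model",
        bash_command="bash analyses/my-analysis/modelling/01-pipeline_training_data1.sh",
    )
    infer = BashOperator(
        task_id="run_inference",
        bash_command="bash analyses/my-analysis/deployment/01-pipeline_inference_data1.sh",
    )
    plot = BashOperator(
        task_id="make_plots",
        bash_command="bash analyses/my-analysis/plotting/plot_all.sh",
    )

    # Declare the DAG: data conversion -> training -> inference -> plotting.
    convert >> train >> infer >> plot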
If people have any opinions on this, please leave your thoughts in the comments below.