Ability to run one's own HTCondor instance

Ferryistaken commented 1 month ago

Hello,

I've ran the pipeline without HTCondor up until the processing results part (which I assume is not currently possible without running the pipeline in HTCondor unless I write a custom script that takes the non-HTCondor energize_output and packages it into a database understandable by metl).

From my understanding, it's unfeasible to generate a good enough training set without parallelizing the computation of rosetta's energy parameters for all variants. I've setup my own HTCondor instance to which I'm able to connect a few execute nodes, and would like to run metl-sim on my this cluster. The part that I don't understand is: do I really need to upload rosetta and python to osdf/squid if I'm running the algorithm only on my own machines? Or is there another way (such as adding the rosetta and python env to all execute nodes through my docker-compose)? I might be wrong, but it seems like I would only need to upload to squid if I'm connecting to a highly distributed HTCondor cluster to which I don't have admin privileges to right?

Where in the scripts are the osdf python/rosetta env being accessed? Is there a workaround to skip that step and instead use a local install?

agitter commented 1 month ago

Great question @Ferryistaken. We have been thinking about ways to make melt-sim more accessible to others. Our initial ideas were around OSG, but running HTCondor locally is another approach.

If you are running HTCondor locally, there are a lot of code and data preparation steps you can skip. You do not need to upload anything to osdf/squid, it can stay on your local machine.

Regarding your last question code/energize.py specifies the path to Rosetta:

$ python code/energize.py -h
usage: energize.py [-h] [--rosetta_main_dir ROSETTA_MAIN_DIR]
                   [--variants_fn VARIANTS_FN] [--chain CHAIN]
                   [--pdb_dir PDB_DIR]
                   [--allowable_failure_fraction ALLOWABLE_FAILURE_FRACTION]
                   [--mutate_default_max_cycles MUTATE_DEFAULT_MAX_CYCLES]
                   [--relax_repeats RELAX_REPEATS]
                   [--relax_nstruct RELAX_NSTRUCT]
                   [--relax_distance RELAX_DISTANCE] [--save_wd]
                   [--log_dir_base LOG_DIR_BASE] [--cluster CLUSTER]
                   [--process PROCESS] [--commit_id COMMIT_ID]

 this is the run script that executes on the server

optional arguments:
  -h, --help            show this help message and exit
  --rosetta_main_dir ROSETTA_MAIN_DIR
                        path to the main directory of the rosetta distribution
  --variants_fn VARIANTS_FN
                        path to text file containing protein variants
  --chain CHAIN         the chain to use from the pdb file
  --pdb_dir PDB_DIR     directory containing the pdb files referenced in
                        variants_fn
  --allowable_failure_fraction ALLOWABLE_FAILURE_FRACTION
                        fraction of variants that can fail but still consider
                        this job successful
  --mutate_default_max_cycles MUTATE_DEFAULT_MAX_CYCLES
                        number of optimization cycles in the mutate step
  --relax_repeats RELAX_REPEATS
                        number of FastRelax repeats in the relax step
  --relax_nstruct RELAX_NSTRUCT
                        number of structures (restarts) in the relax step
  --relax_distance RELAX_DISTANCE
                        distance threshold in angstroms for the residue
                        selector in the relax step
  --save_wd             set this flag to save the full working directory for
                        each variant
  --log_dir_base LOG_DIR_BASE
                        base output directory where log dirs for each run will
                        be placed
  --cluster CLUSTER     cluster (when running on HTCondor)
  --process PROCESS     process (when running on HTCondor)
  --commit_id COMMIT_ID
                        the github commit id corresponding to this version of
                        the code

@samgelman what other steps can @Ferryistaken skip in the setup?

samgelman commented 1 month ago

To add to what @agitter said, you should be able to use a local HTCondor with modifications to the existing framework.

The key files are:

The script condor.py that puts together everything needed for a condor run. It currently assumes a run on CHTC/OSG/OSDF, but it should be possible to use it for local runs with some modifications.
The submit file energize.sub. This will need to be modified so the job is correctly submitted to the local HTCondor installation. I'm guessing some flags like +WantFlocking = true and +WantGlideIn = true, which are currently set in the file, won't be needed for a local run. This file also contains some keywords like {osdf_rosetta_distribution}, {osdf_python_distribution}, and {transfer_input_files}. These keywords get replaced with actual values by the condor.py script. These specify file paths to the execute nodes for the packaged Rosetta and Python distributions and additional input files. You may be able to leave these as-is and just specify local paths instead of OSDF paths in osdf_rosetta_distribution.txt and osdf_python_distribution.txt
The shell script which runs on the execute node run.sh. This file sets up the environment for Rosetta and launches energize.py with the appropriate arguments. Depending on how you package and transfer Python and Rosetta, this file will need to be modified.

Hopefully this is enough to get started, and if you encounter additional questions, I would be happy to help.

samgelman commented 1 month ago

I wanted to add that if you plan to have a local install of Python and Rosetta on each execute node, then you won't need to transfer those from the submit node. You would need to modify the files I listed above, especially run.sh, to assume that Python/Rosetta are already installed on the execute nodes, rather than needing to be packaged and set up.

Ferryistaken commented 1 month ago

Yes, my current architecture involves having the python and rosetta installs on each execute node, and a modification to the run.sh and energize.sub to account for that. I might work on making a script and doing a PR if you guys would be interested.

agitter commented 1 month ago

@Ferryistaken I would be interested in having you contribute your solution back to this repo if you get everything working. We would need to decide the best way to organize that based on how many files you modified and how much they changed.

gitter-lab / metl-sim

Ability to run one's own HTCondor instance #3