helizabeth1103 commented 4 days ago

Hello,

Previously I had an issue where the parameters I was using were not producing checkpoints in the model training step. I know that choosing parameters has a component of guesswork and iteration, and I was wondering if there are recommendations anywhere on how to choose a starting point for model training parameters, or if there are descriptions somewhere of what changing a particular parameter is likely to do. In the run described below, I am attempting to train the model on an individual using a second individual for the training data and a third individual for the validation data, but my goal is to use multiple individuals for both the training and validation sets, akin to the project described here.

Secondly, I've recently had an issue where I submit scripts to my computing cluster and, although they are granted resources and produce a log file, the log file is still empty after the code has been running for several days, suggesting that no progress has been made or even that the program never initialized. I am also asking my cluster support staff about this, as I suspect it is more likely an issue with resource allocation, but I would very much appreciate it if someone could take a glance at the code I am submitting to make sure there are no obvious causes in the DeepVariant commands that I'm missing.

Thank you very much!

Best,

Haley

Here is the code:

```bash
#!/bin/bash
#SBATCH -p atlas
#SBATCH --time=5-48:00:00 # walltime limit (HH:MM:SS)
#SBATCH --nodes=1 # number of nodes
#SBATCH --gpus-per-node=1 # 20 processor core(s) per node X 2 threads per core
#SBATCH --partition=gpu-a100 # standard node(s)
#SBATCH --ntasks=1
#SBATCH --job-name="deepvariant_modeltraining"
#SBATCH --mail-user=haley.arnold@usda.gov # email address
#SBATCH --mail-type=BEGIN
#SBATCH --mail-type=END
#SBATCH --mail-type=FAIL
#SBATCH --output="deepvariant_modeltrain-%j-%N.out" # job standard output file (%j replaced by job id)
#SBATCH --error="deepvariant_modeltrain-%j-%N.err" # job standard error file (%j replaced by job id)
#SBATCH --account=ag100pest

# LOAD MODULES, INSERT CODE, AND RUN YOUR PROGRAMS HERE

export PATH=$PATH:/project/ag100pest/sratoolkit/sratoolkit.2.10.9-centos_linux64/bin
export PATH=$PATH:/project/ag100pest/sheina.sim/software/miniconda3/bin

export APPTAINER_CACHEDIR=$TMPDIR
export APPTAINER_TMPDIR=$TMPDIR

condapath=/project/ag100pest/sheina.sim/condaenvs
softwarepath=/project/ag100pest/sheina.sim/software
slurmpath=/project/ag100pest/sheina.sim/slurm_scripts

module load apptainer

apptainer exec deepvariant_1.6.0.sif /opt/deepvariant/bin/train \
  --config=/90daydata/pbarc/haley.arnold/AI_Model_Training/Samples/dv_config.py:base \
  --config.train_dataset_pbtxt="/90daydata/pbarc/haley.arnold/AI_Model_Training/Samples/deepvariant_fulltest/output/training_set_channelsize_F1F1shuffle.pbtxt" \
  --config.tune_dataset_pbtxt="/90daydata/pbarc/haley.arnold/AI_Model_Training/Samples/deepvariant_fulltest/output/validation_set_channelsize_F1F2shuffled.pbtxt" \
  --config.init_checkpoint=gs://deepvariant/models/DeepVariant/1.6.1/checkpoints/wgs/deepvariant.wgs.ckpt \
  --config.num_epochs=10 \
  --config.learning_rate=0.02 \
  --config.num_validation_examples=0 \
  --experiment_dir="/90daydata/pbarc/haley.arnold/AI_Model_Training/Samples/deepvariant_fulltest/output/modeltrainout/fullindividualmodel" \
  --strategy=mirrored \
  --config.batch_size=32
```

kishwarshafin commented 2 days ago

@helizabeth1103,

I do not have expertise in running things on clusters. I have never used cluster compute resources, so I am not sure how much I can help you diagnose cluster issues. However, the command itself looks right to me. If the paths are mounted properly, it should be able to train a model without any problem.
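For the point about paths, a minimal sketch of a pre-flight check that could run in the batch script before the `train` command. The helper name `check_inputs` and the file names in the commented example are hypothetical, not part of DeepVariant; the idea is only to fail fast, with a message in the log, if a local input file is missing or empty.

```bash
#!/bin/bash
# Hypothetical helper (not part of DeepVariant): succeed only if every
# argument is an existing, non-empty file, and report each result.
check_inputs() {
  status=0
  for f in "$@"; do
    if [ -s "$f" ]; then
      echo "ok: $f"
    else
      echo "missing or empty: $f" >&2
      status=1
    fi
  done
  return "$status"
}

# Example with placeholder file names; substitute the real .pbtxt and config paths.
# check_inputs training_set.pbtxt validation_set.pbtxt dv_config.py || exit 1
```

This only tests what the host can see. To confirm that a path is also visible from inside the container, the same kind of test can be run through `apptainer exec` on the image. Paths outside the default mounts may need Apptainer's `--bind` option, depending on how the cluster is configured. A `gs://` checkpoint path cannot be checked this way.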

helizabeth1103 commented 2 days ago

Thank you, it turned out to be an issue with the version of apptainer I was using, but I still very much appreciate you taking a look for me!

Do you have any recommendations for how to determine good starting points for parameters in model training?
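To make the question concrete, this is the kind of iteration I have in mind: a dry-run sketch that prints one `train` command per candidate learning rate, each with its own experiment directory. The candidate values, the `BASE_DIR` name, and the `print_sweep` helper are placeholders I made up for illustration, not recommended settings; the flags are the ones from my script above.

```bash
#!/bin/bash
# Dry-run sketch: print (do not execute) one training command per learning rate.
# BASE_DIR and the learning-rate values are hypothetical placeholders.
BASE_DIR="modeltrainout"

# Flags that stay fixed across runs. The --config, dataset, and checkpoint
# flags from the original script would be appended here as well.
COMMON_ARGS="--strategy=mirrored --config.batch_size=32"

print_sweep() {
  for lr in "$@"; do
    # A separate experiment directory per run keeps checkpoints from colliding.
    echo "apptainer exec deepvariant_1.6.0.sif /opt/deepvariant/bin/train" \
         "--config.learning_rate=${lr}" \
         "--experiment_dir=${BASE_DIR}/lr_${lr}" \
         "${COMMON_ARGS}"
  done
}

print_sweep 0.02 0.002 0.0002
```

Each printed line could then be submitted as its own job, so that the runs differ in only one parameter and can be compared directly.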

Thank you!

google / deepvariant

Follow up to issue 797 #840