kuleshov-group / caduceus

Bi-Directional Equivariant Long-Range DNA Sequence Modeling

Question about pre-training #11

Closed: leannmlindsey closed this issue 5 months ago

leannmlindsey commented 6 months ago

Thank you for providing such a well-documented model with all of the pretraining code!

I am currently using the provided SLURM scripts to pre-train, and I have a few questions.

You said in a previous issue that the models in the paper (1.9M parameters and 470k parameters) were trained for 10k steps. When I look at the log files, I see the following metrics for the 1.9M parameter model:

start (after 1 step): val/loss=0.970 val/perplexity=2.640 test/loss=0.964 test/perplexity=2.620

current (10268 steps): val/loss=0.956 val/perplexity=2.600 test/loss=0.951 test/perplexity=2.590

I don't see a big difference in the loss or perplexity between step 1 and step 10k. Is this consistent with what you saw in your training? How did you decide to stop training at 10k steps?
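
(For context, the logged perplexity appears to simply be exp(loss), so the two metrics move in lockstep. A quick check of the numbers above, a sketch using awk for the exponential:)

awk 'BEGIN { printf "exp(0.970)=%.3f  exp(0.956)=%.3f\n", exp(0.970), exp(0.956) }'
# prints ~2.638 and ~2.601, matching the logged perplexities of 2.640 and 2.600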

yair-schiff commented 6 months ago

Looking at my training runs, the validation loss at 2k steps is 0.997 and at 10k steps it is 0.9521, which is a meaningful reduction. I don't believe I ever evaluated after 1 step, so I don't have that number to compare to directly, but I am somewhat surprised by the value of 0.970 you report, which essentially corresponds to a near-random initialization.

10k steps was chosen based on speed of experimental iteration and to be somewhat comparable to HyenaDNA for similar-sized models.

leannmlindsey commented 6 months ago

Thank you for your quick reply.

Yes, I was also surprised. Perhaps I am misinterpreting the log files in the watch_folder directory? I was looking at the very first logged value and assuming that it was "step 1".

yair-schiff commented 6 months ago

I think the first logged eval values should correspond to max_steps // 5 steps of training. If you can provide a bit more information about the command you used to launch the pre-training experiment and some of the log outputs as well, we can try to verify this.
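
(For reference, a minimal sketch of where that number comes from, assuming the launch script's +trainer.val_check_interval=$(( MAX_STEPS / 5 )) override, as in the script posted below:)

MAX_STEPS=10000
VAL_CHECK_INTERVAL=$(( MAX_STEPS / 5 ))   # value passed as +trainer.val_check_interval
echo "first val/test metrics are logged at step ${VAL_CHECK_INTERVAL}"   # -> step 2000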

leannmlindsey commented 6 months ago

I used the run_pretrain_caduceus.sh script provided.
I was wondering if perhaps it was loading a pre-trained model instead of a randomly initialized model. Could that be the case?

I made the following changes to run in our environment:

#SBATCH --get-user-env                      # Retrieve the users login environment
#SBATCH --account=soc-gpu-np
#SBATCH --partition=soc-gpu-np
#SBATCH --qos=soc-gpulong-np
#SBATCH -t 72:00:00                         # Time limit (hh:mm:ss)
#SBATCH --mem=100G                          # RAM
#SBATCH --gres=gpu:a6000:8                        # Number of GPUs
#SBATCH --ntasks-per-node=8                 # Should correspond to num devices (at least 1-1 task to GPU)
##SBATCH --cpus-per-task=4                   # Number of CPU cores per task
#SBATCH -N 1                                # Number of nodes
#SBATCH --requeue                           # Requeue job if it fails
#SBATCH --job-name=caduceus_ps              # Job name
#SBATCH --output=../watch_folder/%x_%j.log  # Log file
#SBATCH --open-mode=append                  # Do not overwrite logs

module load cuda
nvidia-smi
source activate CADUCEUS_3

# Setup environment
#cd ../ || exit  # Go to the root directory of the repo
#source setup_env.sh
cd /uufs/chpc.utah.edu/common/home/u1323098/sundar-group-space2/PHAGE/MODELS/CADUCEUS/caduceus
export HYDRA_FULL_ERROR=1

NUM_DEVICES=8

# Run script
SEQLEN=1024
MAX_STEPS=10000
D_MODEL=256
N_LAYER=4
LR="8e-3"
BIDIRECTIONAL_STRATEGY="add"
BIDIRECTIONAL_WEIGHT_TIE="true"
RCPS="true"
RC_AUG="false"

BATCH_SIZE=$(( 1048576 / SEQLEN ))
SEQLEN_DIS="$(echo "scale=0; ${SEQLEN} / 1000" | bc)k"
WANDB_NAME="caduceus_ps_seqlen-${SEQLEN_DIS}_d_model-${D_MODEL}_n_layer-${N_LAYER}_lr-${LR}"
HYDRA_RUN_DIR="./outputs/pretrain/hg38/${WANDB_NAME}"

mkdir -p "${HYDRA_RUN_DIR}"
srun python -m train \
  experiment=hg38/hg38 \
  callbacks.model_checkpoint_every_n_steps.every_n_train_steps=500 \
  dataset.max_length=${SEQLEN} \
  dataset.batch_size=$(( BATCH_SIZE / NUM_DEVICES )) \
  dataset.mlm=true \
  dataset.mlm_probability=0.15 \
  dataset.rc_aug="${RC_AUG}" \
  model="caduceus" \
  model.config.d_model=${D_MODEL} \
  model.config.n_layer=${N_LAYER} \
  model.config.bidirectional=true \
  model.config.bidirectional_strategy=${BIDIRECTIONAL_STRATEGY} \
  model.config.bidirectional_weight_tie=${BIDIRECTIONAL_WEIGHT_TIE} \
  model.config.rcps=${RCPS} \
  optimizer.lr="${LR}" \
  train.global_batch_size=${BATCH_SIZE} \
  trainer.max_steps=${MAX_STEPS} \
  trainer.devices=${NUM_DEVICES} \
  +trainer.val_check_interval=$(( MAX_STEPS / 5 )) \
  wandb.group=pretrain_hg38 \
  wandb.name="${WANDB_NAME}" \
  hydra.run.dir="${HYDRA_RUN_DIR}"
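
(For what it's worth, the values the script derives from these settings work out as follows; a sketch, assuming SEQLEN=1024 and NUM_DEVICES=8 as above:)

SEQLEN=1024; NUM_DEVICES=8
BATCH_SIZE=$(( 1048576 / SEQLEN ))   # 1024 sequences per global batch (train.global_batch_size)
echo "per-device batch size: $(( BATCH_SIZE / NUM_DEVICES ))"   # 128 with 8 GPUs (dataset.batch_size)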

yair-schiff commented 5 months ago

> I was wondering if perhaps it was loading a pre-trained model instead of a randomly initialized model. Could that be the case?

Unless you have a checkpoint file at checkpoints/last.ckpt under your HYDRA_RUN_DIR, this should not be loading any pre-trained models.
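
(A quick way to check; a sketch where the path below is derived from the HYDRA_RUN_DIR in your script, so adjust it if yours differs:)

HYDRA_RUN_DIR="./outputs/pretrain/hg38/caduceus_ps_seqlen-1k_d_model-256_n_layer-4_lr-8e-3"
if [ -f "${HYDRA_RUN_DIR}/checkpoints/last.ckpt" ]; then
  echo "Found last.ckpt: training would resume from this checkpoint rather than a random init."
else
  echo "No last.ckpt found: training starts from a random initialization."
fi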

Based on the above, I believe the first time this logs val/test metrics is after 2k training steps. The different model hyperparameters you used (e.g., N_LAYER) could account for why you see different metrics than the ones I posted. In any case, I think the validation loss curve you describe is reasonable.