htyao89 / Textual-based_Class-aware_prompt_tuning

MIT License

Question about Loss is infinite or NaN #3

Open. Lilzhuzixi opened this issue 1 month ago

Lilzhuzixi commented 1 month ago

Dear author, I am an entry-level novice and would like to ask for your advice. I modified the .sh file to run on oxford_flowers, but after starting the run the following error was reported. I am looking forward to your reply.

Traceback (most recent call last):
  File "train.py", line 238, in <module>
    main(args)
  File "train.py", line 165, in main
    trainer.train()
  File "d:\pycharmprojects\coop\coop-main\dassl.pytorch\dassl\engine\trainer.py", line 393, in train
    super().train(self.start_epoch, self.max_epoch)
  File "d:\pycharmprojects\coop\coop-main\dassl.pytorch\dassl\engine\trainer.py", line 256, in train
    self.run_epoch()
  File "d:\pycharmprojects\coop\coop-main\dassl.pytorch\dassl\engine\trainer.py", line 603, in run_epoch
    loss_summary = self.forward_backward(batch)
  File "D:\PycharmProjects\Textual-based_Class-aware_prompt_tuning-main\trainers\tcp.py", line 328, in forward_backward
    self.model_backward_and_update(loss)
  File "d:\pycharmprojects\coop\coop-main\dassl.pytorch\dassl\engine\trainer.py", line 308, in model_backward_and_update
    self.model_backward(loss)
  File "d:\pycharmprojects\coop\coop-main\dassl.pytorch\dassl\engine\trainer.py", line 297, in model_backward
    self.detect_anomaly(loss)
  File "d:\pycharmprojects\coop\coop-main\dassl.pytorch\dassl\engine\trainer.py", line 229, in detect_anomaly
    raise FloatingPointError("Loss is infinite or NaN!")
FloatingPointError: Loss is infinite or NaN!

This is base2new_train_flowers.sh.

#!/bin/bash
# custom config
DATA=DATA
TRAINER=TCP
WEIGHT=1.0

CFG=vit_b16_ep100_ctxv1
CTP=end  # class token position (end or middle)
NCTX=4  # number of context tokens
SHOTS=16  # number of shots (1, 2, 4, 8, 16)
CSC=False  # class-specific context (False or True)
FOLDER=output_flowers

for SEED in 1 2 3
do
    DIR=${FOLDER}_${NCTX}/base2new/train_base/oxford_flowers/shots_${SHOTS}_${WEIGHT}/${TRAINER}/${CFG}/seed${SEED}
    if [ -d "$DIR" ]; then
        echo "Results are available in ${DIR}. Skip this job"
    else
        echo "Run this job and save the output to ${DIR}"
        export CUDA_VISIBLE_DEVICES=0  # use 'export' in bash ('set VAR=value' is cmd.exe syntax)
        python train.py \
        --root ${DATA} \
        --seed ${SEED} \
        --trainer ${TRAINER} \
        --dataset-config-file configs/datasets/oxford_flowers.yaml \
        --config-file configs/trainers/${TRAINER}/${CFG}.yaml \
        --output-dir ${DIR} \
        TRAINER.COOP.N_CTX ${NCTX} \
        TRAINER.COOP.CSC ${CSC} \
        TRAINER.COOP.W ${WEIGHT} \
        TRAINER.COOP.CLASS_TOKEN_POSITION ${CTP} \
        DATASET.NUM_SHOTS ${SHOTS} \
        DATASET.SUBSAMPLE_CLASSES base
    fi
done
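
For context, the check that raises this error is essentially a finiteness test on the scalar loss before the backward step. A minimal sketch of the idea in plain PyTorch (not the exact Dassl code):

import torch

def detect_anomaly(loss: torch.Tensor) -> None:
    # Same idea as Dassl's check: a NaN or +/-inf loss aborts training
    # instead of silently corrupting the weights on the next update.
    if not torch.isfinite(loss).all():
        raise FloatingPointError("Loss is infinite or NaN!")

try:
    detect_anomaly(torch.tensor(float("inf")))  # e.g. an fp16 overflow
except FloatingPointError as err:
    print(err)  # prints: Loss is infinite or NaN!

So the error means the training loss itself became non-finite; the traceback only shows where it was detected, not where it was produced.
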
htyao89 commented 1 month ago

(quoted the original post and script above)

Have you tried adjusting the EPS in the optimizer?

[screenshot attached]
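
For reference, a larger eps in Adam keeps the denominator of the adaptive update away from zero, which is a common remedy when training produces NaNs. A minimal sketch in plain PyTorch (the module and learning rate below are placeholders, not this repo's actual values):

import torch
from torch import nn

# Placeholder module standing in for the learnable prompt parameters.
prompt_learner = nn.Linear(512, 102)

# Adam's default eps is 1e-8; raising it (e.g. to 1e-3) keeps the adaptive
# denominator away from zero when second-moment estimates become tiny.
optimizer = torch.optim.Adam(prompt_learner.parameters(), lr=2e-3, eps=1e-3)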

Lilzhuzixi commented 1 month ago

(quoted the exchange above)

Yes, I have adjusted eps to 1e-3.

htyao89 commented 1 month ago

(quoted the exchange above)

In my experiments, NaN losses have always been caused by the Adam optimizer. Can you provide the log file?
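
Until then, one way to localize where the NaN first appears is PyTorch's anomaly mode, optionally combined with gradient clipping and skipping non-finite batches. A rough sketch with toy stand-ins for the trainer objects (model, optimizer, and data are placeholders):

import torch
from torch import nn

# Toy stand-ins; in the real run these come from the trainer.
model = nn.Linear(16, 4)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, eps=1e-3)
torch.autograd.set_detect_anomaly(True)  # reports the backward op that first produced NaN/inf

for step in range(10):
    x = torch.randn(8, 16)
    y = torch.randint(0, 4, (8,))
    loss = nn.functional.cross_entropy(model(x), y)
    if not torch.isfinite(loss):
        print(f"step {step}: non-finite loss {loss.item()}, skipping batch")
        continue
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # keep gradients bounded
    optimizer.step()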