htyao89 / Textual-based_Class-aware_prompt_tuning

MIT License

Question about Loss is infinite or NaN #3

Open. Lilzhuzixi opened this issue 1 month ago

Lilzhuzixi commented 1 month ago

Dear author, I am an entry-level novice and would like to ask for your advice. I modified the .sh file to run on oxford_flowers, but after starting the run the following error was reported. I am looking forward to your reply.

Traceback (most recent call last):
  File "train.py", line 238, in <module>
    main(args)
  File "train.py", line 165, in main
    trainer.train()
  File "d:\pycharmprojects\coop\coop-main\dassl.pytorch\dassl\engine\trainer.py", line 393, in train
    super().train(self.start_epoch, self.max_epoch)
  File "d:\pycharmprojects\coop\coop-main\dassl.pytorch\dassl\engine\trainer.py", line 256, in train
    self.run_epoch()
  File "d:\pycharmprojects\coop\coop-main\dassl.pytorch\dassl\engine\trainer.py", line 603, in run_epoch
    loss_summary = self.forward_backward(batch)
  File "D:\PycharmProjects\Textual-based_Class-aware_prompt_tuning-main\trainers\tcp.py", line 328, in forward_backward
    self.model_backward_and_update(loss)
  File "d:\pycharmprojects\coop\coop-main\dassl.pytorch\dassl\engine\trainer.py", line 308, in model_backward_and_update
    self.model_backward(loss)
  File "d:\pycharmprojects\coop\coop-main\dassl.pytorch\dassl\engine\trainer.py", line 297, in model_backward
    self.detect_anomaly(loss)
  File "d:\pycharmprojects\coop\coop-main\dassl.pytorch\dassl\engine\trainer.py", line 229, in detect_anomaly
    raise FloatingPointError("Loss is infinite or NaN!")
FloatingPointError: Loss is infinite or NaN!

This is base2new_train_flowers.sh.

#!/bin/bash
# custom config
DATA=DATA
TRAINER=TCP
WEIGHT=1.0

CFG=vit_b16_ep100_ctxv1
CTP=end  # class token position (end or middle)
NCTX=4  # number of context tokens
SHOTS=16  # number of shots (1, 2, 4, 8, 16)
CSC=False  # class-specific context (False or True)
FOLDER=output_flowers

for SEED in 1 2 3
do
    DIR=${FOLDER}_${NCTX}/base2new/train_base/oxford_flowers/shots_${SHOTS}_${WEIGHT}/${TRAINER}/${CFG}/seed${SEED}
    if [ -d "$DIR" ]; then
        echo "Results are available in ${DIR}. Skip this job"
    else
        echo "Run this job and save the output to ${DIR}"
        export CUDA_VISIBLE_DEVICES=0  # use 'export' in bash ('set VAR=value' is cmd.exe syntax)
        python train.py \
        --root ${DATA} \
        --seed ${SEED} \
        --trainer ${TRAINER} \
        --dataset-config-file configs/datasets/oxford_flowers.yaml \
        --config-file configs/trainers/${TRAINER}/${CFG}.yaml \
        --output-dir ${DIR} \
        TRAINER.COOP.N_CTX ${NCTX} \
        TRAINER.COOP.CSC ${CSC} \
        TRAINER.COOP.W ${WEIGHT} \
        TRAINER.COOP.CLASS_TOKEN_POSITION ${CTP} \
        DATASET.NUM_SHOTS ${SHOTS} \
        DATASET.SUBSAMPLE_CLASSES base
    fi
done
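
For context, the check that raises this error is essentially a finiteness test on the scalar loss before the backward step. A minimal sketch of the idea in plain PyTorch (not the exact Dassl code):

import torch

def detect_anomaly(loss: torch.Tensor) -> None:
    # Same idea as Dassl's check: a NaN or +/-inf loss aborts training
    # instead of silently corrupting the weights on the next update.
    if not torch.isfinite(loss).all():
        raise FloatingPointError("Loss is infinite or NaN!")

try:
    detect_anomaly(torch.tensor(float("inf")))  # e.g. an fp16 overflow
except FloatingPointError as err:
    print(err)  # prints: Loss is infinite or NaN!

So the error means the training loss itself became non-finite; the traceback only shows where it was detected, not where it was produced.
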
htyao89 commented 1 month ago

(quoted the original post and script above)

Have you tried adjusting the EPS in the optimizer?

[screenshot attached]
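
For reference, a larger eps in Adam keeps the denominator of the adaptive update away from zero, which is a common remedy when training produces NaNs. A minimal sketch in plain PyTorch (the module and learning rate below are placeholders, not this repo's actual values):

import torch
from torch import nn

# Placeholder module standing in for the learnable prompt parameters.
prompt_learner = nn.Linear(512, 102)

# Adam's default eps is 1e-8; raising it (e.g. to 1e-3) keeps the adaptive
# denominator away from zero when second-moment estimates become tiny.
optimizer = torch.optim.Adam(prompt_learner.parameters(), lr=2e-3, eps=1e-3)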

Lilzhuzixi commented 1 month ago

(quoted the exchange above)

Yes, I have adjusted eps to 1e-3.

htyao89 commented 1 month ago

(quoted the exchange above)

In my experiments, NaN losses have always been caused by the Adam optimizer. Can you provide the log file?
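
Until then, one way to localize where the NaN first appears is PyTorch's anomaly mode, optionally combined with gradient clipping and skipping non-finite batches. A rough sketch with toy stand-ins for the trainer objects (model, optimizer, and data are placeholders):

import torch
from torch import nn

# Toy stand-ins; in the real run these come from the trainer.
model = nn.Linear(16, 4)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, eps=1e-3)
torch.autograd.set_detect_anomaly(True)  # reports the backward op that first produced NaN/inf

for step in range(10):
    x = torch.randn(8, 16)
    y = torch.randint(0, 4, (8,))
    loss = nn.functional.cross_entropy(model(x), y)
    if not torch.isfinite(loss):
        print(f"step {step}: non-finite loss {loss.item()}, skipping batch")
        continue
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # keep gradients bounded
    optimizer.step()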