Open Lilzhuzixi opened 6 months ago
Dear author, I am an entry-level novice and have some questions I would like to ask you. I modified the .sh file to run on oxford_flowers, but after I started training, the following error was reported. I am looking forward to your reply.

Traceback (most recent call last):
  File "train.py", line 238, in <module>
    main(args)
  File "train.py", line 165, in main
    trainer.train()
  File "d:\pycharmprojects\coop\coop-main\dassl.pytorch\dassl\engine\trainer.py", line 393, in train
    super().train(self.start_epoch, self.max_epoch)
  File "d:\pycharmprojects\coop\coop-main\dassl.pytorch\dassl\engine\trainer.py", line 256, in train
    self.run_epoch()
  File "d:\pycharmprojects\coop\coop-main\dassl.pytorch\dassl\engine\trainer.py", line 603, in run_epoch
    loss_summary = self.forward_backward(batch)
  File "D:\PycharmProjects\Textual-based_Class-aware_prompt_tuning-main\trainers\tcp.py", line 328, in forward_backward
    self.model_backward_and_update(loss)
  File "d:\pycharmprojects\coop\coop-main\dassl.pytorch\dassl\engine\trainer.py", line 308, in model_backward_and_update
    self.model_backward(loss)
  File "d:\pycharmprojects\coop\coop-main\dassl.pytorch\dassl\engine\trainer.py", line 297, in model_backward
    self.detect_anomaly(loss)
  File "d:\pycharmprojects\coop\coop-main\dassl.pytorch\dassl\engine\trainer.py", line 229, in detect_anomaly
    raise FloatingPointError("Loss is infinite or NaN!")
FloatingPointError: Loss is infinite or NaN!
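(For reference, the check that raises this error in dassl/engine/trainer.py is roughly the following; this is a sketch based on the traceback above, not the exact source.)

import torch

def detect_anomaly(loss: torch.Tensor) -> None:
    # The trainer aborts as soon as the loss tensor contains inf or NaN.
    if not torch.isfinite(loss).all():
        raise FloatingPointError("Loss is infinite or NaN!")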
This is base2new_train_flowers.sh.
#!/bin/bash

# custom config
DATA=DATA
TRAINER=TCP
WEIGHT=1.0
CFG=vit_b16_ep100_ctxv1
CTP=end    # class token position (end or middle)
NCTX=4     # number of context tokens
SHOTS=16   # number of shots (1, 2, 4, 8, 16)
CSC=False  # class-specific context (False or True)
FOLDER=output_flowers

for SEED in 1 2 3
do
    DIR=${FOLDER}_${NCTX}/base2new/train_base/oxford_flowers/shots_${SHOTS}_${WEIGHT}/${TRAINER}/${CFG}/seed${SEED}
    if [ -d "$DIR" ]; then
        echo "Results are available in ${DIR}. Skip this job"
    else
        echo "Run this job and save the output to ${DIR}"
        set CUDA_VISIBLE_DEVICES=0
        python train.py \
        --root ${DATA} \
        --seed ${SEED} \
        --trainer ${TRAINER} \
        --dataset-config-file configs/datasets/oxford_flowers.yaml \
        --config-file configs/trainers/${TRAINER}/${CFG}.yaml \
        --output-dir ${DIR} \
        TRAINER.COOP.N_CTX ${NCTX} \
        TRAINER.COOP.CSC ${CSC} \
        TRAINER.COOP.W ${WEIGHT} \
        TRAINER.COOP.CLASS_TOKEN_POSITION ${CTP} \
        DATASET.NUM_SHOTS ${SHOTS} \
        DATASET.SUBSAMPLE_CLASSES base
    fi
done
Did you adjust the EPS in the optimizer?
Yes, I have adjusted eps to 1e-3.
In my experiments, a NaN loss was always caused by the Adam optimizer. Can you provide the log file?
I ran into the same problem even though I modified the eps. The log file has not been generated.
Traceback (most recent call last):
  File "D:\project\deeplearning\Textual-based_Class-aware_prompt_tuning-main\train.py", line 238, in <module>
I found that the code added at line 81 was not being executed, so I added eps=1e-3 below line 88, and now it runs through.
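In case it helps others, here is a minimal sketch of what that change amounts to, assuming the optimizer in question is torch.optim.Adam; the variable names and learning rate below are illustrative, not the repository's exact source:

import torch
import torch.nn as nn

# Illustrative stand-in for the learnable prompt parameters.
prompt_learner = nn.Linear(512, 512)

# The key change discussed above: pass a larger eps to Adam.
# With fp16 weights, the default eps=1e-8 is too small and the update
# can blow up, which shows up as "Loss is infinite or NaN!".
optimizer = torch.optim.Adam(
    prompt_learner.parameters(),
    lr=2e-3,    # illustrative learning rate
    eps=1e-3,   # the value mentioned in this thread
)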
That's great! Thank you very much. I will also try adding this code when I get back and see if it works.