Closed ACGhanchi closed 1 hour ago
I want to try 100 seen fonts and 5 unseen fonts, with 2,500 characters per font. The only thing I changed is `--output_k` in 02a_run_ddp.sh:

mkdir output
mkdir output/models
CUDA_VISIBLE_DEVICES=0 python -m torch.distributed.launch \
    --nproc_per_node=1 --use_env main.py \
    --img_size 80 \
    --data_path data/imgs/Seen240_S80F50_TRAIN800 \
    --lr 1e-4 \
    --output_k 100 \
    --batch_size 16 \
    --iters 1000 \
    --epoch 200 \
    --val_num 10 \
    --baseline_idx 0 \
    --save_path output/models \
    --model_name B0_K240BS32I1000E200_LR1e-4-wdl0.01 \
    --ddp \
    --wdl --w_wdl 0.01 \
    --no_val
--load_model CF-Font/output/models/logs/B0_K240BS32I1000E200_LR1e-4-wdl0.01_20230426-233306

But running `sh scripts/02a_run_ddp.sh` fails with:

RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasSgemmStridedBatched( handle, opa, opb, m, n, k, &alpha, a, lda, stridea, b, ldb, strideb, &beta, c, ldc, stridec, num_batches)

I don't know what to set or how to fix this (the GPU has 24 GB of memory, and PyTorch is 1.8.0).
It looks like `--ddp` is the problem: after commenting it out, the script runs.
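Separately from `--ddp`, the command sets `--output_k 100` while `--data_path` still points at `Seen240_S80F50_TRAIN800`. If that folder really holds 240 fonts (an assumption based only on its name), a font label outside the range the model expects is one possible source of opaque CUDA errors, so it may be worth ruling out. Below is a minimal sketch of such a check. It assumes a layout of one subdirectory per font under the data path; the helper names `count_font_classes` and `check_output_k` are hypothetical and not part of the repository.

```python
import os
import tempfile


def count_font_classes(data_path):
    """Count immediate subdirectories, ASSUMING one subdirectory per font."""
    return sum(
        1
        for name in os.listdir(data_path)
        if os.path.isdir(os.path.join(data_path, name))
    )


def check_output_k(data_path, output_k):
    """Return (ok, n_fonts); ok is False when the data holds more fonts than output_k."""
    n_fonts = count_font_classes(data_path)
    return n_fonts <= output_k, n_fonts


if __name__ == "__main__":
    # Demo on a temporary folder rather than the real dataset:
    # 240 font folders checked against output_k=100.
    with tempfile.TemporaryDirectory() as root:
        for i in range(240):
            os.mkdir(os.path.join(root, f"id_{i}"))
        ok, n_fonts = check_output_k(root, output_k=100)
        if not ok:
            print(f"{n_fonts} font folders found, but output_k is 100")
```

To use it on the real data, call `check_output_k("data/imgs/Seen240_S80F50_TRAIN800", 100)` from the repository root. Because CUDA errors are reported asynchronously, rerunning once with the environment variable `CUDA_LAUNCH_BLOCKING=1` can also help show which operation actually fails.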