Training error - GitHub Issues

deep-practice commented 2 months ago

Traceback (most recent call last):
  File "train_v2.py", line 267, in <module>
    main(parser.parse_args())
  File "train_v2.py", line 185, in main
    img, local_labels = adversarial_img_warping(backbone=backbone,
  File "/data/work/project/ARoFace/AdvWarp.py", line 86, in adversarial_img_warping
    train_img = torch.cat((img[idx1], updated_img[idx2]), dim=0)
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
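Note for anyone hitting the same error: the message itself warns that CUDA kernel errors are reported asynchronously, so the torch.cat line in the traceback may not be the real culprit. A minimal sketch of forcing synchronous kernel launches is below. CUDA_LAUNCH_BLOCKING is a standard CUDA environment variable; the helper name enable_blocking_cuda_launches is hypothetical and not part of ARoFace.

```python
import os


def enable_blocking_cuda_launches(env=None):
    """Set CUDA_LAUNCH_BLOCKING=1 so kernel errors surface at the failing call.

    Hypothetical helper for illustration; `env` defaults to the process
    environment and can be any dict-like mapping.
    """
    env = os.environ if env is None else env
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


# Must run before CUDA is initialized, i.e. before the first CUDA call
# in the training script (safest: before `import torch`).
enable_blocking_cuda_launches()
```

Exporting CUDA_LAUNCH_BLOCKING=1 in the shell before launching training should have the same effect. Training runs slower with this setting, so it is meant for debugging only.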

msed-Ebrahimi commented 2 months ago

Hi, sorry for the confusion. In train_v2.py, line 106, please comment out:

backbone._set_static_graph()

Also, please make sure to use proper CUDA configurations.

deep-practice commented 2 months ago

After commenting out "backbone._set_static_graph()", it still failed:

Training: 2024-07-27 21:52:33,756-Speed 218.09 samples/sec Loss nan LearningRate 0.010000 Epoch: 0 Global Step: 150 Fp16 Grad Scale: 256 Required: 1658 hours
Training: 2024-07-27 21:52:45,407-Speed 219.74 samples/sec Loss 44.0782 LearningRate 0.010000 Epoch: 0 Global Step: 160 Fp16 Grad Scale: 128 Required: 1631 hours
Training: 2024-07-27 21:52:57,051-Speed 219.87 samples/sec Loss 44.1238 LearningRate 0.010000 Epoch: 0 Global Step: 170 Fp16 Grad Scale: 128 Required: 1601 hours
Traceback (most recent call last):
  File "train_v2.py", line 267, in <module>
    main(parser.parse_args())
  File "train_v2.py", line 185, in main
    img, local_labels = adversarial_img_warping(backbone=backbone,
  File "/data/work/project/ARoFace/AdvWarp.py", line 86, in adversarial_img_warping
    train_img = torch.cat((img[idx1], updated_img[idx2]), dim=0)
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
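Side note on the log above: the loss is nan at global step 150, and the fp16 grad scale drops from 256 to 128 right after. That is consistent with a mixed-precision scaler skipping an overflowed step, but it may be worth tracking how often it happens. A minimal sketch, assuming the loss can be converted to a Python float (for a PyTorch tensor, loss.item()); the helper name is_finite_loss is hypothetical and the values are copied from the log:

```python
import math


def is_finite_loss(loss_value):
    """Return True if the loss is a finite number (not nan or inf)."""
    return math.isfinite(float(loss_value))


# (global step, loss) pairs taken from the three log lines above.
logged = [(150, float("nan")), (160, 44.0782), (170, 44.1238)]

# Collect the steps whose loss was not finite.
nonfinite_steps = [step for step, loss in logged if not is_finite_loss(loss)]
print(nonfinite_steps)  # [150]
```

An occasional non-finite step early in fp16 training is not necessarily a problem. Whether it is related to the illegal memory access is not established in this thread.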

msed-Ebrahimi commented 2 months ago

Which PyTorch and CUDA versions are you using?

deep-practice commented 2 months ago

NVIDIA-SMI 510.108.03, Driver Version: 510.108.03, CUDA Version: 11.6
PyTorch: 1.13.1+cu116
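For reference, the "CUDA Version" printed by nvidia-smi is the highest version the driver supports, while the CUDA version PyTorch was built against is exposed as torch.version.cuda. A minimal sketch for collecting both sides in one report; the function name describe_environment is hypothetical, and the torch import is optional so the snippet also runs where PyTorch is not installed:

```python
import platform


def describe_environment():
    """Collect the Python, PyTorch, and CUDA build versions for a bug report."""
    info = {"python": platform.python_version()}
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this interpreter.
        info["torch"] = None
        return info
    info["torch"] = torch.__version__
    # CUDA toolkit version the installed PyTorch build was compiled against
    # (None for CPU-only builds).
    info["torch_cuda_build"] = torch.version.cuda
    info["cuda_available"] = torch.cuda.is_available()
    if info["cuda_available"]:
        info["gpu"] = torch.cuda.get_device_name(0)
    return info


print(describe_environment())
```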

deep-practice commented 2 months ago

Besides, I can train normally on this machine using InsightFace.

msed-Ebrahimi commented 2 months ago

> Besides, I can train normally on this machine using InsightFace.

Please try the following settings: Python 3.7, PyTorch 1.8, CUDA 11.1.

msed-Ebrahimi / ARoFace

Training error #2