lightvector / KataGo

GTP engine and self-play learning in Go
https://katagotraining.org/

FATAL ERROR: Failed test assert: buf.result->policyProbs[buf.result->getPos(Location::ofString("E16",board),board)] >= 0.95 #908

Open ntkylin opened 3 months ago

ntkylin commented 3 months ago

Hi, the KataGo contribute program crashed on my Windows 11 22H2 system after about 2700 training/rating games, with the following error:

FATAL ERROR:
Failed test assert: buf.result->policyProbs[buf.result->getPos(Location::ofString("E16",board),board)] >= 0.95
file: C:\Data\Data\Coding\Python\KataGo\cpp\tests\testnnevalcanary.cpp
line: 39

Unfortunately, I cannot find that .cpp file at the path shown above. Do you know what's wrong?

The GPU is an AMD 7900GRE with driver version Adrenalin 23.7.2 (WHQL Recommended). The log file is shown below:

2024-03-04 10:56:34+0800: KataGo v1.14.0
2024-03-04 10:56:34+0800: Git revision: c6de1bbda837a0717eaeca46102f7326ed0da0d4
2024-03-04 10:56:34+0800: Running tiny net to sanity-check that GPU is working
2024-03-04 10:56:34+0800: nnRandSeed0 = 12039840580933599473
2024-03-04 10:56:34+0800: After dedups: nnModelFile0 = katago_contribute/kata1/tmpTinyModel_50A5C2126CEFD39F.bin.gz useFP16 auto useNHWC auto
2024-03-04 10:56:34+0800: Initializing neural net buffer to be size 19 * 19 allowing smaller boards
2024-03-04 10:56:35+0800: Found OpenCL Platform 0: AMD Accelerated Parallel Processing (Advanced Micro Devices, Inc.) (OpenCL 2.1 AMD-APP (3570.0))
2024-03-04 10:56:35+0800: Found 2 device(s) on platform 0 with type CPU or GPU or Accelerator
2024-03-04 10:56:35+0800: Found OpenCL Device 0: gfx1100 (Advanced Micro Devices, Inc.) (score 11000200)
2024-03-04 10:56:35+0800: Found OpenCL Device 1: gfx90c (Advanced Micro Devices, Inc.) (score 11000200)
2024-03-04 10:56:35+0800: Creating context for OpenCL Platform: AMD Accelerated Parallel Processing (Advanced Micro Devices, Inc.) (OpenCL 2.1 AMD-APP (3570.0))
2024-03-04 10:56:35+0800: Using OpenCL Device 0: gfx1100 (Advanced Micro Devices, Inc.) OpenCL 2.0 AMD-APP (3570.0) (Extensions: cl_khr_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_fp16 cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_media_ops cl_amd_media_ops2 cl_khr_d3d10_sharing cl_khr_d3d11_sharing cl_khr_dx9_media_sharing cl_khr_image2d_from_buffer cl_khr_subgroups cl_khr_gl_event cl_khr_depth_images cl_khr_mipmap_image cl_khr_mipmap_image_writes cl_amd_copy_buffer_p2p cl_amd_planar_yuv)
2024-03-04 10:56:35+0800: Loaded tuning parameters from: D:\katago-v1.14.0-opencl-windows-x64/KataGoData/opencltuning/tune11_gpugfx1100_x19_y19_c16_mv9.txt
2024-03-04 10:56:40+0800: OpenCL backend thread 0: Model version 9
2024-03-04 10:56:40+0800: OpenCL backend thread 0: Model name: rect15-b2c16-s13679744-d94886722
2024-03-04 10:56:40+0800: OpenCL backend thread 0: FP16Storage false FP16Compute false FP16TensorCores false FP16TensorCoresFor1x1 false
2024-03-04 10:56:40+0800: nnRandSeed0 = 16268540014641288193
2024-03-04 10:56:40+0800: After dedups: nnModelFile0 = katago_contribute/kata1/tmpTinyMishModel_BE826E02F2693DD5.bin.gz useFP16 auto useNHWC auto
2024-03-04 10:56:40+0800: Initializing neural net buffer to be size 19 * 19 allowing smaller boards
2024-03-04 10:56:40+0800: Found OpenCL Platform 0: AMD Accelerated Parallel Processing (Advanced Micro Devices, Inc.) (OpenCL 2.1 AMD-APP (3570.0))
2024-03-04 10:56:40+0800: Found 2 device(s) on platform 0 with type CPU or GPU or Accelerator
2024-03-04 10:56:40+0800: Found OpenCL Device 0: gfx1100 (Advanced Micro Devices, Inc.) (score 11000200)
2024-03-04 10:56:40+0800: Found OpenCL Device 1: gfx90c (Advanced Micro Devices, Inc.) (score 11000200)
2024-03-04 10:56:40+0800: Creating context for OpenCL Platform: AMD Accelerated Parallel Processing (Advanced Micro Devices, Inc.) (OpenCL 2.1 AMD-APP (3570.0))
2024-03-04 10:56:40+0800: Using OpenCL Device 0: gfx1100 (Advanced Micro Devices, Inc.) OpenCL 2.0 AMD-APP (3570.0) (Extensions: cl_khr_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_fp16 cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_media_ops cl_amd_media_ops2 cl_khr_d3d10_sharing cl_khr_d3d11_sharing cl_khr_dx9_media_sharing cl_khr_image2d_from_buffer cl_khr_subgroups cl_khr_gl_event cl_khr_depth_images cl_khr_mipmap_image cl_khr_mipmap_image_writes cl_amd_copy_buffer_p2p cl_amd_planar_yuv)
2024-03-04 10:56:40+0800: Loaded tuning parameters from: D:\katago-v1.14.0-opencl-windows-x64/KataGoData/opencltuning/tune11_gpugfx1100_x19_y19_c6_mv11.txt
2024-03-04 10:56:44+0800: OpenCL backend thread 0: Model version 11
2024-03-04 10:56:44+0800: OpenCL backend thread 0: Model name: b1c6nbt
2024-03-04 10:56:44+0800: OpenCL backend thread 0: FP16Storage true FP16Compute true FP16TensorCores false FP16TensorCoresFor1x1 false
2024-03-04 10:56:44+0800: GPU -1 finishing, processed 41 rows 21 batches
2024-03-04 10:56:44+0800: nnRandSeed0 = 4382101984213418997
2024-03-04 10:56:44+0800: After dedups: nnModelFile0 = katago_contribute/kata1/tmpTinyMishModel_8C21A42B419119F2.bin.gz useFP16 auto useNHWC auto
2024-03-04 10:56:44+0800: Initializing neural net buffer to be size 19 * 19 allowing smaller boards
2024-03-04 10:56:44+0800: Found OpenCL Platform 0: AMD Accelerated Parallel Processing (Advanced Micro Devices, Inc.) (OpenCL 2.1 AMD-APP (3570.0))
2024-03-04 10:56:44+0800: Found 2 device(s) on platform 0 with type CPU or GPU or Accelerator
2024-03-04 10:56:44+0800: Found OpenCL Device 0: gfx1100 (Advanced Micro Devices, Inc.) (score 11000200)
2024-03-04 10:56:44+0800: Found OpenCL Device 1: gfx90c (Advanced Micro Devices, Inc.) (score 11000200)
2024-03-04 10:56:44+0800: Creating context for OpenCL Platform: AMD Accelerated Parallel Processing (Advanced Micro Devices, Inc.) (OpenCL 2.1 AMD-APP (3570.0))
2024-03-04 10:56:44+0800: Using OpenCL Device 0: gfx1100 (Advanced Micro Devices, Inc.) OpenCL 2.0 AMD-APP (3570.0) (Extensions: cl_khr_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_fp16 cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_media_ops cl_amd_media_ops2 cl_khr_d3d10_sharing cl_khr_d3d11_sharing cl_khr_dx9_media_sharing cl_khr_image2d_from_buffer cl_khr_subgroups cl_khr_gl_event cl_khr_depth_images cl_khr_mipmap_image cl_khr_mipmap_image_writes cl_amd_copy_buffer_p2p cl_amd_planar_yuv)
2024-03-04 10:56:44+0800: Loaded tuning parameters from: D:\katago-v1.14.0-opencl-windows-x64/KataGoData/opencltuning/tune11_gpugfx1100_x19_y19_c6_mv11.txt
2024-03-04 10:56:49+0800: OpenCL backend thread 0: Model version 11
2024-03-04 10:56:49+0800: OpenCL backend thread 0: Model name: b1c6nbt
2024-03-04 10:56:49+0800: OpenCL backend thread 0: FP16Storage true FP16Compute true FP16TensorCores false FP16TensorCoresFor1x1 false
2024-03-04 10:56:49+0800: GPU -1 finishing, processed 41 rows 21 batches
2024-03-04 10:56:49+0800: Tiny net sanity check complete
2024-03-04 10:56:49+0800: GPU -1 finishing, processed 41 rows 21 batches
2024-03-04 10:56:49+0800: Performing autotuning for ALL neural net configurations needed for the run!
2024-03-04 10:56:49+0800: *** If this has not already been done, it may take some time, please be patient ***
2024-03-04 10:56:49+0800: Found OpenCL Platform 0: AMD Accelerated Parallel Processing (Advanced Micro Devices, Inc.) (OpenCL 2.1 AMD-APP (3570.0))
2024-03-04 10:56:49+0800: Found 2 device(s) on platform 0 with type CPU or GPU or Accelerator
2024-03-04 10:56:49+0800: Found OpenCL Device 0: gfx1100 (Advanced Micro Devices, Inc.) (score 11000200)
2024-03-04 10:56:49+0800: Found OpenCL Device 1: gfx90c (Advanced Micro Devices, Inc.) (score 11000200)
2024-03-04 10:56:49+0800: Creating context for OpenCL Platform: AMD Accelerated Parallel Processing (Advanced Micro Devices, Inc.) (OpenCL 2.1 AMD-APP (3570.0))
2024-03-04 10:56:49+0800: Using OpenCL Device 0: gfx1100 (Advanced Micro Devices, Inc.) OpenCL 2.0 AMD-APP (3570.0) (Extensions: cl_khr_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_fp16 cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_media_ops cl_amd_media_ops2 cl_khr_d3d10_sharing cl_khr_d3d11_sharing cl_khr_dx9_media_sharing cl_khr_image2d_from_buffer cl_khr_subgroups cl_khr_gl_event cl_khr_depth_images cl_khr_mipmap_image cl_khr_mipmap_image_writes cl_amd_copy_buffer_p2p cl_amd_planar_yuv)
2024-03-04 10:56:49+0800: Loaded tuning parameters from: D:\katago-v1.14.0-opencl-windows-x64/KataGoData/opencltuning/tune11_gpugfx1100_x19_y19_c96_mv8.txt
2024-03-04 10:56:49+0800: Loaded tuning parameters from: D:\katago-v1.14.0-opencl-windows-x64/KataGoData/opencltuning/tune11_gpugfx1100_x19_y19_c128_mv8.txt
2024-03-04 10:56:49+0800: Loaded tuning parameters from: D:\katago-v1.14.0-opencl-windows-x64/KataGoData/opencltuning/tune11_gpugfx1100_x19_y19_c192_mv8.txt
2024-03-04 10:56:49+0800: Loaded tuning parameters from: D:\katago-v1.14.0-opencl-windows-x64/KataGoData/opencltuning/tune11_gpugfx1100_x19_y19_c256_mv8.txt
2024-03-04 10:56:49+0800: Loaded tuning parameters from: D:\katago-v1.14.0-opencl-windows-x64/KataGoData/opencltuning/tune11_gpugfx1100_x19_y19_c256_mv10.txt
2024-03-04 10:56:49+0800: Loaded tuning parameters from: D:\katago-v1.14.0-opencl-windows-x64/KataGoData/opencltuning/tune11_gpugfx1100_x19_y19_c320_mv10.txt
2024-03-04 10:56:49+0800: Loaded tuning parameters from: D:\katago-v1.14.0-opencl-windows-x64/KataGoData/opencltuning/tune11_gpugfx1100_x19_y19_c384_mv11.txt
2024-03-04 10:56:49+0800: Loaded tuning parameters from: D:\katago-v1.14.0-opencl-windows-x64/KataGoData/opencltuning/tune11_gpugfx1100_x19_y19_c384_mv14.txt
2024-03-04 10:56:49+0800: Loaded tuning parameters from: D:\katago-v1.14.0-opencl-windows-x64/KataGoData/opencltuning/tune11_gpugfx1100_x19_y19_c512_mv15.txt
2024-03-04 10:56:49+0800: All neural net configs autotuned
2024-03-04 10:56:49+0800: --------
2024-03-04 10:56:49+0800: Type 'pause' and hit enter to pause contribute and CPU and GPU usage.
2024-03-04 10:56:49+0800: Type 'quit' and hit enter to begin shutdown, quitting after all current games are done (may take a long while).
2024-03-04 10:56:49+0800: Type 'forcequit' and hit enter to shutdown and quit more quickly, but lose all unfinished game data.
2024-03-04 10:56:49+0800: --------
2024-03-04 10:57:12+0800: Number of nets loaded: selfplay 0 rating 0
2024-03-04 10:57:13+0800: Found new neural net kata1-b18c384nbt-s9462441216-d4173921862
2024-03-04 10:57:14+0800: nnRandSeed0 = 12755863227358558861
2024-03-04 10:57:14+0800: After dedups: nnModelFile0 = katago_contribute/kata1/models/kata1-b18c384nbt-s9462441216-d4173921862.bin.gz useFP16 auto useNHWC auto
2024-03-04 10:57:14+0800: Initializing neural net buffer to be size 19 * 19 allowing smaller boards
2024-03-04 10:57:15+0800: Found OpenCL Platform 0: AMD Accelerated Parallel Processing (Advanced Micro Devices, Inc.) (OpenCL 2.1 AMD-APP (3570.0))
2024-03-04 10:57:15+0800: Found 2 device(s) on platform 0 with type CPU or GPU or Accelerator
2024-03-04 10:57:15+0800: Found OpenCL Device 0: gfx1100 (Advanced Micro Devices, Inc.) (score 11000200)
2024-03-04 10:57:15+0800: Found OpenCL Device 1: gfx90c (Advanced Micro Devices, Inc.) (score 11000200)
2024-03-04 10:57:15+0800: Creating context for OpenCL Platform: AMD Accelerated Parallel Processing (Advanced Micro Devices, Inc.) (OpenCL 2.1 AMD-APP (3570.0))
2024-03-04 10:57:15+0800: Using OpenCL Device 0: gfx1100 (Advanced Micro Devices, Inc.) OpenCL 2.0 AMD-APP (3570.0) (Extensions: cl_khr_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_fp16 cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_media_ops cl_amd_media_ops2 cl_khr_d3d10_sharing cl_khr_d3d11_sharing cl_khr_dx9_media_sharing cl_khr_image2d_from_buffer cl_khr_subgroups cl_khr_gl_event cl_khr_depth_images cl_khr_mipmap_image cl_khr_mipmap_image_writes cl_amd_copy_buffer_p2p cl_amd_planar_yuv)
2024-03-04 10:57:15+0800: Loaded tuning parameters from: D:\katago-v1.14.0-opencl-windows-x64/KataGoData/opencltuning/tune11_gpugfx1100_x19_y19_c384_mv14.txt
2024-03-04 10:57:19+0800: Maybe predownloading model...
2024-03-04 10:57:20+0800: OpenCL backend thread 0: Model version 14
2024-03-04 10:57:20+0800: OpenCL backend thread 0: Model name: kata1-b18c384nbt-s9462441216-d4173921862
2024-03-04 10:57:21+0800: OpenCL backend thread 0: FP16Storage true FP16Compute true FP16TensorCores false FP16TensorCoresFor1x1 false
2024-03-04 10:57:21+0800: Loaded latest neural net kata1-b18c384nbt-s9462441216-d4173921862 from: katago_contribute/kata1/models/kata1-b18c384nbt-s9462441216-d4173921862.bin.gz
2024-03-04 10:57:21+0800: nnRandSeed0 = 14097770897088673486
2024-03-04 10:57:21+0800: After dedups: nnModelFile0 = katago_contribute/kata1/models/kata1-b18c384nbt-s9462441216-d4173921862.bin.gz useFP16 auto useNHWC auto
2024-03-04 10:57:21+0800: Initializing neural net buffer to be size 19 * 19 allowing smaller boards
2024-03-04 10:57:22+0800: Found OpenCL Platform 0: AMD Accelerated Parallel Processing (Advanced Micro Devices, Inc.) (OpenCL 2.1 AMD-APP (3570.0))
2024-03-04 10:57:22+0800: Found 2 device(s) on platform 0 with type CPU or GPU or Accelerator
2024-03-04 10:57:22+0800: Found OpenCL Device 0: gfx1100 (Advanced Micro Devices, Inc.) (score 11000200)
2024-03-04 10:57:22+0800: Found OpenCL Device 1: gfx90c (Advanced Micro Devices, Inc.) (score 11000200)
2024-03-04 10:57:22+0800: Creating context for OpenCL Platform: AMD Accelerated Parallel Processing (Advanced Micro Devices, Inc.) (OpenCL 2.1 AMD-APP (3570.0))
2024-03-04 10:57:22+0800: Using OpenCL Device 0: gfx1100 (Advanced Micro Devices, Inc.) OpenCL 2.0 AMD-APP (3570.0) (Extensions: cl_khr_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_fp16 cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_media_ops cl_amd_media_ops2 cl_khr_d3d10_sharing cl_khr_d3d11_sharing cl_khr_dx9_media_sharing cl_khr_image2d_from_buffer cl_khr_subgroups cl_khr_gl_event cl_khr_depth_images cl_khr_mipmap_image cl_khr_mipmap_image_writes cl_amd_copy_buffer_p2p cl_amd_planar_yuv)
2024-03-04 10:57:22+0800: Loaded tuning parameters from: D:\katago-v1.14.0-opencl-windows-x64/KataGoData/opencltuning/tune11_gpugfx1100_x19_y19_c384_mv14.txt
2024-03-04 10:57:27+0800: OpenCL backend thread 0: Model version 14
2024-03-04 10:57:27+0800: OpenCL backend thread 0: Model name: kata1-b18c384nbt-s9462441216-d4173921862
2024-03-04 10:57:28+0800: OpenCL backend thread 0: FP16Storage false FP16Compute false FP16TensorCores false FP16TensorCoresFor1x1 false
2024-03-04 10:57:28+0800: Testing loaded net
2024-03-04 10:57:50+0800: Using unbatched fp32 as the reference values
2024-03-04 10:57:51+0800: Warning: large FP16 errors, using FP32 instead
2024-03-04 10:57:51+0800: GPU -1 finishing, processed 1338 rows 1004 batches
2024-03-04 10:57:51+0800: Loaded new neural net kata1-b18c384nbt-s9462441216-d4173921862
2024-03-04 10:57:51+0800: Starting game 0 (training) (kata1-b18c384nbt-s9462441216-d4173921862)
2024-03-04 10:57:51+0800: Starting game 1 (training) (kata1-b18c384nbt-s9462441216-d4173921862)
...
2024-03-04 11:00:58+0800: Starting game 34 (training) (kata1-b18c384nbt-s9462441216-d4173921862)
2024-03-04 11:00:58+0800: Starting game 35 (training) (kata1-b18c384nbt-s9462441216-d4173921862)
2024-03-04 11:01:22+0800: Number of nets loaded: selfplay 1 rating 0
2024-03-04 11:01:22+0800: Performance: in the last 141.3 seconds, played 223 moves (1.6/sec) and 88397 nn evals (625.682151/sec)
2024-03-04 11:12:15+0800: Finished game 8 (training), uploaded sgf katago_contribute/kata1/sgfs/kata1-b18c384nbt-s9462441216-d4173921862/173FCA854092C95F.sgf and training data katago_contribute/kata1/tdata/kata1-b18c384nbt-s9462441216-d4173921862/B6B2D011A8150E70.npz (22 rows)
2024-03-04 11:12:15+0800: Starting game 36 (training) (kata1-b18c384nbt-s9462441216-d4173921862)
2024-03-04 11:12:15+0800: Performance: in the last 652.9 seconds, played 1687 moves (2.6/sec) and 443344 nn evals (678.999658/sec)
2024-03-04 11:16:11+0800: Finished game 6 (training), uploaded sgf katago_contribute/kata1/sgfs/kata1-b18c384nbt-s9462441216-d4173921862/DAEF60F45484BF39.sgf and training data katago_contribute/kata1/tdata/kata1-b18c384nbt-s9462441216-d4173921862/DEAB8ABD8F59D129.npz (24 rows)
......
2024-03-08 03:02:15+0800: Starting game 2681 (training) (kata1-b18c384nbt-s9492280320-d4181591514)
2024-03-08 03:02:15+0800: Performance: in the last 156.6 seconds, played 247 moves (1.6/sec) and 102159 nn evals (652.318832/sec)
2024-03-08 03:02:54+0800: Finished game 2635 (training), uploaded sgf katago_contribute/kata1/sgfs/kata1-b18c384nbt-s9492280320-d4181591514/02EDCAEB7604D305.sgf and training data katago_contribute/kata1/tdata/kata1-b18c384nbt-s9492280320-d4181591514/B29C186AD102145F.npz (68 rows)
2024-03-08 03:02:54+0800: Starting game 2682 (training) (kata1-b18c384nbt-s9492280320-d4181591514)
2024-03-08 03:03:16+0800: Number of nets loaded: selfplay 1 rating 0

ntkylin commented 3 months ago

Additionally, when the crash happened, the GPU appeared to be fully loaded, with its cooling fan running at full speed. After a reboot, the system could not start correctly and only showed the BIOS interface. After I shut it down manually and restarted, it booted into Windows, but the driver for GPU2 (the 7900GRE) was gone.

lightvector commented 3 months ago

Thanks for the report and thanks for the contributions! Unfortunately, it sounds like your GPU may have malfunctioned. The path mentioned in the error is the path to the source code on my machine, included to help with debugging. That is expected, since that's where the binary was compiled, so the file won't exist on your system. The error indicates that your GPU (possibly due to overheating, or just a random unpredictable failure) might have started to return incorrect numbers during its computation.
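For readers unfamiliar with this kind of check, here is a minimal Python sketch of the idea behind a "canary" assertion like the one in the error message. It is not KataGo's actual code. The function name `check_canary`, its parameters, and the list-of-floats input are all hypothetical, and only the 0.95 threshold is taken from the failed assert above. The idea is to evaluate a position whose best move is known in advance and to fail loudly if the network output no longer puts nearly all of its policy probability on that move.

```python
import math


def check_canary(policy_probs, expected_index, threshold=0.95):
    """Hypothetical canary check: raise if a known-good answer looks wrong.

    policy_probs   -- list of floats returned by the evaluator (assumed input)
    expected_index -- index of the move that should dominate the policy
    threshold      -- minimum acceptable probability (0.95, as in the assert)
    """
    # A misbehaving device can return NaN or infinity, so reject those first.
    if not all(math.isfinite(p) for p in policy_probs):
        raise RuntimeError("canary failed: non-finite policy value")

    prob = policy_probs[expected_index]
    if prob < threshold:
        raise RuntimeError(
            f"canary failed: expected move has probability {prob:.3f}, "
            f"below threshold {threshold}"
        )
    return prob
```

With healthy output such as `check_canary([0.01, 0.97, 0.02], 1)` the call returns the probability, while an output like `[0.5, 0.5]` raises an error. A failure of this kind says the numbers are wrong, not why they are wrong.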

I'm really sorry for your trouble. If you get your system working again, I would recommend against contributing further using that machine, both to avoid stressing the GPU and because, if the GPU starts returning incorrect values after running long enough, it may produce low-quality data that isn't useful for training anyway.
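If it helps when reviewing a long contribute log, the following is a rough Python sketch for pulling out lines that may deserve a closer look. The three patterns are copied from messages that appear in this thread. The function name `find_warning_lines` is hypothetical, and whether any of these lines indicates failing hardware is an assumption: the FP16 warning in particular was logged at startup, long before the crash, and may not point to a hardware fault on its own.

```python
import re

# Patterns taken from messages quoted in this thread; the list is illustrative.
WARNING_PATTERNS = [
    re.compile(r"FATAL ERROR"),
    re.compile(r"Failed test assert"),
    re.compile(r"Warning: large FP16 errors"),
]


def find_warning_lines(log_text):
    """Return (line_number, line) pairs for log lines matching any pattern."""
    hits = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        if any(pattern.search(line) for pattern in WARNING_PATTERNS):
            hits.append((number, line.strip()))
    return hits
```

Calling `find_warning_lines(open(path).read())` on a saved log, where `path` is wherever the log was written, returns the matching lines with their positions so they can be compared against the timestamps around the crash.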