Hologos opened 2 years ago
Thanks for the correction.
I think there are two possible ways to improve KataGo's performance on macOS.
The first way is straightforward: implement backend functions with low-level primitives such as Accelerate and BNNS, as well as Metal Performance Shaders. However, this approach could be error-prone.
The second way is the Core ML framework: convert a KataGo network to a Core ML model and redirect NeuralNet::getOutput() to the Core ML model. This way could be easier than the first way, but it needs additional work.
I spent some time studying how to do the second way, but I found some difficulties.
First, a KataGo network is distributed as a bin.gz file. However, Core ML Tools needs the original TensorFlow format. I believe this problem could be resolved if KataGo released networks in the original TensorFlow format.
Second, KataGo's Python scripts under python/ seem to be based on TensorFlow 1. However, it is impossible to install TensorFlow 1 on macOS on Apple M1 (see Apple's reply), so it is difficult to use KataGo's Python scripts to do anything on an M1 Mac.
We will very likely need to write a Python script that converts a KataGo network to a Core ML model, so it is important to know how a KataGo network is generated by KataGo's scripts.
Third, I am not sure how to redirect NeuralNet::getOutput() to a Core ML model, because the Core ML framework seems to support only the Objective-C and Swift programming languages. Though I have been able to build KataGo in Xcode, I don't have any experience mixing C++ and Objective-C/Swift files in one project.
A quick update here.
The first way might not utilize the Apple Neural Engine (ANE), so its performance might not be the best.
The second way utilizes the ANE if the model configuration is set to MLComputeUnits.all or .cpuAndNeuralEngine.
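For reference, coremltools exposes the same compute-unit choice from Python when loading a converted model. A minimal sketch (the model path is a placeholder, and the available enum values depend on the coremltools version):

import coremltools as ct

# Ask Core ML to consider every compute unit (CPU, GPU, and the Apple Neural Engine).
# ct.ComputeUnit.CPU_AND_NE would restrict work to the CPU and ANE only.
model = ct.models.MLModel(
    "KataGoModel.mlpackage",           # placeholder path to a converted model
    compute_units=ct.ComputeUnit.ALL,
)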
All of the difficulties I mentioned have been resolved now.
I used an Intel-based MacBook to install TensorFlow 1.15, and I ran KataGo's Python scripts to load a saved model. Then I wrote some Python code to convert the saved model to a Core ML model and shared the model with my MacBook Pro M1.
On the MacBook Pro M1, I created an Xcode project from KataGo. Then I imported the Core ML model into the Xcode project. Finally, I made bridges between C++ and Objective-C/Swift.
This resolved all of the difficulties I met, but I haven't completed the network input/output between C++ and Objective-C/Swift. It looks much easier now, but it could still be too complex to complete in a few days.
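For anyone curious about the conversion step, it looks roughly like the following sketch. The file name, tensor names, and shapes below are only placeholders; the real values depend on the network and on how the TensorFlow 1.x graph was frozen and exported from KataGo's training scripts.

import coremltools as ct

# Hedged sketch: convert a frozen TensorFlow 1.x graph to a Core ML package.
mlmodel = ct.convert(
    "katago_frozen.pb",            # placeholder: frozen TF1 graph exported from the saved model
    source="tensorflow",
    convert_to="mlprogram",        # save as an .mlpackage (ML Program)
    inputs=[
        ct.TensorType(name="bin_inputs", shape=(1, 19, 19, 22)),   # assumed input layout
        ct.TensorType(name="global_inputs", shape=(1, 19)),        # assumed global features
    ],
)
mlmodel.save("KataGoModel.mlpackage")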
I have written an article describing the KataGo CoreML backend on my page.
My test results show that CoreML does not use much GPU; in fact, I see 0% GPU utilization. So I think it is still valuable to make a Metal backend to utilize the GPU on macOS.
@ChinChangYang Would you please commit the KataGoModel.mlpackage file into your branch?
I uploaded KataGoModel.mlpackage to my release page.
@ChinChangYang After building the Xcode project following your page, katago terminates with the following error:
% ./Release/katago gtp -config ../cpp/configs/misc/coreml_example.cfg
libc++abi: terminating with uncaught exception of type StringError: -model MODELFILENAME.bin.gz was not specified to tell KataGo where to find the neural net model, and default was not found at /Users/ohho/.katago/default_model.bin.gz
zsh: abort ./Release/katago gtp -config ../cpp/configs/misc/coreml_example.cfg
Did I miss anything?
For reference, the original katago runs as:
~ % which katago
/opt/homebrew/bin/katago
~ % /opt/homebrew/bin/katago gtp -config $(brew list --verbose katago | grep gtp_example.cfg) -model $(brew list --verbose katago | grep .gz | head -1)
KataGo v1.11.0
Using TrompTaylor rules initially, unless GTP/GUI overrides this
Creating context for OpenCL Platform: Apple (Apple) (OpenCL 1.2 (Jun 17 2022 18:58:24))
Using OpenCL Device 0: Apple M1 (Apple) OpenCL 1.2 (Extensions: cl_APPLE_SetMemObjectDestructor cl_APPLE_ContextLoggingFunctions cl_APPLE_clut cl_APPLE_query_kernel_names cl_APPLE_gl_sharing cl_khr_gl_event cl_khr_byte_addressable_store cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_3d_image_writes cl_khr_image2d_from_buffer cl_khr_depth_images )
Loaded tuning parameters from: /Users/ohho/.katago/opencltuning/tune8_gpuAppleM1_x19_y19_c320_mv8.txt
Initializing board with boardXSize 19 boardYSize 19
Loaded config /opt/homebrew/Cellar/katago/1.11.0/share/katago/configs/gtp_example.cfg
Loaded model /opt/homebrew/Cellar/katago/1.11.0/share/katago/g170-b30c320x2-s4824661760-d1229536699.bin.gz
Model name: g170-b30c320x2-s4824661760-d1229536699
GTP ready, beginning main protocol loop
genmove b
= Q16
The CoreML version of katago does run when a model is specified and numNNServerThreadsPerModel is set to 1. I am not sure whether this is an M1 (MacBook Air) specific setting:
./Release/katago gtp -config ../cpp/configs/misc/coreml_example.cfg -model /opt/homebrew/Cellar/katago/1.11.0/share/katago/g170-b30c320x2-s4824661760-d1229536699.bin.gz
KataGo v1.11.0
Using TrompTaylor rules initially, unless GTP/GUI overrides this
Creating context for OpenCL Platform: Apple (Apple) (OpenCL 1.2 (Jun 17 2022 18:58:24))
Using OpenCL Device 0: Apple M1 (Apple) OpenCL 1.2 (Extensions: cl_APPLE_SetMemObjectDestructor cl_APPLE_ContextLoggingFunctions cl_APPLE_clut cl_APPLE_query_kernel_names cl_APPLE_gl_sharing cl_khr_gl_event cl_khr_byte_addressable_store cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_3d_image_writes cl_khr_image2d_from_buffer cl_khr_depth_images )
Loaded tuning parameters from: /Users/ohho/.katago/opencltuning/tune8_gpuAppleM1_x19_y19_c320_mv8.txt
Initializing board with boardXSize 19 boardYSize 19
Loaded config ../cpp/configs/misc/coreml_example.cfg
Loaded model /opt/homebrew/Cellar/katago/1.11.0/share/katago/g170-b30c320x2-s4824661760-d1229536699.bin.gz
Model name: g170-b30c320x2-s4824661760-d1229536699
GTP ready, beginning main protocol loop
genmove b
= Q16
The following is the output of using the original coreml_example.cfg:
% ./Release/katago gtp -config ../cpp/configs/misc/coreml_example.cfg -model /opt/homebrew/Cellar/katago/1.11.0/share/katago/g170-b30c320x2-s4824661760-d1229536699.bin.gz
KataGo v1.11.0
Using TrompTaylor rules initially, unless GTP/GUI overrides this
libc++abi: terminating with uncaught exception of type StringError: Requested gpuIdx/device 1 was not found, valid devices range from 0 to 0
zsh: abort ./Release/katago gtp -config ../cpp/configs/misc/coreml_example.cfg -model
% ./Release/katago benchmark -config ../cpp/configs/misc/coreml_example.cfg -model /opt/homebrew/Cellar/katago/1.11.0/share/katago/g170-b30c320x2-s4824661760-d1229536699.bin.gz
2022-08-23 16:56:19+0800: Running with following config:
allowResignation = true
lagBuffer = 1.0
logAllGTPCommunication = true
logDir = gtp_logs
logSearchInfo = true
logToStderr = false
maxTimePondering = 60
maxVisits = 500
numNNServerThreadsPerModel = 2
numSearchThreads = 3
openclDeviceToUseThread0 = 0
openclDeviceToUseThread1 = 1
ponderingEnabled = false
resignConsecTurns = 3
resignThreshold = -0.90
rules = tromp-taylor
searchFactorAfterOnePass = 0.50
searchFactorAfterTwoPass = 0.25
searchFactorWhenWinning = 0.40
searchFactorWhenWinningThreshold = 0.95
2022-08-23 16:56:19+0800: Loading model and initializing benchmark...
2022-08-23 16:56:19+0800: Testing with default positions for board size: 19
2022-08-23 16:56:19+0800: nnRandSeed0 = 12640316497824243737
2022-08-23 16:56:19+0800: After dedups: nnModelFile0 = /opt/homebrew/Cellar/katago/1.11.0/share/katago/g170-b30c320x2-s4824661760-d1229536699.bin.gz useFP16 auto useNHWC auto
2022-08-23 16:56:19+0800: Initializing neural net buffer to be size 19 * 19 exactly
2022-08-23 16:56:20+0800: Found OpenCL Platform 0: Apple (Apple) (OpenCL 1.2 (Jun 17 2022 18:58:24))
2022-08-23 16:56:20+0800: Found 1 device(s) on platform 0 with type CPU or GPU or Accelerator
2022-08-23 16:56:20+0800: Found OpenCL Device 0: Apple M1 (Apple) (score 1000102)
libc++abi: terminating with uncaught exception of type StringError: Requested gpuIdx/device 1 was not found, valid devices range from 0 to 0
zsh: abort ./Release/katago benchmark -config ../cpp/configs/misc/coreml_example.cfg
It appears there is only 1 OpenCL device in your Apple M1, but I see 2 OpenCL devices (Intel and Apple) on my Apple M1 Pro. I didn't consider a MacBook that has only one OpenCL device, so the CoreML backend was not tested in this case.
If you really want to go deeper, you will need to look into the source code yourself.
The exception is thrown by the following code (openclhelpers.cpp):
for(size_t i = 0; i < gpuIdxsToUse.size(); i++) {
  int gpuIdx = gpuIdxsToUse[i];
  if(gpuIdx < 0 || gpuIdx >= allDeviceInfos.size()) {
    if(allDeviceInfos.size() <= 0) {
      throw StringError(
        "No OpenCL devices were found on your system. If you believe you do have a GPU or other device with OpenCL installed, then your OpenCL installation or drivers may be buggy or broken or otherwise failing to detect your device."
      );
    }
    throw StringError(
      "Requested gpuIdx/device " + Global::intToString(gpuIdx) +
      " was not found, valid devices range from 0 to " + Global::intToString((int)allDeviceInfos.size() - 1)
    );
  }
  deviceIdsToUse.push_back(allDeviceInfos[gpuIdx].deviceId);
}
In your case, gpuIdxsToUse.size() seems to be 2, but allDeviceInfos.size() equals 1. Therefore, gpuIdx = 1 was not found, because valid devices range from 0 to allDeviceInfos.size() - 1 = 0.
I think there are two possible solutions.
The first solution is simple but dirty. You can try to modify the source code to bypass the OpenCL Device 1 configuration and replace OpenCL Device 1 with the CoreML model.
That is, modify the following code in coremlbackend.cpp:
void NeuralNet::getOutput(
  ComputeHandle* gpuHandle,
  InputBuffers* inputBuffers,
  int numBatchEltsFilled,
  NNResultBuf** inputBufs,
  vector<NNOutput*>& outputs
) {
  if (gpuHandle->handle->gpuIndex == 1) { /* modify this line!!! */
    getOutputFromCoreML(gpuHandle, inputBuffers, numBatchEltsFilled, inputBufs, outputs);
  }
  else {
    getOutputFromOpenCL(gpuHandle, inputBuffers, numBatchEltsFilled, inputBufs, outputs);
  }
}
It should just work in your case, but I do not know how to bypass the OpenCL device 1 configuration. The dirty code comes from this kind of bypass control.
The second solution is more formal. We need to design a new (parent) backend that runs multiple (child) backends. The new backend works like a job scheduler: it can simply dispatch a neural network job to a child backend using a round-robin strategy or a smarter one.
It should also work in your case, but it could be too complicated to finish in a few days.
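Just to illustrate the round-robin idea (this is not KataGo code, and a real backend would be C++ rather than Python), a minimal sketch with hypothetical child backends standing in for the CoreML and OpenCL handlers:

import itertools

class RoundRobinScheduler:
    """Dispatch each neural-net job to the next child backend in turn."""
    def __init__(self, backends):
        self._cycle = itertools.cycle(backends)

    def get_output(self, batch):
        backend = next(self._cycle)  # rotate through the child backends
        return backend(batch)

# Hypothetical stand-ins for getOutputFromCoreML / getOutputFromOpenCL.
scheduler = RoundRobinScheduler([
    lambda batch: ("coreml", batch),
    lambda batch: ("opencl", batch),
])
print(scheduler.get_output([1, 2, 3]))  # ('coreml', [1, 2, 3])
print(scheduler.get_output([4, 5, 6]))  # ('opencl', [4, 5, 6])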
Thanks for the information. I further benchmarked with the following parameters in coreml_example.cfg:
numNNServerThreadsPerModel = 1
openclDeviceToUseThread0 = 0
Output:
% ./Release/katago benchmark -config ../cpp/configs/misc/coreml_example.cfg -model /opt/homebrew/Cellar/katago/1.11.0/share/katago/g170-b30c320x2-s4824661760-d1229536699.bin.gz
2022-08-23 18:22:56+0800: Running with following config:
allowResignation = true
lagBuffer = 1.0
logAllGTPCommunication = true
logDir = gtp_logs
logSearchInfo = true
logToStderr = false
maxTimePondering = 60
maxVisits = 500
numNNServerThreadsPerModel = 1
numSearchThreads = 3
openclDeviceToUseThread0 = 0
ponderingEnabled = false
resignConsecTurns = 3
resignThreshold = -0.90
rules = tromp-taylor
searchFactorAfterOnePass = 0.50
searchFactorAfterTwoPass = 0.25
searchFactorWhenWinning = 0.40
searchFactorWhenWinningThreshold = 0.95
2022-08-23 18:22:56+0800: Loading model and initializing benchmark...
2022-08-23 18:22:56+0800: Testing with default positions for board size: 19
2022-08-23 18:22:56+0800: nnRandSeed0 = 17168248613550848727
2022-08-23 18:22:56+0800: After dedups: nnModelFile0 = /opt/homebrew/Cellar/katago/1.11.0/share/katago/g170-b30c320x2-s4824661760-d1229536699.bin.gz useFP16 auto useNHWC auto
2022-08-23 18:22:56+0800: Initializing neural net buffer to be size 19 * 19 exactly
2022-08-23 18:22:57+0800: Found OpenCL Platform 0: Apple (Apple) (OpenCL 1.2 (Jun 17 2022 18:58:24))
2022-08-23 18:22:57+0800: Found 1 device(s) on platform 0 with type CPU or GPU or Accelerator
2022-08-23 18:22:57+0800: Found OpenCL Device 0: Apple M1 (Apple) (score 1000102)
2022-08-23 18:22:57+0800: Creating context for OpenCL Platform: Apple (Apple) (OpenCL 1.2 (Jun 17 2022 18:58:24))
2022-08-23 18:22:57+0800: Using OpenCL Device 0: Apple M1 (Apple) OpenCL 1.2 (Extensions: cl_APPLE_SetMemObjectDestructor cl_APPLE_ContextLoggingFunctions cl_APPLE_clut cl_APPLE_query_kernel_names cl_APPLE_gl_sharing cl_khr_gl_event cl_khr_byte_addressable_store cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_3d_image_writes cl_khr_image2d_from_buffer cl_khr_depth_images )
2022-08-23 18:22:57+0800: Loaded tuning parameters from: /Users/ohho/.katago/opencltuning/tune8_gpuAppleM1_x19_y19_c320_mv8.txt
2022-08-23 18:22:57+0800: OpenCL backend thread 0: Device 0 Model version 8
2022-08-23 18:22:57+0800: OpenCL backend thread 0: Device 0 Model name: g170-b30c320x2-s4824661760-d1229536699
2022-08-23 18:22:59+0800: OpenCL backend thread 0: Device 0 FP16Storage true FP16Compute false FP16TensorCores false
2022-08-23 18:23:03+0800: Loaded config ../cpp/configs/misc/coreml_example.cfg
2022-08-23 18:23:03+0800: Loaded model /opt/homebrew/Cellar/katago/1.11.0/share/katago/g170-b30c320x2-s4824661760-d1229536699.bin.gz
Testing using 800 visits.
If you have a good GPU, you might increase this using "-visits N" to get more accurate results.
If you have a weak GPU and this is taking forever, you can decrease it instead to finish the benchmark faster.
You are currently using the OpenCL version of KataGo.
If you have a strong GPU capable of FP16 tensor cores (e.g. RTX2080), using the Cuda version of KataGo instead may give a mild performance boost.
Your GTP config is currently set to use numSearchThreads = 3
Automatically trying different numbers of threads to home in on the best (board size 19x19):
2022-08-23 18:23:03+0800: GPU 0 finishing, processed 5 rows 5 batches
2022-08-23 18:23:03+0800: nnRandSeed0 = 2804797857907270145
2022-08-23 18:23:03+0800: After dedups: nnModelFile0 = /opt/homebrew/Cellar/katago/1.11.0/share/katago/g170-b30c320x2-s4824661760-d1229536699.bin.gz useFP16 auto useNHWC auto
2022-08-23 18:23:03+0800: Initializing neural net buffer to be size 19 * 19 exactly
2022-08-23 18:23:04+0800: Found OpenCL Platform 0: Apple (Apple) (OpenCL 1.2 (Jun 17 2022 18:58:24))
2022-08-23 18:23:04+0800: Found 1 device(s) on platform 0 with type CPU or GPU or Accelerator
2022-08-23 18:23:04+0800: Found OpenCL Device 0: Apple M1 (Apple) (score 1000102)
2022-08-23 18:23:04+0800: Creating context for OpenCL Platform: Apple (Apple) (OpenCL 1.2 (Jun 17 2022 18:58:24))
2022-08-23 18:23:04+0800: Using OpenCL Device 0: Apple M1 (Apple) OpenCL 1.2 (Extensions: cl_APPLE_SetMemObjectDestructor cl_APPLE_ContextLoggingFunctions cl_APPLE_clut cl_APPLE_query_kernel_names cl_APPLE_gl_sharing cl_khr_gl_event cl_khr_byte_addressable_store cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_3d_image_writes cl_khr_image2d_from_buffer cl_khr_depth_images )
2022-08-23 18:23:04+0800: Loaded tuning parameters from: /Users/ohho/.katago/opencltuning/tune8_gpuAppleM1_x19_y19_c320_mv8.txt
2022-08-23 18:23:04+0800: OpenCL backend thread 0: Device 0 Model version 8
2022-08-23 18:23:04+0800: OpenCL backend thread 0: Device 0 Model name: g170-b30c320x2-s4824661760-d1229536699
2022-08-23 18:23:06+0800: OpenCL backend thread 0: Device 0 FP16Storage true FP16Compute false FP16TensorCores false
Possible numbers of threads to test: 1, 2, 3, 4, 5, 6, 8, 10, 12, 16, 20, 24, 32,
numSearchThreads = 5: 10 / 10 positions, visits/s = 205.74 nnEvals/s = 167.71 nnBatches/s = 67.38 avgBatchSize = 2.49 (39.1 secs)
numSearchThreads = 12: 10 / 10 positions, visits/s = 204.80 nnEvals/s = 168.91 nnBatches/s = 28.57 avgBatchSize = 5.91 (39.6 secs)
numSearchThreads = 3: 10 / 10 positions, visits/s = 209.13 nnEvals/s = 169.23 nnBatches/s = 112.99 avgBatchSize = 1.50 (38.4 secs)
numSearchThreads = 6: 10 / 10 positions, visits/s = 210.77 nnEvals/s = 168.91 nnBatches/s = 56.63 avgBatchSize = 2.98 (38.2 secs)
numSearchThreads = 2: 10 / 10 positions, visits/s = 207.31 nnEvals/s = 168.98 nnBatches/s = 168.98 avgBatchSize = 1.00 (38.6 secs)
numSearchThreads = 4: 10 / 10 positions, visits/s = 210.54 nnEvals/s = 169.27 nnBatches/s = 84.90 avgBatchSize = 1.99 (38.1 secs)
numSearchThreads = 1: 10 / 10 positions, visits/s = 197.77 nnEvals/s = 166.13 nnBatches/s = 166.13 avgBatchSize = 1.00 (40.5 secs)
Ordered summary of results:
numSearchThreads = 1: 10 / 10 positions, visits/s = 197.77 nnEvals/s = 166.13 nnBatches/s = 166.13 avgBatchSize = 1.00 (40.5 secs) (EloDiff baseline)
numSearchThreads = 2: 10 / 10 positions, visits/s = 207.31 nnEvals/s = 168.98 nnBatches/s = 168.98 avgBatchSize = 1.00 (38.6 secs) (EloDiff +11)
numSearchThreads = 3: 10 / 10 positions, visits/s = 209.13 nnEvals/s = 169.23 nnBatches/s = 112.99 avgBatchSize = 1.50 (38.4 secs) (EloDiff +8)
numSearchThreads = 4: 10 / 10 positions, visits/s = 210.54 nnEvals/s = 169.27 nnBatches/s = 84.90 avgBatchSize = 1.99 (38.1 secs) (EloDiff +4)
numSearchThreads = 5: 10 / 10 positions, visits/s = 205.74 nnEvals/s = 167.71 nnBatches/s = 67.38 avgBatchSize = 2.49 (39.1 secs) (EloDiff -11)
numSearchThreads = 6: 10 / 10 positions, visits/s = 210.77 nnEvals/s = 168.91 nnBatches/s = 56.63 avgBatchSize = 2.98 (38.2 secs) (EloDiff -8)
numSearchThreads = 12: 10 / 10 positions, visits/s = 204.80 nnEvals/s = 168.91 nnBatches/s = 28.57 avgBatchSize = 5.91 (39.6 secs) (EloDiff -56)
Based on some test data, each speed doubling gains perhaps ~250 Elo by searching deeper.
Based on some test data, each thread costs perhaps 7 Elo if using 800 visits, and 2 Elo if using 5000 visits (by making MCTS worse).
So APPROXIMATELY based on this benchmark, if you intend to do a 5 second search:
numSearchThreads = 1: (baseline)
numSearchThreads = 2: +11 Elo (recommended)
numSearchThreads = 3: +8 Elo
numSearchThreads = 4: +4 Elo
numSearchThreads = 5: -11 Elo
numSearchThreads = 6: -8 Elo
numSearchThreads = 12: -56 Elo
If you care about performance, you may want to edit numSearchThreads in ../cpp/configs/misc/coreml_example.cfg based on the above results!
If you intend to do much longer searches, configure the seconds per game move you expect with the '-time' flag and benchmark again.
If you intend to do short or fixed-visit searches, use lower numSearchThreads for better strength, high threads will weaken strength.
If interested see also other notes about performance and mem usage in the top of ../cpp/configs/misc/coreml_example.cfg
2022-08-23 18:27:39+0800: GPU 0 finishing, processed 45892 rows 26752 batches
xcode % ./Release/katago benchmark -config ../cpp/configs/misc/coreml_example.cfg -model /opt/homebrew/Cellar/katago/1.11.0/share/katago/g170-b30c320x2-s4824661760-d1229536699.bin.gz
2022-08-23 18:29:05+0800: Running with following config:
allowResignation = true
lagBuffer = 1.0
logAllGTPCommunication = true
logDir = gtp_logs
logSearchInfo = true
logToStderr = false
maxTimePondering = 60
maxVisits = 500
numNNServerThreadsPerModel = 1
numSearchThreads = 3
openclDeviceToUseThread0 = 0
ponderingEnabled = false
resignConsecTurns = 3
resignThreshold = -0.90
rules = tromp-taylor
searchFactorAfterOnePass = 0.50
searchFactorAfterTwoPass = 0.25
searchFactorWhenWinning = 0.40
searchFactorWhenWinningThreshold = 0.95
2022-08-23 18:29:05+0800: Loading model and initializing benchmark...
2022-08-23 18:29:05+0800: Testing with default positions for board size: 19
2022-08-23 18:29:05+0800: nnRandSeed0 = 3967709193022350012
2022-08-23 18:29:05+0800: After dedups: nnModelFile0 = /opt/homebrew/Cellar/katago/1.11.0/share/katago/g170-b30c320x2-s4824661760-d1229536699.bin.gz useFP16 auto useNHWC auto
2022-08-23 18:29:05+0800: Initializing neural net buffer to be size 19 * 19 exactly
2022-08-23 18:29:06+0800: Found OpenCL Platform 0: Apple (Apple) (OpenCL 1.2 (Jun 17 2022 18:58:24))
2022-08-23 18:29:06+0800: Found 1 device(s) on platform 0 with type CPU or GPU or Accelerator
2022-08-23 18:29:06+0800: Found OpenCL Device 0: Apple M1 (Apple) (score 1000102)
2022-08-23 18:29:06+0800: Creating context for OpenCL Platform: Apple (Apple) (OpenCL 1.2 (Jun 17 2022 18:58:24))
2022-08-23 18:29:06+0800: Using OpenCL Device 0: Apple M1 (Apple) OpenCL 1.2 (Extensions: cl_APPLE_SetMemObjectDestructor cl_APPLE_ContextLoggingFunctions cl_APPLE_clut cl_APPLE_query_kernel_names cl_APPLE_gl_sharing cl_khr_gl_event cl_khr_byte_addressable_store cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_3d_image_writes cl_khr_image2d_from_buffer cl_khr_depth_images )
2022-08-23 18:29:06+0800: No existing tuning parameters found or parseable or valid at: /Users/ohho/.katago/opencltuning/tune8_gpuAppleM1_x19_y19_c320_mv8.txt
2022-08-23 18:29:06+0800: Performing autotuning
2022-08-23 18:29:06+0800: *** On some systems, this may take several minutes, please be patient ***
2022-08-23 18:29:06+0800: Found OpenCL Platform 0: Apple (Apple) (OpenCL 1.2 (Jun 17 2022 18:58:24))
2022-08-23 18:29:06+0800: Found 1 device(s) on platform 0 with type CPU or GPU or Accelerator
2022-08-23 18:29:06+0800: Found OpenCL Device 0: Apple M1 (Apple) (score 1000102)
2022-08-23 18:29:06+0800: Creating context for OpenCL Platform: Apple (Apple) (OpenCL 1.2 (Jun 17 2022 18:58:24))
2022-08-23 18:29:06+0800: Using OpenCL Device 0: Apple M1 (Apple) OpenCL 1.2 (Extensions: cl_APPLE_SetMemObjectDestructor cl_APPLE_ContextLoggingFunctions cl_APPLE_clut cl_APPLE_query_kernel_names cl_APPLE_gl_sharing cl_khr_gl_event cl_khr_byte_addressable_store cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_3d_image_writes cl_khr_image2d_from_buffer cl_khr_depth_images )
Beginning GPU tuning for Apple M1 modelVersion 8 channels 320
Setting winograd3x3TileSize = 4
------------------------------------------------------
Tuning xGemmDirect for 1x1 convolutions and matrix mult
Testing 56 different configs
Tuning 0/56 (reference) Calls/sec 733.912 L2Error 0 WGD=8 MDIMCD=1 NDIMCD=1 MDIMAD=1 NDIMBD=1 KWID=1 VWMD=1 VWND=1 PADA=1 PADB=1
Tuning 1/56 Calls/sec 735.602 L2Error 0 WGD=8 MDIMCD=1 NDIMCD=1 MDIMAD=1 NDIMBD=1 KWID=1 VWMD=1 VWND=1 PADA=1 PADB=1
Tuning 2/56 Calls/sec 40474.6 L2Error 0 WGD=8 MDIMCD=8 NDIMCD=8 MDIMAD=8 NDIMBD=8 KWID=1 VWMD=1 VWND=1 PADA=1 PADB=1
Tuning 20/56 ...
Tuning 40/56 ...
------------------------------------------------------
Tuning xGemm for convolutions
Testing 70 different configs
Tuning 0/70 (reference) Calls/sec 1173.42 L2Error 0 MWG=8 NWG=8 KWG=8 MDIMC=1 NDIMC=1 MDIMA=1 NDIMB=1 KWI=1 VWM=1 VWN=1 STRM=0 STRN=0 SA=0 SB=0
Tuning 1/70 Calls/sec 1199.21 L2Error 0 MWG=8 NWG=8 KWG=8 MDIMC=1 NDIMC=1 MDIMA=1 NDIMB=1 KWI=1 VWM=1 VWN=1 STRM=0 STRN=0 SA=0 SB=0
Tuning 2/70 Calls/sec 9592.37 L2Error 0 MWG=8 NWG=8 KWG=8 MDIMC=8 NDIMC=8 MDIMA=8 NDIMB=8 KWI=1 VWM=1 VWN=1 STRM=0 STRN=0 SA=0 SB=0
Tuning 3/70 Calls/sec 13249.2 L2Error 0 MWG=16 NWG=16 KWG=16 MDIMC=8 NDIMC=8 MDIMA=8 NDIMB=8 KWI=1 VWM=1 VWN=1 STRM=0 STRN=0 SA=0 SB=0
Tuning 5/70 Calls/sec 14150.4 L2Error 0 MWG=32 NWG=32 KWG=32 MDIMC=16 NDIMC=8 MDIMA=16 NDIMB=8 KWI=2 VWM=2 VWN=2 STRM=0 STRN=0 SA=1 SB=1
Tuning 6/70 Calls/sec 16242.3 L2Error 0 MWG=64 NWG=64 KWG=32 MDIMC=8 NDIMC=16 MDIMA=8 NDIMB=16 KWI=2 VWM=4 VWN=4 STRM=0 STRN=0 SA=1 SB=1
Tuning 10/70 Calls/sec 17242.3 L2Error 0 MWG=64 NWG=64 KWG=16 MDIMC=8 NDIMC=8 MDIMA=8 NDIMB=8 KWI=2 VWM=4 VWN=4 STRM=0 STRN=0 SA=0 SB=0
Tuning 11/70 Calls/sec 17563.5 L2Error 0 MWG=64 NWG=64 KWG=32 MDIMC=16 NDIMC=8 MDIMA=16 NDIMB=8 KWI=2 VWM=4 VWN=4 STRM=0 STRN=0 SA=1 SB=1
Tuning 23/70 Calls/sec 20887.3 L2Error 0 MWG=64 NWG=64 KWG=32 MDIMC=8 NDIMC=8 MDIMA=8 NDIMB=8 KWI=2 VWM=2 VWN=2 STRM=0 STRN=0 SA=1 SB=1
Tuning 40/70 ...
Tuning 60/70 ...
------------------------------------------------------
Tuning hGemmWmma for convolutions
Testing 146 different configs
UNSUPPORTED (log once): buildComputeProgram: cl2Metal failed
FP16 tensor core tuning failed, assuming no FP16 tensor core support
------------------------------------------------------
Tuning xGemm for convolutions - trying with FP16 storage
Testing 70 different configs
Tuning 0/70 (reference) Calls/sec 1981.24 L2Error 0 MWG=8 NWG=8 KWG=8 MDIMC=1 NDIMC=1 MDIMA=1 NDIMB=1 KWI=1 VWM=1 VWN=1 STRM=0 STRN=0 SA=0 SB=0
Tuning 1/70 Calls/sec 19940.6 L2Error 0 MWG=64 NWG=64 KWG=32 MDIMC=8 NDIMC=8 MDIMA=8 NDIMB=8 KWI=2 VWM=2 VWN=2 STRM=0 STRN=0 SA=1 SB=1
Tuning 14/70 Calls/sec 21231.3 L2Error 0 MWG=64 NWG=64 KWG=16 MDIMC=8 NDIMC=16 MDIMA=8 NDIMB=16 KWI=2 VWM=4 VWN=4 STRM=0 STRN=0 SA=0 SB=0
Tuning 18/70 Calls/sec 21644.9 L2Error 0 MWG=64 NWG=64 KWG=16 MDIMC=16 NDIMC=8 MDIMA=16 NDIMB=8 KWI=2 VWM=2 VWN=2 STRM=0 STRN=0 SA=1 SB=1
Tuning 21/70 Calls/sec 22117.8 L2Error 0 MWG=32 NWG=32 KWG=16 MDIMC=8 NDIMC=16 MDIMA=8 NDIMB=16 KWI=2 VWM=2 VWN=2 STRM=0 STRN=0 SA=1 SB=1
Tuning 40/70 ...
Tuning 60/70 ...
Tuning 62/70 Calls/sec 22138 L2Error 0 MWG=64 NWG=64 KWG=16 MDIMC=8 NDIMC=16 MDIMA=8 NDIMB=16 KWI=2 VWM=2 VWN=2 STRM=0 STRN=0 SA=1 SB=1
FP16 storage not significantly faster, not enabling on its own
------------------------------------------------------
Using FP32 storage!
Using FP32 compute!
------------------------------------------------------
Tuning winograd transform for convolutions
Testing 47 different configs
Tuning 0/47 (reference) Calls/sec 40962.7 L2Error 0 transLocalSize0=1 transLocalSize1=1
Tuning 1/47 Calls/sec 43588 L2Error 0 transLocalSize0=1 transLocalSize1=1
Tuning 2/47 Calls/sec 222612 L2Error 0 transLocalSize0=16 transLocalSize1=1
Tuning 10/47 Calls/sec 234950 L2Error 0 transLocalSize0=64 transLocalSize1=4
Tuning 20/47 ...
Tuning 40/47 ...
------------------------------------------------------
Tuning winograd untransform for convolutions
Testing 111 different configs
Tuning 0/111 (reference) Calls/sec 52545.5 L2Error 0 untransLocalSize0=1 untransLocalSize1=1 untransLocalSize2=1
Tuning 4/111 Calls/sec 124421 L2Error 0 untransLocalSize0=2 untransLocalSize1=2 untransLocalSize2=1
Tuning 20/111 ...
Tuning 40/111 ...
Tuning 60/111 ...
Tuning 64/111 Calls/sec 133565 L2Error 0 untransLocalSize0=8 untransLocalSize1=4 untransLocalSize2=1
Tuning 80/111 ...
Tuning 100/111 ...
------------------------------------------------------
Tuning global pooling strides
Testing 106 different configs
Tuning 0/106 (reference) Calls/sec 278973 L2Error 0 XYSTRIDE=1 CHANNELSTRIDE=1 BATCHSTRIDE=1
Tuning 2/106 Calls/sec 387898 L2Error 1.51315e-11 XYSTRIDE=2 CHANNELSTRIDE=32 BATCHSTRIDE=4
Tuning 3/106 Calls/sec 410722 L2Error 1.51315e-11 XYSTRIDE=2 CHANNELSTRIDE=32 BATCHSTRIDE=2
Tuning 4/106 Calls/sec 1.2236e+06 L2Error 1.0985e-11 XYSTRIDE=8 CHANNELSTRIDE=2 BATCHSTRIDE=1
Tuning 7/106 Calls/sec 1.66331e+06 L2Error 1.08218e-11 XYSTRIDE=16 CHANNELSTRIDE=16 BATCHSTRIDE=1
Tuning 9/106 Calls/sec 1.69265e+06 L2Error 1.08218e-11 XYSTRIDE=16 CHANNELSTRIDE=1 BATCHSTRIDE=2
Tuning 20/106 ...
Tuning 21/106 Calls/sec 1.88661e+06 L2Error 1.07414e-11 XYSTRIDE=32 CHANNELSTRIDE=4 BATCHSTRIDE=2
Tuning 32/106 Calls/sec 1.91378e+06 L2Error 1.07414e-11 XYSTRIDE=32 CHANNELSTRIDE=1 BATCHSTRIDE=2
Tuning 60/106 ...
Tuning 80/106 ...
Tuning 87/106 Calls/sec 1.92386e+06 L2Error 1.07414e-11 XYSTRIDE=32 CHANNELSTRIDE=1 BATCHSTRIDE=1
Tuning 100/106 ...
Done tuning
------------------------------------------------------
2022-08-23 18:29:50+0800: Done tuning, saved results to /Users/ohho/.katago/opencltuning/tune8_gpuAppleM1_x19_y19_c320_mv8.txt
2022-08-23 18:29:50+0800: OpenCL backend thread 0: Device 0 Model version 8
2022-08-23 18:29:50+0800: OpenCL backend thread 0: Device 0 Model name: g170-b30c320x2-s4824661760-d1229536699
2022-08-23 18:29:52+0800: OpenCL backend thread 0: Device 0 FP16Storage false FP16Compute false FP16TensorCores false
2022-08-23 18:29:52+0800: Loaded config ../cpp/configs/misc/coreml_example.cfg
2022-08-23 18:29:52+0800: Loaded model /opt/homebrew/Cellar/katago/1.11.0/share/katago/g170-b30c320x2-s4824661760-d1229536699.bin.gz
Testing using 800 visits.
If you have a good GPU, you might increase this using "-visits N" to get more accurate results.
If you have a weak GPU and this is taking forever, you can decrease it instead to finish the benchmark faster.
You are currently using the OpenCL version of KataGo.
If you have a strong GPU capable of FP16 tensor cores (e.g. RTX2080), using the Cuda version of KataGo instead may give a mild performance boost.
Your GTP config is currently set to use numSearchThreads = 3
Automatically trying different numbers of threads to home in on the best (board size 19x19):
2022-08-23 18:29:52+0800: GPU 0 finishing, processed 5 rows 5 batches
2022-08-23 18:29:52+0800: nnRandSeed0 = 855502274329083860
2022-08-23 18:29:52+0800: After dedups: nnModelFile0 = /opt/homebrew/Cellar/katago/1.11.0/share/katago/g170-b30c320x2-s4824661760-d1229536699.bin.gz useFP16 auto useNHWC auto
2022-08-23 18:29:52+0800: Initializing neural net buffer to be size 19 * 19 exactly
2022-08-23 18:29:53+0800: Found OpenCL Platform 0: Apple (Apple) (OpenCL 1.2 (Jun 17 2022 18:58:24))
2022-08-23 18:29:53+0800: Found 1 device(s) on platform 0 with type CPU or GPU or Accelerator
2022-08-23 18:29:53+0800: Found OpenCL Device 0: Apple M1 (Apple) (score 1000102)
2022-08-23 18:29:53+0800: Creating context for OpenCL Platform: Apple (Apple) (OpenCL 1.2 (Jun 17 2022 18:58:24))
2022-08-23 18:29:53+0800: Using OpenCL Device 0: Apple M1 (Apple) OpenCL 1.2 (Extensions: cl_APPLE_SetMemObjectDestructor cl_APPLE_ContextLoggingFunctions cl_APPLE_clut cl_APPLE_query_kernel_names cl_APPLE_gl_sharing cl_khr_gl_event cl_khr_byte_addressable_store cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_3d_image_writes cl_khr_image2d_from_buffer cl_khr_depth_images )
2022-08-23 18:29:53+0800: Loaded tuning parameters from: /Users/ohho/.katago/opencltuning/tune8_gpuAppleM1_x19_y19_c320_mv8.txt
2022-08-23 18:29:54+0800: OpenCL backend thread 0: Device 0 Model version 8
2022-08-23 18:29:54+0800: OpenCL backend thread 0: Device 0 Model name: g170-b30c320x2-s4824661760-d1229536699
2022-08-23 18:29:55+0800: OpenCL backend thread 0: Device 0 FP16Storage false FP16Compute false FP16TensorCores false
Possible numbers of threads to test: 1, 2, 3, 4, 5, 6, 8, 10, 12, 16, 20, 24, 32,
numSearchThreads = 5: 10 / 10 positions, visits/s = 210.41 nnEvals/s = 168.22 nnBatches/s = 67.62 avgBatchSize = 2.49 (38.2 secs)
numSearchThreads = 12: 10 / 10 positions, visits/s = 204.03 nnEvals/s = 168.86 nnBatches/s = 28.53 avgBatchSize = 5.92 (39.7 secs)
numSearchThreads = 3: 10 / 10 positions, visits/s = 208.78 nnEvals/s = 169.45 nnBatches/s = 113.08 avgBatchSize = 1.50 (38.4 secs)
numSearchThreads = 6: 10 / 10 positions, visits/s = 210.25 nnEvals/s = 169.30 nnBatches/s = 56.83 avgBatchSize = 2.98 (38.3 secs)
numSearchThreads = 2: 10 / 10 positions, visits/s = 208.05 nnEvals/s = 168.60 nnBatches/s = 168.55 avgBatchSize = 1.00 (38.5 secs)
numSearchThreads = 4: 10 / 10 positions, visits/s = 208.15 nnEvals/s = 169.32 nnBatches/s = 84.89 avgBatchSize = 1.99 (38.6 secs)
numSearchThreads = 1: 10 / 10 positions, visits/s = 202.75 nnEvals/s = 165.87 nnBatches/s = 165.87 avgBatchSize = 1.00 (39.5 secs)
Ordered summary of results:
numSearchThreads = 1: 10 / 10 positions, visits/s = 202.75 nnEvals/s = 165.87 nnBatches/s = 165.87 avgBatchSize = 1.00 (39.5 secs) (EloDiff baseline)
numSearchThreads = 2: 10 / 10 positions, visits/s = 208.05 nnEvals/s = 168.60 nnBatches/s = 168.55 avgBatchSize = 1.00 (38.5 secs) (EloDiff +3)
numSearchThreads = 3: 10 / 10 positions, visits/s = 208.78 nnEvals/s = 169.45 nnBatches/s = 113.08 avgBatchSize = 1.50 (38.4 secs) (EloDiff -2)
numSearchThreads = 4: 10 / 10 positions, visits/s = 208.15 nnEvals/s = 169.32 nnBatches/s = 84.89 avgBatchSize = 1.99 (38.6 secs) (EloDiff -9)
numSearchThreads = 5: 10 / 10 positions, visits/s = 210.41 nnEvals/s = 168.22 nnBatches/s = 67.62 avgBatchSize = 2.49 (38.2 secs) (EloDiff -11)
numSearchThreads = 6: 10 / 10 positions, visits/s = 210.25 nnEvals/s = 169.30 nnBatches/s = 56.83 avgBatchSize = 2.98 (38.3 secs) (EloDiff -18)
numSearchThreads = 12: 10 / 10 positions, visits/s = 204.03 nnEvals/s = 168.86 nnBatches/s = 28.53 avgBatchSize = 5.92 (39.7 secs) (EloDiff -67)
Based on some test data, each speed doubling gains perhaps ~250 Elo by searching deeper.
Based on some test data, each thread costs perhaps 7 Elo if using 800 visits, and 2 Elo if using 5000 visits (by making MCTS worse).
So APPROXIMATELY based on this benchmark, if you intend to do a 5 second search:
numSearchThreads = 1: (baseline)
numSearchThreads = 2: +3 Elo (recommended)
numSearchThreads = 3: -2 Elo
numSearchThreads = 4: -9 Elo
numSearchThreads = 5: -11 Elo
numSearchThreads = 6: -18 Elo
numSearchThreads = 12: -67 Elo
If you care about performance, you may want to edit numSearchThreads in ../cpp/configs/misc/coreml_example.cfg based on the above results!
If you intend to do much longer searches, configure the seconds per game move you expect with the '-time' flag and benchmark again.
If you intend to do short or fixed-visit searches, use lower numSearchThreads for better strength, high threads will weaken strength.
If interested see also other notes about performance and mem usage in the top of ../cpp/configs/misc/coreml_example.cfg
2022-08-23 18:34:27+0800: GPU 0 finishing, processed 45704 rows 26552 batches
The following benchmark is from the original katago:
% /opt/homebrew/bin/katago benchmark -config $(brew list --verbose katago | grep gtp_example.cfg) -model $(brew list --verbose katago | grep .gz | head -1)
2022-08-23 18:54:22+0800: Loading model and initializing benchmark...
2022-08-23 18:54:22+0800: Testing with default positions for board size: 19
2022-08-23 18:54:22+0800: nnRandSeed0 = 11121051184250860705
2022-08-23 18:54:22+0800: After dedups: nnModelFile0 = /opt/homebrew/Cellar/katago/1.11.0/share/katago/g170-b30c320x2-s4824661760-d1229536699.bin.gz useFP16 auto useNHWC auto
2022-08-23 18:54:22+0800: Initializing neural net buffer to be size 19 * 19 exactly
2022-08-23 18:54:23+0800: Found OpenCL Platform 0: Apple (Apple) (OpenCL 1.2 (Jun 17 2022 18:58:24))
2022-08-23 18:54:23+0800: Found 1 device(s) on platform 0 with type CPU or GPU or Accelerator
2022-08-23 18:54:23+0800: Found OpenCL Device 0: Apple M1 (Apple) (score 1000102)
2022-08-23 18:54:23+0800: Creating context for OpenCL Platform: Apple (Apple) (OpenCL 1.2 (Jun 17 2022 18:58:24))
2022-08-23 18:54:23+0800: Using OpenCL Device 0: Apple M1 (Apple) OpenCL 1.2 (Extensions: cl_APPLE_SetMemObjectDestructor cl_APPLE_ContextLoggingFunctions cl_APPLE_clut cl_APPLE_query_kernel_names cl_APPLE_gl_sharing cl_khr_gl_event cl_khr_byte_addressable_store cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_3d_image_writes cl_khr_image2d_from_buffer cl_khr_depth_images )
2022-08-23 18:54:23+0800: No existing tuning parameters found or parseable or valid at: /Users/ohho/.katago/opencltuning/tune8_gpuAppleM1_x19_y19_c320_mv8.txt
2022-08-23 18:54:23+0800: Performing autotuning
2022-08-23 18:54:23+0800: *** On some systems, this may take several minutes, please be patient ***
2022-08-23 18:54:23+0800: Found OpenCL Platform 0: Apple (Apple) (OpenCL 1.2 (Jun 17 2022 18:58:24))
2022-08-23 18:54:23+0800: Found 1 device(s) on platform 0 with type CPU or GPU or Accelerator
2022-08-23 18:54:23+0800: Found OpenCL Device 0: Apple M1 (Apple) (score 1000102)
2022-08-23 18:54:23+0800: Creating context for OpenCL Platform: Apple (Apple) (OpenCL 1.2 (Jun 17 2022 18:58:24))
2022-08-23 18:54:23+0800: Using OpenCL Device 0: Apple M1 (Apple) OpenCL 1.2 (Extensions: cl_APPLE_SetMemObjectDestructor cl_APPLE_ContextLoggingFunctions cl_APPLE_clut cl_APPLE_query_kernel_names cl_APPLE_gl_sharing cl_khr_gl_event cl_khr_byte_addressable_store cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_3d_image_writes cl_khr_image2d_from_buffer cl_khr_depth_images )
Beginning GPU tuning for Apple M1 modelVersion 8 channels 320
Setting winograd3x3TileSize = 4
------------------------------------------------------
Tuning xGemmDirect for 1x1 convolutions and matrix mult
Testing 56 different configs
Tuning 0/56 (reference) Calls/sec 734.829 L2Error 0 WGD=8 MDIMCD=1 NDIMCD=1 MDIMAD=1 NDIMBD=1 KWID=1 VWMD=1 VWND=1 PADA=1 PADB=1
Tuning 1/56 Calls/sec 737.741 L2Error 0 WGD=8 MDIMCD=1 NDIMCD=1 MDIMAD=1 NDIMBD=1 KWID=1 VWMD=1 VWND=1 PADA=1 PADB=1
Tuning 2/56 Calls/sec 14806.1 L2Error 0 WGD=8 MDIMCD=8 NDIMCD=8 MDIMAD=8 NDIMBD=8 KWID=1 VWMD=1 VWND=1 PADA=1 PADB=1
Tuning 3/56 Calls/sec 20445.5 L2Error 0 WGD=16 MDIMCD=8 NDIMCD=8 MDIMAD=8 NDIMBD=8 KWID=1 VWMD=1 VWND=1 PADA=1 PADB=1
Tuning 4/56 Calls/sec 25348.7 L2Error 0 WGD=32 MDIMCD=8 NDIMCD=16 MDIMAD=8 NDIMBD=8 KWID=8 VWMD=4 VWND=2 PADA=1 PADB=1
Tuning 6/56 Calls/sec 27950.6 L2Error 0 WGD=32 MDIMCD=8 NDIMCD=16 MDIMAD=8 NDIMBD=8 KWID=2 VWMD=2 VWND=2 PADA=1 PADB=1
Tuning 20/56 ...
Tuning 40/56 ...
Tuning 51/56 Calls/sec 43289.8 L2Error 0 WGD=32 MDIMCD=16 NDIMCD=8 MDIMAD=8 NDIMBD=16 KWID=8 VWMD=2 VWND=2 PADA=1 PADB=1
------------------------------------------------------
Tuning xGemm for convolutions
Testing 70 different configs
Tuning 0/70 (reference) Calls/sec 1201 L2Error 0 MWG=8 NWG=8 KWG=8 MDIMC=1 NDIMC=1 MDIMA=1 NDIMB=1 KWI=1 VWM=1 VWN=1 STRM=0 STRN=0 SA=0 SB=0
Tuning 2/70 Calls/sec 10339.1 L2Error 0 MWG=8 NWG=8 KWG=8 MDIMC=8 NDIMC=8 MDIMA=8 NDIMB=8 KWI=1 VWM=1 VWN=1 STRM=0 STRN=0 SA=0 SB=0
Tuning 4/70 Calls/sec 13410.7 L2Error 0 MWG=64 NWG=64 KWG=32 MDIMC=8 NDIMC=16 MDIMA=8 NDIMB=16 KWI=2 VWM=4 VWN=4 STRM=0 STRN=0 SA=1 SB=1
Tuning 5/70 Calls/sec 13935.9 L2Error 0 MWG=64 NWG=64 KWG=16 MDIMC=8 NDIMC=8 MDIMA=8 NDIMB=8 KWI=2 VWM=4 VWN=4 STRM=0 STRN=0 SA=0 SB=0
Tuning 7/70 Calls/sec 17309.5 L2Error 0 MWG=64 NWG=64 KWG=16 MDIMC=16 NDIMC=8 MDIMA=16 NDIMB=8 KWI=2 VWM=4 VWN=4 STRM=0 STRN=0 SA=1 SB=1
Tuning 20/70 ...
Tuning 40/70 ...
Tuning 41/70 Calls/sec 24639.8 L2Error 0 MWG=64 NWG=64 KWG=16 MDIMC=16 NDIMC=8 MDIMA=16 NDIMB=8 KWI=2 VWM=2 VWN=2 STRM=0 STRN=0 SA=0 SB=0
Tuning 60/70 ...
------------------------------------------------------
Tuning hGemmWmma for convolutions
Testing 146 different configs
UNSUPPORTED (log once): buildComputeProgram: cl2Metal failed
FP16 tensor core tuning failed, assuming no FP16 tensor core support
------------------------------------------------------
Tuning xGemm for convolutions - trying with FP16 storage
Testing 70 different configs
Tuning 0/70 (reference) Calls/sec 1980.93 L2Error 0 MWG=8 NWG=8 KWG=8 MDIMC=1 NDIMC=1 MDIMA=1 NDIMB=1 KWI=1 VWM=1 VWN=1 STRM=0 STRN=0 SA=0 SB=0
Tuning 1/70 Calls/sec 16560 L2Error 0 MWG=64 NWG=64 KWG=16 MDIMC=16 NDIMC=8 MDIMA=16 NDIMB=8 KWI=2 VWM=2 VWN=2 STRM=0 STRN=0 SA=0 SB=0
Tuning 3/70 Calls/sec 24493.4 L2Error 0 MWG=16 NWG=16 KWG=16 MDIMC=8 NDIMC=8 MDIMA=8 NDIMB=8 KWI=1 VWM=1 VWN=1 STRM=0 STRN=0 SA=0 SB=0
Tuning 20/70 ...
Tuning 40/70 ...
Tuning 60/70 ...
Tuning 61/70 Calls/sec 25643.1 L2Error 0 MWG=32 NWG=32 KWG=32 MDIMC=16 NDIMC=16 MDIMA=16 NDIMB=16 KWI=2 VWM=2 VWN=2 STRM=0 STRN=0 SA=0 SB=0
FP16 storage not significantly faster, not enabling on its own
------------------------------------------------------
Using FP32 storage!
Using FP32 compute!
------------------------------------------------------
Tuning winograd transform for convolutions
Testing 47 different configs
Tuning 0/47 (reference) Calls/sec 54421.8 L2Error 0 transLocalSize0=1 transLocalSize1=1
Tuning 2/47 Calls/sec 162156 L2Error 0 transLocalSize0=32 transLocalSize1=8
Tuning 4/47 Calls/sec 201234 L2Error 0 transLocalSize0=8 transLocalSize1=2
Tuning 16/47 Calls/sec 212234 L2Error 0 transLocalSize0=4 transLocalSize1=1
Tuning 33/47 Calls/sec 217754 L2Error 0 transLocalSize0=16 transLocalSize1=4
------------------------------------------------------
Tuning winograd untransform for convolutions
Testing 111 different configs
Tuning 0/111 (reference) Calls/sec 52583.6 L2Error 0 untransLocalSize0=1 untransLocalSize1=1 untransLocalSize2=1
Tuning 2/111 Calls/sec 57195.1 L2Error 0 untransLocalSize0=2 untransLocalSize1=1 untransLocalSize2=4
Tuning 4/111 Calls/sec 83815.3 L2Error 0 untransLocalSize0=2 untransLocalSize1=2 untransLocalSize2=2
Tuning 9/111 Calls/sec 107732 L2Error 0 untransLocalSize0=8 untransLocalSize1=1 untransLocalSize2=2
Tuning 20/111 ...
Tuning 29/111 Calls/sec 108838 L2Error 0 untransLocalSize0=8 untransLocalSize1=2 untransLocalSize2=1
Tuning 40/111 ...
Tuning 47/111 Calls/sec 110162 L2Error 0 untransLocalSize0=8 untransLocalSize1=2 untransLocalSize2=2
Tuning 60/111 ...
Tuning 80/111 ...
Tuning 100/111 ...
------------------------------------------------------
Tuning global pooling strides
Testing 106 different configs
Tuning 0/106 (reference) Calls/sec 276464 L2Error 0 XYSTRIDE=1 CHANNELSTRIDE=1 BATCHSTRIDE=1
Tuning 1/106 Calls/sec 277790 L2Error 0 XYSTRIDE=1 CHANNELSTRIDE=1 BATCHSTRIDE=1
Tuning 2/106 Calls/sec 857091 L2Error 1.21663e-11 XYSTRIDE=4 CHANNELSTRIDE=2 BATCHSTRIDE=4
Tuning 5/106 Calls/sec 1.89791e+06 L2Error 1.07414e-11 XYSTRIDE=32 CHANNELSTRIDE=2 BATCHSTRIDE=2
Tuning 20/106 ...
Tuning 40/106 ...
Tuning 58/106 Calls/sec 1.9084e+06 L2Error 1.07414e-11 XYSTRIDE=32 CHANNELSTRIDE=4 BATCHSTRIDE=1
Tuning 80/106 ...
Tuning 100/106 ...
Tuning 101/106 Calls/sec 1.91359e+06 L2Error 1.07414e-11 XYSTRIDE=32 CHANNELSTRIDE=1 BATCHSTRIDE=2
Done tuning
------------------------------------------------------
2022-08-23 18:54:47+0800: Done tuning, saved results to /Users/ohho/.katago/opencltuning/tune8_gpuAppleM1_x19_y19_c320_mv8.txt
2022-08-23 18:54:48+0800: OpenCL backend thread 0: Model version 8
2022-08-23 18:54:48+0800: OpenCL backend thread 0: Model name: g170-b30c320x2-s4824661760-d1229536699
2022-08-23 18:54:49+0800: OpenCL backend thread 0: FP16Storage false FP16Compute false FP16TensorCores false
2022-08-23 18:54:50+0800: Loaded config /opt/homebrew/Cellar/katago/1.11.0/share/katago/configs/gtp_example.cfg
2022-08-23 18:54:50+0800: Loaded model /opt/homebrew/Cellar/katago/1.11.0/share/katago/g170-b30c320x2-s4824661760-d1229536699.bin.gz
Testing using 800 visits.
If you have a good GPU, you might increase this using "-visits N" to get more accurate results.
If you have a weak GPU and this is taking forever, you can decrease it instead to finish the benchmark faster.
You are currently using the OpenCL version of KataGo.
If you have a strong GPU capable of FP16 tensor cores (e.g. RTX2080), using the Cuda version of KataGo instead may give a mild performance boost.
Your GTP config is currently set to use numSearchThreads = 6
Automatically trying different numbers of threads to home in on the best (board size 19x19):
2022-08-23 18:54:50+0800: GPU -1 finishing, processed 5 rows 5 batches
2022-08-23 18:54:50+0800: nnRandSeed0 = 6224177368933988412
2022-08-23 18:54:50+0800: After dedups: nnModelFile0 = /opt/homebrew/Cellar/katago/1.11.0/share/katago/g170-b30c320x2-s4824661760-d1229536699.bin.gz useFP16 auto useNHWC auto
2022-08-23 18:54:50+0800: Initializing neural net buffer to be size 19 * 19 exactly
2022-08-23 18:54:51+0800: Found OpenCL Platform 0: Apple (Apple) (OpenCL 1.2 (Jun 17 2022 18:58:24))
2022-08-23 18:54:51+0800: Found 1 device(s) on platform 0 with type CPU or GPU or Accelerator
2022-08-23 18:54:51+0800: Found OpenCL Device 0: Apple M1 (Apple) (score 1000102)
2022-08-23 18:54:51+0800: Creating context for OpenCL Platform: Apple (Apple) (OpenCL 1.2 (Jun 17 2022 18:58:24))
2022-08-23 18:54:51+0800: Using OpenCL Device 0: Apple M1 (Apple) OpenCL 1.2 (Extensions: cl_APPLE_SetMemObjectDestructor cl_APPLE_ContextLoggingFunctions cl_APPLE_clut cl_APPLE_query_kernel_names cl_APPLE_gl_sharing cl_khr_gl_event cl_khr_byte_addressable_store cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_3d_image_writes cl_khr_image2d_from_buffer cl_khr_depth_images )
2022-08-23 18:54:51+0800: Loaded tuning parameters from: /Users/ohho/.katago/opencltuning/tune8_gpuAppleM1_x19_y19_c320_mv8.txt
2022-08-23 18:54:51+0800: OpenCL backend thread 0: Model version 8
2022-08-23 18:54:51+0800: OpenCL backend thread 0: Model name: g170-b30c320x2-s4824661760-d1229536699
2022-08-23 18:54:52+0800: OpenCL backend thread 0: FP16Storage false FP16Compute false FP16TensorCores false
Possible numbers of threads to test: 1, 2, 3, 4, 5, 6, 8, 10, 12, 16, 20, 24, 32,
numSearchThreads = 5: 10 / 10 positions, visits/s = 32.49 nnEvals/s = 27.29 nnBatches/s = 10.97 avgBatchSize = 2.49 (247.4 secs)
numSearchThreads = 12: 10 / 10 positions, visits/s = 38.18 nnEvals/s = 31.28 nnBatches/s = 5.29 avgBatchSize = 5.91 (212.2 secs)
numSearchThreads = 3: 10 / 10 positions, visits/s = 24.52 nnEvals/s = 20.42 nnBatches/s = 13.63 avgBatchSize = 1.50 (327.1 secs)
numSearchThreads = 6: 10 / 10 positions, visits/s = 32.69 nnEvals/s = 26.87 nnBatches/s = 9.02 avgBatchSize = 2.98 (246.1 secs)
numSearchThreads = 8: 10 / 10 positions, visits/s = 34.11 nnEvals/s = 27.67 nnBatches/s = 6.98 avgBatchSize = 3.96 (236.3 secs)
numSearchThreads = 4: 10 / 10 positions, visits/s = 22.86 nnEvals/s = 18.96 nnBatches/s = 9.51 avgBatchSize = 1.99 (351.2 secs)
Ordered summary of results:
numSearchThreads = 3: 10 / 10 positions, visits/s = 24.52 nnEvals/s = 20.42 nnBatches/s = 13.63 avgBatchSize = 1.50 (327.1 secs) (EloDiff baseline)
numSearchThreads = 4: 10 / 10 positions, visits/s = 22.86 nnEvals/s = 18.96 nnBatches/s = 9.51 avgBatchSize = 1.99 (351.2 secs) (EloDiff -37)
numSearchThreads = 5: 10 / 10 positions, visits/s = 32.49 nnEvals/s = 27.29 nnBatches/s = 10.97 avgBatchSize = 2.49 (247.4 secs) (EloDiff +81)
numSearchThreads = 6: 10 / 10 positions, visits/s = 32.69 nnEvals/s = 26.87 nnBatches/s = 9.02 avgBatchSize = 2.98 (246.1 secs) (EloDiff +73)
numSearchThreads = 8: 10 / 10 positions, visits/s = 34.11 nnEvals/s = 27.67 nnBatches/s = 6.98 avgBatchSize = 3.96 (236.3 secs) (EloDiff +67)
numSearchThreads = 12: 10 / 10 positions, visits/s = 38.18 nnEvals/s = 31.28 nnBatches/s = 5.29 avgBatchSize = 5.91 (212.2 secs) (EloDiff +67)
Based on some test data, each speed doubling gains perhaps ~250 Elo by searching deeper.
Based on some test data, each thread costs perhaps 7 Elo if using 800 visits, and 2 Elo if using 5000 visits (by making MCTS worse).
So APPROXIMATELY based on this benchmark, if you intend to do a 5 second search:
numSearchThreads = 3: (baseline)
numSearchThreads = 4: -37 Elo
numSearchThreads = 5: +81 Elo (recommended)
numSearchThreads = 6: +73 Elo
numSearchThreads = 8: +67 Elo
numSearchThreads = 12: +67 Elo
If you care about performance, you may want to edit numSearchThreads in /opt/homebrew/Cellar/katago/1.11.0/share/katago/configs/gtp_example.cfg based on the above results!
If you intend to do much longer searches, configure the seconds per game move you expect with the '-time' flag and benchmark again.
If you intend to do short or fixed-visit searches, use lower numSearchThreads for better strength, high threads will weaken strength.
If interested see also other notes about performance and mem usage in the top of /opt/homebrew/Cellar/katago/1.11.0/share/katago/configs/gtp_example.cfg
2022-08-23 19:21:55+0800: GPU -1 finishing, processed 39878 rows 15507 batches
@horaceho
Your CoreML performance:
numSearchThreads = 2: 10 / 10 positions, visits/s = 208.05 nnEvals/s = 168.60 nnBatches/s = 168.55 avgBatchSize = 1.00 (38.5 secs) (EloDiff +3)
Your OpenCL performance:
numSearchThreads = 5: 10 / 10 positions, visits/s = 32.49 nnEvals/s = 27.29 nnBatches/s = 10.97 avgBatchSize = 2.49 (247.4 secs) (EloDiff +81)
CoreML performance appears to be ~6x faster than OpenCL, but their network sizes are different. The CoreML model is converted from a b40c256 network, but your OpenCL benchmark runs with a b30c320x2 network.
2022-08-23 18:54:51+0800: OpenCL backend thread 0: Model name: g170-b30c320x2-s4824661760-d1229536699
Did you run the OpenCL benchmark with a b40c256 network? I would like to see its performance numbers.
@ChinChangYang Benchmark of model g170-b40c256x2-s5095420928-d1229425124:
% /opt/homebrew/bin/katago benchmark -config $(brew list --verbose katago | grep gtp_example.cfg) -model /opt/homebrew/Cellar/katago/1.11.0/share/katago/g170-b40c256x2-s5095420928-d1229425124.bin.gz
2022-09-05 09:36:38+0800: Loading model and initializing benchmark...
2022-09-05 09:36:38+0800: Testing with default positions for board size: 19
2022-09-05 09:36:38+0800: nnRandSeed0 = 12541034244607286625
2022-09-05 09:36:38+0800: After dedups: nnModelFile0 = /opt/homebrew/Cellar/katago/1.11.0/share/katago/g170-b40c256x2-s5095420928-d1229425124.bin.gz useFP16 auto useNHWC auto
2022-09-05 09:36:38+0800: Initializing neural net buffer to be size 19 * 19 exactly
2022-09-05 09:36:39+0800: Found OpenCL Platform 0: Apple (Apple) (OpenCL 1.2 (Jun 17 2022 18:58:24))
2022-09-05 09:36:39+0800: Found 1 device(s) on platform 0 with type CPU or GPU or Accelerator
2022-09-05 09:36:39+0800: Found OpenCL Device 0: Apple M2 (Apple) (score 1000102)
2022-09-05 09:36:39+0800: Creating context for OpenCL Platform: Apple (Apple) (OpenCL 1.2 (Jun 17 2022 18:58:24))
2022-09-05 09:36:39+0800: Using OpenCL Device 0: Apple M2 (Apple) OpenCL 1.2 (Extensions: cl_APPLE_SetMemObjectDestructor cl_APPLE_ContextLoggingFunctions cl_APPLE_clut cl_APPLE_query_kernel_names cl_APPLE_gl_sharing cl_khr_gl_event cl_khr_byte_addressable_store cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_3d_image_writes cl_khr_image2d_from_buffer cl_khr_depth_images )
2022-09-05 09:36:39+0800: No existing tuning parameters found or parseable or valid at: /Users/ohho/.katago/opencltuning/tune8_gpuAppleM2_x19_y19_c256_mv8.txt
2022-09-05 09:36:39+0800: Performing autotuning
2022-09-05 09:36:39+0800: *** On some systems, this may take several minutes, please be patient ***
2022-09-05 09:36:39+0800: Found OpenCL Platform 0: Apple (Apple) (OpenCL 1.2 (Jun 17 2022 18:58:24))
2022-09-05 09:36:39+0800: Found 1 device(s) on platform 0 with type CPU or GPU or Accelerator
2022-09-05 09:36:39+0800: Found OpenCL Device 0: Apple M2 (Apple) (score 1000102)
2022-09-05 09:36:39+0800: Creating context for OpenCL Platform: Apple (Apple) (OpenCL 1.2 (Jun 17 2022 18:58:24))
2022-09-05 09:36:39+0800: Using OpenCL Device 0: Apple M2 (Apple) OpenCL 1.2 (Extensions: cl_APPLE_SetMemObjectDestructor cl_APPLE_ContextLoggingFunctions cl_APPLE_clut cl_APPLE_query_kernel_names cl_APPLE_gl_sharing cl_khr_gl_event cl_khr_byte_addressable_store cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_3d_image_writes cl_khr_image2d_from_buffer cl_khr_depth_images )
Beginning GPU tuning for Apple M2 modelVersion 8 channels 256
Setting winograd3x3TileSize = 4
------------------------------------------------------
Tuning xGemmDirect for 1x1 convolutions and matrix mult
Testing 56 different configs
Tuning 0/56 (reference) Calls/sec 1777.34 L2Error 0 WGD=8 MDIMCD=1 NDIMCD=1 MDIMAD=1 NDIMBD=1 KWID=1 VWMD=1 VWND=1 PADA=1 PADB=1
Tuning 2/56 Calls/sec 97770.3 L2Error 0 WGD=8 MDIMCD=8 NDIMCD=8 MDIMAD=8 NDIMBD=8 KWID=1 VWMD=1 VWND=1 PADA=1 PADB=1
Tuning 3/56 Calls/sec 150561 L2Error 0 WGD=16 MDIMCD=8 NDIMCD=8 MDIMAD=8 NDIMBD=8 KWID=1 VWMD=1 VWND=1 PADA=1 PADB=1
Tuning 4/56 Calls/sec 194481 L2Error 0 WGD=32 MDIMCD=16 NDIMCD=8 MDIMAD=8 NDIMBD=16 KWID=2 VWMD=2 VWND=2 PADA=1 PADB=1
Tuning 6/56 Calls/sec 195200 L2Error 0 WGD=32 MDIMCD=16 NDIMCD=8 MDIMAD=16 NDIMBD=16 KWID=2 VWMD=2 VWND=2 PADA=1 PADB=1
Tuning 18/56 Calls/sec 197001 L2Error 0 WGD=32 MDIMCD=16 NDIMCD=8 MDIMAD=8 NDIMBD=8 KWID=2 VWMD=2 VWND=4 PADA=1 PADB=1
Tuning 40/56 ...
------------------------------------------------------
Tuning xGemm for convolutions
Testing 70 different configs
Tuning 0/70 (reference) Calls/sec 2978.41 L2Error 0 MWG=8 NWG=8 KWG=8 MDIMC=1 NDIMC=1 MDIMA=1 NDIMB=1 KWI=1 VWM=1 VWN=1 STRM=0 STRN=0 SA=0 SB=0
Tuning 2/70 Calls/sec 31199.1 L2Error 0 MWG=8 NWG=8 KWG=8 MDIMC=8 NDIMC=8 MDIMA=8 NDIMB=8 KWI=1 VWM=1 VWN=1 STRM=0 STRN=0 SA=0 SB=0
Tuning 3/70 Calls/sec 50931.1 L2Error 0 MWG=16 NWG=16 KWG=16 MDIMC=8 NDIMC=8 MDIMA=8 NDIMB=8 KWI=1 VWM=1 VWN=1 STRM=0 STRN=0 SA=0 SB=0
Tuning 4/70 Calls/sec 63185 L2Error 0 MWG=32 NWG=32 KWG=16 MDIMC=16 NDIMC=16 MDIMA=16 NDIMB=16 KWI=2 VWM=2 VWN=2 STRM=0 STRN=0 SA=1 SB=1
Tuning 6/70 Calls/sec 91689.7 L2Error 0 MWG=64 NWG=64 KWG=32 MDIMC=8 NDIMC=16 MDIMA=8 NDIMB=16 KWI=2 VWM=4 VWN=4 STRM=0 STRN=0 SA=0 SB=0
Tuning 20/70 ...
Tuning 40/70 ...
Tuning 60/70 ...
------------------------------------------------------
Tuning hGemmWmma for convolutions
Testing 146 different configs
UNSUPPORTED (log once): buildComputeProgram: cl2Metal failed
FP16 tensor core tuning failed, assuming no FP16 tensor core support
------------------------------------------------------
Tuning xGemm for convolutions - trying with FP16 storage
Testing 70 different configs
Tuning 0/70 (reference) Calls/sec 4956.83 L2Error 0 MWG=8 NWG=8 KWG=8 MDIMC=1 NDIMC=1 MDIMA=1 NDIMB=1 KWI=1 VWM=1 VWN=1 STRM=0 STRN=0 SA=0 SB=0
Tuning 1/70 Calls/sec 80034.3 L2Error 0 MWG=64 NWG=64 KWG=32 MDIMC=8 NDIMC=16 MDIMA=8 NDIMB=16 KWI=2 VWM=4 VWN=4 STRM=0 STRN=0 SA=0 SB=0
Tuning 4/70 Calls/sec 116233 L2Error 0 MWG=64 NWG=64 KWG=32 MDIMC=8 NDIMC=8 MDIMA=8 NDIMB=8 KWI=2 VWM=2 VWN=2 STRM=0 STRN=0 SA=0 SB=0
Tuning 5/70 Calls/sec 138009 L2Error 0 MWG=64 NWG=64 KWG=16 MDIMC=8 NDIMC=16 MDIMA=8 NDIMB=16 KWI=2 VWM=2 VWN=2 STRM=0 STRN=0 SA=1 SB=1
Tuning 15/70 Calls/sec 145099 L2Error 0 MWG=64 NWG=64 KWG=32 MDIMC=8 NDIMC=16 MDIMA=8 NDIMB=16 KWI=2 VWM=4 VWN=4 STRM=0 STRN=0 SA=1 SB=1
Tuning 40/70 ...
Tuning 60/70 ...
Enabling FP16 storage due to better performance
------------------------------------------------------
Using FP16 storage!
Using FP32 compute!
------------------------------------------------------
Tuning winograd transform for convolutions
Testing 47 different configs
Tuning 0/47 (reference) Calls/sec 197924 L2Error 0 transLocalSize0=1 transLocalSize1=1
Tuning 2/47 Calls/sec 343958 L2Error 0 transLocalSize0=2 transLocalSize1=4
Tuning 3/47 Calls/sec 1.0029e+06 L2Error 0 transLocalSize0=32 transLocalSize1=4
Tuning 20/47 ...
Tuning 22/47 Calls/sec 1.01523e+06 L2Error 0 transLocalSize0=32 transLocalSize1=8
Tuning 26/47 Calls/sec 1.02998e+06 L2Error 0 transLocalSize0=64 transLocalSize1=1
Tuning 36/47 Calls/sec 1.03639e+06 L2Error 0 transLocalSize0=8 transLocalSize1=1
Tuning 41/47 Calls/sec 1.06458e+06 L2Error 0 transLocalSize0=128 transLocalSize1=2
------------------------------------------------------
Tuning winograd untransform for convolutions
Testing 111 different configs
Tuning 0/111 (reference) Calls/sec 309098 L2Error 0 untransLocalSize0=1 untransLocalSize1=1 untransLocalSize2=1
Tuning 1/111 Calls/sec 309172 L2Error 0 untransLocalSize0=1 untransLocalSize1=1 untransLocalSize2=1
Tuning 4/111 Calls/sec 529349 L2Error 0 untransLocalSize0=32 untransLocalSize1=2 untransLocalSize2=2
Tuning 6/111 Calls/sec 931484 L2Error 0 untransLocalSize0=16 untransLocalSize1=2 untransLocalSize2=8
Tuning 9/111 Calls/sec 1.27352e+06 L2Error 0 untransLocalSize0=8 untransLocalSize1=4 untransLocalSize2=8
Tuning 20/111 ...
Tuning 22/111 Calls/sec 1.311e+06 L2Error 0 untransLocalSize0=8 untransLocalSize1=2 untransLocalSize2=4
Tuning 29/111 Calls/sec 1.36882e+06 L2Error 0 untransLocalSize0=8 untransLocalSize1=2 untransLocalSize2=2
Tuning 40/111 ...
Tuning 60/111 ...
Tuning 80/111 ...
Tuning 100/111 ...
------------------------------------------------------
Tuning global pooling strides
Testing 106 different configs
Tuning 0/106 (reference) Calls/sec 1.19927e+06 L2Error 0 XYSTRIDE=1 CHANNELSTRIDE=1 BATCHSTRIDE=1
Tuning 1/106 Calls/sec 1.27082e+06 L2Error 0 XYSTRIDE=1 CHANNELSTRIDE=1 BATCHSTRIDE=1
Tuning 2/106 Calls/sec 3.71021e+06 L2Error 2.73781e-13 XYSTRIDE=8 CHANNELSTRIDE=8 BATCHSTRIDE=4
Tuning 9/106 Calls/sec 4.30156e+06 L2Error 2.73781e-13 XYSTRIDE=8 CHANNELSTRIDE=2 BATCHSTRIDE=1
Tuning 11/106 Calls/sec 5.29248e+06 L2Error 2.73781e-13 XYSTRIDE=16 CHANNELSTRIDE=4 BATCHSTRIDE=4
Tuning 15/106 Calls/sec 5.98614e+06 L2Error 2.73781e-13 XYSTRIDE=16 CHANNELSTRIDE=1 BATCHSTRIDE=1
Tuning 25/106 Calls/sec 6.13497e+06 L2Error 2.73781e-13 XYSTRIDE=32 CHANNELSTRIDE=2 BATCHSTRIDE=4
Tuning 40/106 Calls/sec 6.56985e+06 L2Error 2.73781e-13 XYSTRIDE=32 CHANNELSTRIDE=2 BATCHSTRIDE=1
Tuning 44/106 Calls/sec 6.58807e+06 L2Error 2.73781e-13 XYSTRIDE=32 CHANNELSTRIDE=4 BATCHSTRIDE=2
Tuning 60/106 ...
Tuning 65/106 Calls/sec 6.6133e+06 L2Error 2.73781e-13 XYSTRIDE=32 CHANNELSTRIDE=4 BATCHSTRIDE=1
Tuning 80/106 ...
Tuning 100/106 ...
Done tuning
------------------------------------------------------
2022-09-05 09:36:54+0800: Done tuning, saved results to /Users/ohho/.katago/opencltuning/tune8_gpuAppleM2_x19_y19_c256_mv8.txt
2022-09-05 09:36:55+0800: OpenCL backend thread 0: Model version 8
2022-09-05 09:36:55+0800: OpenCL backend thread 0: Model name: g170-b40c256x2-s5095420928-d1229425124
2022-09-05 09:36:56+0800: OpenCL backend thread 0: FP16Storage true FP16Compute false FP16TensorCores false
2022-09-05 09:36:57+0800: Loaded config /opt/homebrew/Cellar/katago/1.11.0/share/katago/configs/gtp_example.cfg
2022-09-05 09:36:57+0800: Loaded model /opt/homebrew/Cellar/katago/1.11.0/share/katago/g170-b40c256x2-s5095420928-d1229425124.bin.gz
Testing using 800 visits.
If you have a good GPU, you might increase this using "-visits N" to get more accurate results.
If you have a weak GPU and this is taking forever, you can decrease it instead to finish the benchmark faster.
You are currently using the OpenCL version of KataGo.
If you have a strong GPU capable of FP16 tensor cores (e.g. RTX2080), using the Cuda version of KataGo instead may give a mild performance boost.
Your GTP config is currently set to use numSearchThreads = 6
Automatically trying different numbers of threads to home in on the best (board size 19x19):
2022-09-05 09:36:57+0800: GPU -1 finishing, processed 5 rows 5 batches
2022-09-05 09:36:57+0800: nnRandSeed0 = 14708806141274676129
2022-09-05 09:36:57+0800: After dedups: nnModelFile0 = /opt/homebrew/Cellar/katago/1.11.0/share/katago/g170-b40c256x2-s5095420928-d1229425124.bin.gz useFP16 auto useNHWC auto
2022-09-05 09:36:57+0800: Initializing neural net buffer to be size 19 * 19 exactly
2022-09-05 09:36:58+0800: Found OpenCL Platform 0: Apple (Apple) (OpenCL 1.2 (Jun 17 2022 18:58:24))
2022-09-05 09:36:58+0800: Found 1 device(s) on platform 0 with type CPU or GPU or Accelerator
2022-09-05 09:36:58+0800: Found OpenCL Device 0: Apple M2 (Apple) (score 1000102)
2022-09-05 09:36:58+0800: Creating context for OpenCL Platform: Apple (Apple) (OpenCL 1.2 (Jun 17 2022 18:58:24))
2022-09-05 09:36:58+0800: Using OpenCL Device 0: Apple M2 (Apple) OpenCL 1.2 (Extensions: cl_APPLE_SetMemObjectDestructor cl_APPLE_ContextLoggingFunctions cl_APPLE_clut cl_APPLE_query_kernel_names cl_APPLE_gl_sharing cl_khr_gl_event cl_khr_byte_addressable_store cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_3d_image_writes cl_khr_image2d_from_buffer cl_khr_depth_images )
2022-09-05 09:36:58+0800: Loaded tuning parameters from: /Users/ohho/.katago/opencltuning/tune8_gpuAppleM2_x19_y19_c256_mv8.txt
2022-09-05 09:36:58+0800: OpenCL backend thread 0: Model version 8
2022-09-05 09:36:58+0800: OpenCL backend thread 0: Model name: g170-b40c256x2-s5095420928-d1229425124
2022-09-05 09:36:59+0800: OpenCL backend thread 0: FP16Storage true FP16Compute false FP16TensorCores false
Possible numbers of threads to test: 1, 2, 3, 4, 5, 6, 8, 10, 12, 16, 20, 24, 32,
numSearchThreads = 5: 10 / 10 positions, visits/s = 109.96 nnEvals/s = 93.32 nnBatches/s = 37.47 avgBatchSize = 2.49 (73.1 secs)
numSearchThreads = 12: 10 / 10 positions, visits/s = 113.88 nnEvals/s = 93.38 nnBatches/s = 15.77 avgBatchSize = 5.92 (71.2 secs)
numSearchThreads = 3: 10 / 10 positions, visits/s = 67.04 nnEvals/s = 56.50 nnBatches/s = 37.73 avgBatchSize = 1.50 (119.6 secs)
numSearchThreads = 6: 10 / 10 positions, visits/s = 97.59 nnEvals/s = 79.33 nnBatches/s = 26.63 avgBatchSize = 2.98 (82.5 secs)
numSearchThreads = 8: 10 / 10 positions, visits/s = 94.84 nnEvals/s = 80.30 nnBatches/s = 20.26 avgBatchSize = 3.96 (85.1 secs)
numSearchThreads = 4: 10 / 10 positions, visits/s = 67.62 nnEvals/s = 56.96 nnBatches/s = 28.59 avgBatchSize = 1.99 (118.8 secs)
Ordered summary of results:
numSearchThreads = 3: 10 / 10 positions, visits/s = 67.04 nnEvals/s = 56.50 nnBatches/s = 37.73 avgBatchSize = 1.50 (119.6 secs) (EloDiff baseline)
numSearchThreads = 4: 10 / 10 positions, visits/s = 67.62 nnEvals/s = 56.96 nnBatches/s = 28.59 avgBatchSize = 1.99 (118.8 secs) (EloDiff -6)
numSearchThreads = 5: 10 / 10 positions, visits/s = 109.96 nnEvals/s = 93.32 nnBatches/s = 37.47 avgBatchSize = 2.49 (73.1 secs) (EloDiff +166)
numSearchThreads = 6: 10 / 10 positions, visits/s = 97.59 nnEvals/s = 79.33 nnBatches/s = 26.63 avgBatchSize = 2.98 (82.5 secs) (EloDiff +113)
numSearchThreads = 8: 10 / 10 positions, visits/s = 94.84 nnEvals/s = 80.30 nnBatches/s = 20.26 avgBatchSize = 3.96 (85.1 secs) (EloDiff +85)
numSearchThreads = 12: 10 / 10 positions, visits/s = 113.88 nnEvals/s = 93.38 nnBatches/s = 15.77 avgBatchSize = 5.92 (71.2 secs) (EloDiff +123)
Based on some test data, each speed doubling gains perhaps ~250 Elo by searching deeper.
Based on some test data, each thread costs perhaps 7 Elo if using 800 visits, and 2 Elo if using 5000 visits (by making MCTS worse).
So APPROXIMATELY based on this benchmark, if you intend to do a 5 second search:
numSearchThreads = 3: (baseline)
numSearchThreads = 4: -6 Elo
numSearchThreads = 5: +166 Elo (recommended)
numSearchThreads = 6: +113 Elo
numSearchThreads = 8: +85 Elo
numSearchThreads = 12: +123 Elo
If you care about performance, you may want to edit numSearchThreads in /opt/homebrew/Cellar/katago/1.11.0/share/katago/configs/gtp_example.cfg based on the above results!
If you intend to do much longer searches, configure the seconds per game move you expect with the '-time' flag and benchmark again.
If you intend to do short or fixed-visit searches, use lower numSearchThreads for better strength, high threads will weaken strength.
If interested see also other notes about performance and mem usage in the top of /opt/homebrew/Cellar/katago/1.11.0/share/katago/configs/gtp_example.cfg
2022-09-05 09:46:10+0800: GPU -1 finishing, processed 40377 rows 15697 batches
Benchmark of model g170e-b20c256x2-s5303129600-d1228401921 on M2:
% /opt/homebrew/bin/katago benchmark -config $(brew list --verbose katago | grep gtp_example.cfg) -model /opt/homebrew/Cellar/katago/1.11.0/share/katago/g170e-b20c256x2-s5303129600-d1228401921.bin.gz
2022-09-05 09:50:12+0800: Loading model and initializing benchmark...
2022-09-05 09:50:12+0800: Testing with default positions for board size: 19
2022-09-05 09:50:12+0800: nnRandSeed0 = 16568776881164312156
2022-09-05 09:50:12+0800: After dedups: nnModelFile0 = /opt/homebrew/Cellar/katago/1.11.0/share/katago/g170e-b20c256x2-s5303129600-d1228401921.bin.gz useFP16 auto useNHWC auto
2022-09-05 09:50:12+0800: Initializing neural net buffer to be size 19 * 19 exactly
2022-09-05 09:50:12+0800: Found OpenCL Platform 0: Apple (Apple) (OpenCL 1.2 (Jun 17 2022 18:58:24))
2022-09-05 09:50:12+0800: Found 1 device(s) on platform 0 with type CPU or GPU or Accelerator
2022-09-05 09:50:12+0800: Found OpenCL Device 0: Apple M2 (Apple) (score 1000102)
2022-09-05 09:50:12+0800: Creating context for OpenCL Platform: Apple (Apple) (OpenCL 1.2 (Jun 17 2022 18:58:24))
2022-09-05 09:50:12+0800: Using OpenCL Device 0: Apple M2 (Apple) OpenCL 1.2 (Extensions: cl_APPLE_SetMemObjectDestructor cl_APPLE_ContextLoggingFunctions cl_APPLE_clut cl_APPLE_query_kernel_names cl_APPLE_gl_sharing cl_khr_gl_event cl_khr_byte_addressable_store cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_3d_image_writes cl_khr_image2d_from_buffer cl_khr_depth_images )
2022-09-05 09:50:12+0800: No existing tuning parameters found or parseable or valid at: /Users/ohho/.katago/opencltuning/tune8_gpuAppleM2_x19_y19_c256_mv8.txt
2022-09-05 09:50:12+0800: Performing autotuning
2022-09-05 09:50:12+0800: *** On some systems, this may take several minutes, please be patient ***
2022-09-05 09:50:12+0800: Found OpenCL Platform 0: Apple (Apple) (OpenCL 1.2 (Jun 17 2022 18:58:24))
2022-09-05 09:50:12+0800: Found 1 device(s) on platform 0 with type CPU or GPU or Accelerator
2022-09-05 09:50:12+0800: Found OpenCL Device 0: Apple M2 (Apple) (score 1000102)
2022-09-05 09:50:12+0800: Creating context for OpenCL Platform: Apple (Apple) (OpenCL 1.2 (Jun 17 2022 18:58:24))
2022-09-05 09:50:12+0800: Using OpenCL Device 0: Apple M2 (Apple) OpenCL 1.2 (Extensions: cl_APPLE_SetMemObjectDestructor cl_APPLE_ContextLoggingFunctions cl_APPLE_clut cl_APPLE_query_kernel_names cl_APPLE_gl_sharing cl_khr_gl_event cl_khr_byte_addressable_store cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_3d_image_writes cl_khr_image2d_from_buffer cl_khr_depth_images )
Beginning GPU tuning for Apple M2 modelVersion 8 channels 256
Setting winograd3x3TileSize = 4
------------------------------------------------------
Tuning xGemmDirect for 1x1 convolutions and matrix mult
Testing 56 different configs
Tuning 0/56 (reference) Calls/sec 1778.53 L2Error 0 WGD=8 MDIMCD=1 NDIMCD=1 MDIMAD=1 NDIMBD=1 KWID=1 VWMD=1 VWND=1 PADA=1 PADB=1
Tuning 1/56 Calls/sec 1783.64 L2Error 0 WGD=8 MDIMCD=1 NDIMCD=1 MDIMAD=1 NDIMBD=1 KWID=1 VWMD=1 VWND=1 PADA=1 PADB=1
Tuning 2/56 Calls/sec 97617 L2Error 0 WGD=8 MDIMCD=8 NDIMCD=8 MDIMAD=8 NDIMBD=8 KWID=1 VWMD=1 VWND=1 PADA=1 PADB=1
Tuning 3/56 Calls/sec 149727 L2Error 0 WGD=16 MDIMCD=8 NDIMCD=8 MDIMAD=8 NDIMBD=8 KWID=1 VWMD=1 VWND=1 PADA=1 PADB=1
Tuning 4/56 Calls/sec 159971 L2Error 0 WGD=32 MDIMCD=8 NDIMCD=8 MDIMAD=8 NDIMBD=16 KWID=2 VWMD=4 VWND=2 PADA=1 PADB=1
Tuning 5/56 Calls/sec 179006 L2Error 0 WGD=32 MDIMCD=8 NDIMCD=8 MDIMAD=8 NDIMBD=8 KWID=8 VWMD=4 VWND=2 PADA=1 PADB=1
Tuning 7/56 Calls/sec 195008 L2Error 0 WGD=32 MDIMCD=8 NDIMCD=16 MDIMAD=8 NDIMBD=8 KWID=2 VWMD=2 VWND=2 PADA=1 PADB=1
Tuning 20/56 ...
Tuning 38/56 Calls/sec 197076 L2Error 0 WGD=32 MDIMCD=16 NDIMCD=8 MDIMAD=8 NDIMBD=8 KWID=2 VWMD=2 VWND=4 PADA=1 PADB=1
------------------------------------------------------
Tuning xGemm for convolutions
Testing 70 different configs
Tuning 0/70 (reference) Calls/sec 2972.14 L2Error 0 MWG=8 NWG=8 KWG=8 MDIMC=1 NDIMC=1 MDIMA=1 NDIMB=1 KWI=1 VWM=1 VWN=1 STRM=0 STRN=0 SA=0 SB=0
Tuning 2/70 Calls/sec 31191 L2Error 0 MWG=8 NWG=8 KWG=8 MDIMC=8 NDIMC=8 MDIMA=8 NDIMB=8 KWI=1 VWM=1 VWN=1 STRM=0 STRN=0 SA=0 SB=0
Tuning 3/70 Calls/sec 50849.2 L2Error 0 MWG=16 NWG=16 KWG=16 MDIMC=8 NDIMC=8 MDIMA=8 NDIMB=8 KWI=1 VWM=1 VWN=1 STRM=0 STRN=0 SA=0 SB=0
Tuning 4/70 Calls/sec 84860 L2Error 0 MWG=64 NWG=64 KWG=32 MDIMC=8 NDIMC=16 MDIMA=8 NDIMB=16 KWI=2 VWM=4 VWN=4 STRM=0 STRN=0 SA=1 SB=1
Tuning 11/70 Calls/sec 87030.9 L2Error 0 MWG=64 NWG=64 KWG=32 MDIMC=16 NDIMC=16 MDIMA=16 NDIMB=16 KWI=2 VWM=4 VWN=4 STRM=0 STRN=0 SA=1 SB=1
Tuning 12/70 Calls/sec 87867.1 L2Error 0 MWG=64 NWG=64 KWG=32 MDIMC=16 NDIMC=8 MDIMA=16 NDIMB=8 KWI=2 VWM=4 VWN=4 STRM=0 STRN=0 SA=0 SB=0
Tuning 17/70 Calls/sec 89663.9 L2Error 0 MWG=64 NWG=64 KWG=16 MDIMC=8 NDIMC=8 MDIMA=8 NDIMB=8 KWI=2 VWM=2 VWN=2 STRM=0 STRN=0 SA=0 SB=0
Tuning 28/70 Calls/sec 91231.1 L2Error 0 MWG=64 NWG=64 KWG=32 MDIMC=8 NDIMC=8 MDIMA=8 NDIMB=8 KWI=2 VWM=4 VWN=4 STRM=0 STRN=0 SA=0 SB=0
Tuning 40/70 ...
Tuning 44/70 Calls/sec 91348.2 L2Error 0 MWG=64 NWG=64 KWG=32 MDIMC=8 NDIMC=16 MDIMA=8 NDIMB=16 KWI=2 VWM=4 VWN=4 STRM=0 STRN=0 SA=0 SB=0
Tuning 60/70 ...
------------------------------------------------------
Tuning hGemmWmma for convolutions
Testing 146 different configs
UNSUPPORTED (log once): buildComputeProgram: cl2Metal failed
FP16 tensor core tuning failed, assuming no FP16 tensor core support
------------------------------------------------------
Tuning xGemm for convolutions - trying with FP16 storage
Testing 70 different configs
Tuning 0/70 (reference) Calls/sec 4944.09 L2Error 0 MWG=8 NWG=8 KWG=8 MDIMC=1 NDIMC=1 MDIMA=1 NDIMB=1 KWI=1 VWM=1 VWN=1 STRM=0 STRN=0 SA=0 SB=0
Tuning 1/70 Calls/sec 142640 L2Error 0 MWG=64 NWG=64 KWG=32 MDIMC=8 NDIMC=16 MDIMA=8 NDIMB=16 KWI=2 VWM=4 VWN=4 STRM=0 STRN=0 SA=0 SB=0
Tuning 5/70 Calls/sec 145370 L2Error 0 MWG=64 NWG=64 KWG=32 MDIMC=8 NDIMC=16 MDIMA=8 NDIMB=16 KWI=2 VWM=4 VWN=4 STRM=0 STRN=0 SA=1 SB=1
Tuning 20/70 ...
Tuning 40/70 ...
Tuning 60/70 ...
Enabling FP16 storage due to better performance
------------------------------------------------------
Using FP16 storage!
Using FP32 compute!
------------------------------------------------------
Tuning winograd transform for convolutions
Testing 47 different configs
Tuning 0/47 (reference) Calls/sec 198693 L2Error 0 transLocalSize0=1 transLocalSize1=1
Tuning 2/47 Calls/sec 1.01488e+06 L2Error 0 transLocalSize0=64 transLocalSize1=2
Tuning 3/47 Calls/sec 1.03914e+06 L2Error 0 transLocalSize0=64 transLocalSize1=4
Tuning 20/47 Calls/sec 1.0772e+06 L2Error 0 transLocalSize0=128 transLocalSize1=1
Tuning 40/47 ...
------------------------------------------------------
Tuning winograd untransform for convolutions
Testing 111 different configs
Tuning 0/111 (reference) Calls/sec 308907 L2Error 0 untransLocalSize0=1 untransLocalSize1=1 untransLocalSize2=1
Tuning 2/111 Calls/sec 403461 L2Error 0 untransLocalSize0=1 untransLocalSize1=2 untransLocalSize2=4
Tuning 3/111 Calls/sec 1.26298e+06 L2Error 0 untransLocalSize0=8 untransLocalSize1=4 untransLocalSize2=4
Tuning 4/111 Calls/sec 1.29255e+06 L2Error 0 untransLocalSize0=8 untransLocalSize1=4 untransLocalSize2=2
Tuning 20/111 ...
Tuning 33/111 Calls/sec 1.36384e+06 L2Error 0 untransLocalSize0=8 untransLocalSize1=2 untransLocalSize2=1
Tuning 60/111 ...
Tuning 80/111 ...
Tuning 100/111 ...
------------------------------------------------------
Tuning global pooling strides
Testing 106 different configs
Tuning 0/106 (reference) Calls/sec 1.25049e+06 L2Error 0 XYSTRIDE=1 CHANNELSTRIDE=1 BATCHSTRIDE=1
Tuning 2/106 Calls/sec 2.38215e+06 L2Error 2.73781e-13 XYSTRIDE=4 CHANNELSTRIDE=8 BATCHSTRIDE=2
Tuning 4/106 Calls/sec 5.92639e+06 L2Error 2.73781e-13 XYSTRIDE=16 CHANNELSTRIDE=1 BATCHSTRIDE=4
Tuning 20/106 ...
Tuning 32/106 Calls/sec 6.27684e+06 L2Error 2.73781e-13 XYSTRIDE=32 CHANNELSTRIDE=2 BATCHSTRIDE=1
Tuning 35/106 Calls/sec 6.34603e+06 L2Error 2.73781e-13 XYSTRIDE=32 CHANNELSTRIDE=1 BATCHSTRIDE=4
Tuning 60/106 ...
Tuning 80/106 ...
Tuning 100/106 ...
Done tuning
------------------------------------------------------
2022-09-05 09:50:20+0800: Done tuning, saved results to /Users/ohho/.katago/opencltuning/tune8_gpuAppleM2_x19_y19_c256_mv8.txt
2022-09-05 09:50:20+0800: OpenCL backend thread 0: Model version 8
2022-09-05 09:50:20+0800: OpenCL backend thread 0: Model name: g170-b20c256x2-s5303129600-d1228401921
2022-09-05 09:50:20+0800: OpenCL backend thread 0: FP16Storage true FP16Compute false FP16TensorCores false
2022-09-05 09:50:21+0800: Loaded config /opt/homebrew/Cellar/katago/1.11.0/share/katago/configs/gtp_example.cfg
2022-09-05 09:50:21+0800: Loaded model /opt/homebrew/Cellar/katago/1.11.0/share/katago/g170e-b20c256x2-s5303129600-d1228401921.bin.gz
Testing using 800 visits.
If you have a good GPU, you might increase this using "-visits N" to get more accurate results.
If you have a weak GPU and this is taking forever, you can decrease it instead to finish the benchmark faster.
You are currently using the OpenCL version of KataGo.
If you have a strong GPU capable of FP16 tensor cores (e.g. RTX2080), using the Cuda version of KataGo instead may give a mild performance boost.
Your GTP config is currently set to use numSearchThreads = 6
Automatically trying different numbers of threads to home in on the best (board size 19x19):
2022-09-05 09:50:21+0800: GPU -1 finishing, processed 5 rows 5 batches
2022-09-05 09:50:21+0800: nnRandSeed0 = 9020088241128174097
2022-09-05 09:50:21+0800: After dedups: nnModelFile0 = /opt/homebrew/Cellar/katago/1.11.0/share/katago/g170e-b20c256x2-s5303129600-d1228401921.bin.gz useFP16 auto useNHWC auto
2022-09-05 09:50:21+0800: Initializing neural net buffer to be size 19 * 19 exactly
2022-09-05 09:50:21+0800: Found OpenCL Platform 0: Apple (Apple) (OpenCL 1.2 (Jun 17 2022 18:58:24))
2022-09-05 09:50:21+0800: Found 1 device(s) on platform 0 with type CPU or GPU or Accelerator
2022-09-05 09:50:21+0800: Found OpenCL Device 0: Apple M2 (Apple) (score 1000102)
2022-09-05 09:50:21+0800: Creating context for OpenCL Platform: Apple (Apple) (OpenCL 1.2 (Jun 17 2022 18:58:24))
2022-09-05 09:50:21+0800: Using OpenCL Device 0: Apple M2 (Apple) OpenCL 1.2 (Extensions: cl_APPLE_SetMemObjectDestructor cl_APPLE_ContextLoggingFunctions cl_APPLE_clut cl_APPLE_query_kernel_names cl_APPLE_gl_sharing cl_khr_gl_event cl_khr_byte_addressable_store cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_3d_image_writes cl_khr_image2d_from_buffer cl_khr_depth_images )
2022-09-05 09:50:21+0800: Loaded tuning parameters from: /Users/ohho/.katago/opencltuning/tune8_gpuAppleM2_x19_y19_c256_mv8.txt
2022-09-05 09:50:21+0800: OpenCL backend thread 0: Model version 8
2022-09-05 09:50:21+0800: OpenCL backend thread 0: Model name: g170-b20c256x2-s5303129600-d1228401921
2022-09-05 09:50:22+0800: OpenCL backend thread 0: FP16Storage true FP16Compute false FP16TensorCores false
Possible numbers of threads to test: 1, 2, 3, 4, 5, 6, 8, 10, 12, 16, 20, 24, 32,
numSearchThreads = 5: 10 / 10 positions, visits/s = 218.18 nnEvals/s = 175.66 nnBatches/s = 70.58 avgBatchSize = 2.49 (36.8 secs)
numSearchThreads = 12: 10 / 10 positions, visits/s = 293.21 nnEvals/s = 246.25 nnBatches/s = 41.58 avgBatchSize = 5.92 (27.6 secs)
numSearchThreads = 10: 10 / 10 positions, visits/s = 267.98 nnEvals/s = 219.78 nnBatches/s = 44.53 avgBatchSize = 4.94 (30.2 secs)
numSearchThreads = 20: 10 / 10 positions, visits/s = 235.11 nnEvals/s = 204.23 nnBatches/s = 20.85 avgBatchSize = 9.79 (34.8 secs)
numSearchThreads = 8: 10 / 10 positions, visits/s = 165.52 nnEvals/s = 139.19 nnBatches/s = 35.09 avgBatchSize = 3.97 (48.8 secs)
numSearchThreads = 16: 10 / 10 positions, visits/s = 196.70 nnEvals/s = 166.01 nnBatches/s = 21.13 avgBatchSize = 7.86 (41.4 secs)
Ordered summary of results:
numSearchThreads = 5: 10 / 10 positions, visits/s = 218.18 nnEvals/s = 175.66 nnBatches/s = 70.58 avgBatchSize = 2.49 (36.8 secs) (EloDiff baseline)
numSearchThreads = 8: 10 / 10 positions, visits/s = 165.52 nnEvals/s = 139.19 nnBatches/s = 35.09 avgBatchSize = 3.97 (48.8 secs) (EloDiff -124)
numSearchThreads = 10: 10 / 10 positions, visits/s = 267.98 nnEvals/s = 219.78 nnBatches/s = 44.53 avgBatchSize = 4.94 (30.2 secs) (EloDiff +50)
numSearchThreads = 12: 10 / 10 positions, visits/s = 293.21 nnEvals/s = 246.25 nnBatches/s = 41.58 avgBatchSize = 5.92 (27.6 secs) (EloDiff +74)
numSearchThreads = 16: 10 / 10 positions, visits/s = 196.70 nnEvals/s = 166.01 nnBatches/s = 21.13 avgBatchSize = 7.86 (41.4 secs) (EloDiff -109)
numSearchThreads = 20: 10 / 10 positions, visits/s = 235.11 nnEvals/s = 204.23 nnBatches/s = 20.85 avgBatchSize = 9.79 (34.8 secs) (EloDiff -60)
Based on some test data, each speed doubling gains perhaps ~250 Elo by searching deeper.
Based on some test data, each thread costs perhaps 7 Elo if using 800 visits, and 2 Elo if using 5000 visits (by making MCTS worse).
So APPROXIMATELY based on this benchmark, if you intend to do a 5 second search:
numSearchThreads = 5: (baseline)
numSearchThreads = 8: -124 Elo
numSearchThreads = 10: +50 Elo
numSearchThreads = 12: +74 Elo (recommended)
numSearchThreads = 16: -109 Elo
numSearchThreads = 20: -60 Elo
If you care about performance, you may want to edit numSearchThreads in /opt/homebrew/Cellar/katago/1.11.0/share/katago/configs/gtp_example.cfg based on the above results!
If you intend to do much longer searches, configure the seconds per game move you expect with the '-time' flag and benchmark again.
If you intend to do short or fixed-visit searches, use lower numSearchThreads for better strength, high threads will weaken strength.
If interested see also other notes about performance and mem usage in the top of /opt/homebrew/Cellar/katago/1.11.0/share/katago/configs/gtp_example.cfg
2022-09-05 09:54:03+0800: GPU -1 finishing, processed 40688 rows 8411 batches
@horaceho
Excellent!
According to your performance numbers for the b40c256x2 network, the CoreML backend is about 2x faster than the OpenCL backend. That is very encouraging.
I have improved the CoreML backend's performance by running two threads in my branch. The recipe is as follows.
git clone https://github.com/ChinChangYang/KataGo.git -b multi-coreml-backend
cd KataGo/cpp
mkdir build
cd build
cmake ../ -DUSE_BACKEND=COREML
make
wget https://github.com/ChinChangYang/KataGo/releases/download/v1.11.0-coreml1/KataGoModel.mlpackage.zip
unzip KataGoModel.mlpackage.zip
./katago benchmark -config ../configs/misc/coreml_example.cfg
Xcode is no longer needed to build KataGo: just run cmake with the CoreML backend, make, download a model, and run KataGo. It is easy now.
It significantly improves the CoreML backend's overall performance on my M1 machine. I would like to see whether it also improves performance on an M2 machine.
Thanks!
It significantly improves the CoreML backend's overall performance on my M1 machine. I would like to see whether it also improves performance on an M2 machine.
Both benchmarks posted by @horaceho are from M2 devices. 😉
It significantly improves the CoreML backend's overall performance on my M1 machine. I would like to see whether it also improves performance on an M2 machine.
Both benchmarks posted by @horaceho are from M2 devices. 😉
In v1.11.0-coreml1, CoreML only runs in 1 thread. In the multi-coreml-backend branch, CoreML can run in 2 threads, which further improves the CoreML backend's overall performance on my M1 machine. I would like to share my results with you; a rough Swift sketch of the two-thread idea follows after the results below.
OpenCL:
Ordered summary of results:
numSearchThreads = 5: 10 / 10 positions, visits/s = 127.91 nnEvals/s = 105.35 nnBatches/s = 42.33 avgBatchSize = 2.49 (62.9 secs) (EloDiff baseline)
numSearchThreads = 8: 10 / 10 positions, visits/s = 163.03 nnEvals/s = 134.23 nnBatches/s = 33.88 avgBatchSize = 3.96 (49.5 secs) (EloDiff +70)
numSearchThreads = 10: 10 / 10 positions, visits/s = 168.67 nnEvals/s = 138.70 nnBatches/s = 28.07 avgBatchSize = 4.94 (48.0 secs) (EloDiff +70)
numSearchThreads = 12: 10 / 10 positions, visits/s = 178.48 nnEvals/s = 146.17 nnBatches/s = 24.69 avgBatchSize = 5.92 (45.4 secs) (EloDiff +78)
numSearchThreads = 16: 10 / 10 positions, visits/s = 195.88 nnEvals/s = 163.82 nnBatches/s = 20.83 avgBatchSize = 7.86 (41.6 secs) (EloDiff +90)
numSearchThreads = 20: 10 / 10 positions, visits/s = 196.37 nnEvals/s = 169.76 nnBatches/s = 17.37 avgBatchSize = 9.77 (41.7 secs) (EloDiff +65)
The multi-coreml-backend branch:
Ordered summary of results:
numSearchThreads = 3: 10 / 10 positions, visits/s = 239.01 nnEvals/s = 194.19 nnBatches/s = 194.13 avgBatchSize = 1.00 (33.6 secs) (EloDiff baseline)
numSearchThreads = 4: 10 / 10 positions, visits/s = 244.04 nnEvals/s = 202.59 nnBatches/s = 162.14 avgBatchSize = 1.25 (32.9 secs) (EloDiff +2)
numSearchThreads = 5: 10 / 10 positions, visits/s = 232.76 nnEvals/s = 197.26 nnBatches/s = 138.44 avgBatchSize = 1.42 (34.5 secs) (EloDiff -22)
numSearchThreads = 6: 10 / 10 positions, visits/s = 234.60 nnEvals/s = 199.10 nnBatches/s = 120.42 avgBatchSize = 1.65 (34.3 secs) (EloDiff -24)
numSearchThreads = 10: 10 / 10 positions, visits/s = 238.27 nnEvals/s = 197.12 nnBatches/s = 86.83 avgBatchSize = 2.27 (34.0 secs) (EloDiff -42)
numSearchThreads = 20: 10 / 10 positions, visits/s = 230.38 nnEvals/s = 197.37 nnBatches/s = 53.35 avgBatchSize = 3.70 (35.5 secs) (EloDiff -114)
By the way, the multi-coreml-backend branch also supports arbitrary board sizes up to 19x19 now.
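For illustration, here is a minimal Swift sketch of the two-thread idea. It is not the code in the branch: it just creates one MLModel instance per NN server thread and queries them concurrently, and the model path and the empty input dictionary are placeholders.
import CoreML
import Foundation
import Dispatch

// Placeholder path for this sketch; the real backend loads its own compiled model.
let modelURL = URL(fileURLWithPath: "KataGoModel.mlmodelc")
let config = MLModelConfiguration()
config.computeUnits = .all

// One independent MLModel instance per NN server thread, so evaluations do not
// serialize on a single model handle.
let models: [MLModel] = try! (0..<2).map { _ in
    try MLModel(contentsOf: modelURL, configuration: config)
}

let group = DispatchGroup()
for (i, model) in models.enumerated() {
    DispatchQueue.global().async(group: group) {
        // A real backend would fill MLMultiArrays keyed by the model's input names;
        // an empty dictionary stands in for them in this sketch.
        if let inputs = try? MLDictionaryFeatureProvider(dictionary: [:]),
           let outputs = try? model.prediction(from: inputs) {
            print("thread \(i) outputs: \(outputs.featureNames)")
        }
    }
}
group.wait()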
@ChinChangYang Benchmark of multi-coreml-backend on M2:
% ./katago benchmark -config ../configs/misc/coreml_example.cfg -model /opt/homebrew/Cellar/katago/1.11.0/share/katago/g170-b40c256x2-s5095420928-d1229425124.bin.gz
2022-09-05 14:18:03+0800: Running with following config:
allowResignation = true
coremlDeviceToUseThread0 = 0
coremlDeviceToUseThread1 = 1
lagBuffer = 1.0
logAllGTPCommunication = true
logDir = gtp_logs
logSearchInfo = true
logToStderr = false
maxTimePondering = 60
maxVisits = 500
numNNServerThreadsPerModel = 2
numSearchThreads = 3
ponderingEnabled = false
resignConsecTurns = 3
resignThreshold = -0.90
rules = tromp-taylor
searchFactorAfterOnePass = 0.50
searchFactorAfterTwoPass = 0.25
searchFactorWhenWinning = 0.40
searchFactorWhenWinningThreshold = 0.95
2022-09-05 14:18:03+0800: Loading model and initializing benchmark...
2022-09-05 14:18:03+0800: Testing with default positions for board size: 19
2022-09-05 14:18:03+0800: nnRandSeed0 = 13011196210133956686
2022-09-05 14:18:03+0800: After dedups: nnModelFile0 = /opt/homebrew/Cellar/katago/1.11.0/share/katago/g170-b40c256x2-s5095420928-d1229425124.bin.gz useFP16 auto useNHWC auto
2022-09-05 14:18:03+0800: Initializing neural net buffer to be size 19 * 19 exactly
2022-09-05 14:18:03+0800: CoreML backend thread 1: Device 1 Model version 8
2022-09-05 14:18:03+0800: CoreML backend thread 1: Device 1 Model name: g170-b40c256x2-s5095420928-d1229425124
2022-09-05 14:18:03+0800: CoreML backend thread 0: Device 0 Model version 8
2022-09-05 14:18:03+0800: CoreML backend thread 0: Device 0 Model name: g170-b40c256x2-s5095420928-d1229425124
2022-09-05 14:18:07+0800: CoreML backend thread 0: Device 0
2022-09-05 14:18:10+0800: CoreML backend thread 1: Device 1
2022-09-05 14:18:10+0800: Loaded config ../configs/misc/coreml_example.cfg
2022-09-05 14:18:10+0800: Loaded model /opt/homebrew/Cellar/katago/1.11.0/share/katago/g170-b40c256x2-s5095420928-d1229425124.bin.gz
Testing using 800 visits.
If you have a good GPU, you might increase this using "-visits N" to get more accurate results.
If you have a weak GPU and this is taking forever, you can decrease it instead to finish the benchmark faster.
You are currently using the CoreML version of KataGo.
Your GTP config is currently set to use numSearchThreads = 3
Automatically trying different numbers of threads to home in on the best (board size 19x19):
2022-09-05 14:18:10+0800: GPU 1 finishing, processed 2 rows 2 batches
2022-09-05 14:18:10+0800: GPU 0 finishing, processed 3 rows 3 batches
2022-09-05 14:18:11+0800: nnRandSeed0 = 8630180620945567969
2022-09-05 14:18:11+0800: After dedups: nnModelFile0 = /opt/homebrew/Cellar/katago/1.11.0/share/katago/g170-b40c256x2-s5095420928-d1229425124.bin.gz useFP16 auto useNHWC auto
2022-09-05 14:18:11+0800: Initializing neural net buffer to be size 19 * 19 exactly
2022-09-05 14:18:11+0800: CoreML backend thread 1: Device 1 Model version 8
2022-09-05 14:18:11+0800: CoreML backend thread 1: Device 1 Model name: g170-b40c256x2-s5095420928-d1229425124
2022-09-05 14:18:11+0800: CoreML backend thread 0: Device 0 Model version 8
2022-09-05 14:18:11+0800: CoreML backend thread 0: Device 0 Model name: g170-b40c256x2-s5095420928-d1229425124
2022-09-05 14:18:15+0800: CoreML backend thread 1: Device 1
2022-09-05 14:18:18+0800: CoreML backend thread 0: Device 0
Possible numbers of threads to test: 2, 3, 4, 5, 6, 8, 10, 12, 16, 20, 24, 32, 40, 48,
numSearchThreads = 6: 10 / 10 positions, visits/s = 305.77 nnEvals/s = 250.57 nnBatches/s = 147.74 avgBatchSize = 1.70 (26.3 secs)
numSearchThreads = 20: 10 / 10 positions, visits/s = 287.30 nnEvals/s = 251.75 nnBatches/s = 54.15 avgBatchSize = 4.65 (28.5 secs)
numSearchThreads = 4: 10 / 10 positions, visits/s = 317.60 nnEvals/s = 254.00 nnBatches/s = 203.26 avgBatchSize = 1.25 (25.3 secs)
numSearchThreads = 10: 10 / 10 positions, visits/s = 307.45 nnEvals/s = 251.39 nnBatches/s = 121.25 avgBatchSize = 2.07 (26.3 secs)
numSearchThreads = 3: 10 / 10 positions, visits/s = 313.52 nnEvals/s = 252.93 nnBatches/s = 252.69 avgBatchSize = 1.00 (25.6 secs)
numSearchThreads = 2: 10 / 10 positions, visits/s = 309.78 nnEvals/s = 251.96 nnBatches/s = 251.96 avgBatchSize = 1.00 (25.9 secs)
Ordered summary of results:
numSearchThreads = 2: 10 / 10 positions, visits/s = 309.78 nnEvals/s = 251.96 nnBatches/s = 251.96 avgBatchSize = 1.00 (25.9 secs) (EloDiff baseline)
numSearchThreads = 3: 10 / 10 positions, visits/s = 313.52 nnEvals/s = 252.93 nnBatches/s = 252.69 avgBatchSize = 1.00 (25.6 secs) (EloDiff -1)
numSearchThreads = 4: 10 / 10 positions, visits/s = 317.60 nnEvals/s = 254.00 nnBatches/s = 203.26 avgBatchSize = 1.25 (25.3 secs) (EloDiff -1)
numSearchThreads = 6: 10 / 10 positions, visits/s = 305.77 nnEvals/s = 250.57 nnBatches/s = 147.74 avgBatchSize = 1.70 (26.3 secs) (EloDiff -25)
numSearchThreads = 10: 10 / 10 positions, visits/s = 307.45 nnEvals/s = 251.39 nnBatches/s = 121.25 avgBatchSize = 2.07 (26.3 secs) (EloDiff -43)
numSearchThreads = 20: 10 / 10 positions, visits/s = 287.30 nnEvals/s = 251.75 nnBatches/s = 54.15 avgBatchSize = 4.65 (28.5 secs) (EloDiff -122)
Based on some test data, each speed doubling gains perhaps ~250 Elo by searching deeper.
Based on some test data, each thread costs perhaps 7 Elo if using 800 visits, and 2 Elo if using 5000 visits (by making MCTS worse).
So APPROXIMATELY based on this benchmark, if you intend to do a 5 second search:
numSearchThreads = 2: (baseline) (recommended)
numSearchThreads = 3: -1 Elo
numSearchThreads = 4: -1 Elo
numSearchThreads = 6: -25 Elo
numSearchThreads = 10: -43 Elo
numSearchThreads = 20: -122 Elo
If you care about performance, you may want to edit numSearchThreads in ../configs/misc/coreml_example.cfg based on the above results!
If you intend to do much longer searches, configure the seconds per game move you expect with the '-time' flag and benchmark again.
If you intend to do short or fixed-visit searches, use lower numSearchThreads for better strength, high threads will weaken strength.
If interested see also other notes about performance and mem usage in the top of ../configs/misc/coreml_example.cfg
2022-09-05 14:20:56+0800: GPU 0 finishing, processed 19907 rows 13397 batches
2022-09-05 14:20:56+0800: GPU 1 finishing, processed 19888 rows 13348 batches
I released KataGo v1.11.0-coreml2, which can run multiple threads for a CoreML model, at this link: https://github.com/ChinChangYang/KataGo/releases/tag/v1.11.0-coreml2
@ChinChangYang Congratulations on the great work on CoreML!
Any chance you could also publish documentation on how to build and run the project within Xcode? I am thinking of porting the engine to iPadOS ...
@ChinChangYang Congratulations on the great work on CoreML!
Any chance you could also publish documentation on how to build and run the project within Xcode? I am thinking of porting the engine to iPadOS ...
To generate an Xcode project, just add -G Xcode to the cmake argument list. I can generate an Xcode project with the following command:
cmake ../ -DCMAKE_SYSTEM_NAME=Darwin -DCMAKE_SYSTEM_PROCESSOR=arm64 -DCMAKE_OSX_ARCHITECTURES=arm64 -DCMAKE_CXX_FLAGS='-DNO_LIBZIP=1' -DCMAKE_BUILD_TYPE=Release -DUSE_BACKEND=COREML -G Xcode
Then, open the Xcode project by:
open katago.xcodeproj
KataGo with OpenCL has better performance on M1 Max than KataGo with CoreML, but KataGo with OpenCL makes the computer fans run crazy and the laptop gets super hot.
KataGo with OpenCL has better performance on M1 Max than KataGo with CoreML, but KataGo with OpenCL makes the computer fans run crazy and the laptop gets super hot.
That makes sense, because CoreML mainly uses the Apple Neural Engine (ANE). I tried to force CoreML to run only on the CPU or GPU, but the performance was terribly worse.
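For reference, restricting CoreML's compute units is just a configuration change. A minimal Swift sketch, with a placeholder model path rather than the file the backend actually loads:
import CoreML
import Foundation

let config = MLModelConfiguration()
// .all lets CoreML schedule work across CPU, GPU, and the ANE.
// .cpuAndGPU or .cpuOnly exclude the ANE, which is the setting that performed much worse in my test.
config.computeUnits = .all

// Placeholder path for this sketch.
let modelURL = URL(fileURLWithPath: "KataGoModel.mlmodelc")
let model = try! MLModel(contentsOf: modelURL, configuration: config)
print(model.modelDescription)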
I am developing a Metal backend to replace the OpenCL backend on new Mac systems. Then, I am going to combine the Metal backend and the CoreML backend, so that KataGo can use both the GPU and the ANE at the same time.
Not a technical comment, but it is really exciting to see native M1 support getting close.
The output of the Metal backend looks correct, but I haven't optimized its performance yet. Metal currently runs slower than the OpenCL backend, so I am going to tune its performance.
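For anyone curious, the Metal backend is built with MPSGraph, and the basic pattern is to build a graph of tensor ops once and then run it with input tensors. A tiny, self-contained sketch (just an element-wise add, not KataGo's actual network):
import Foundation
import Metal
import MetalPerformanceShadersGraph

let device = MTLCreateSystemDefaultDevice()!
let graphDevice = MPSGraphDevice(mtlDevice: device)

// Build a trivial graph: z = x + y on 1x4 float tensors.
let graph = MPSGraph()
let x = graph.placeholder(shape: [1, 4], dataType: .float32, name: "x")
let y = graph.placeholder(shape: [1, 4], dataType: .float32, name: "y")
let z = graph.addition(x, y, name: "z")

// Wrap host data as MPSGraphTensorData.
func tensorData(_ values: [Float]) -> MPSGraphTensorData {
    let data = values.withUnsafeBufferPointer { Data(buffer: $0) }
    return MPSGraphTensorData(device: graphDevice, data: data,
                              shape: [1, 4], dataType: .float32)
}

// Run the graph on the Metal device and read the result back.
let results = graph.run(feeds: [x: tensorData([1, 2, 3, 4]),
                                y: tensorData([10, 20, 30, 40])],
                        targetTensors: [z],
                        targetOperations: nil)
var out = [Float](repeating: 0, count: 4)
results[z]!.mpsndarray().readBytes(&out, strideBytes: nil)
print(out)  // [11.0, 22.0, 33.0, 44.0]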
I released KataGo v1.11.0-coreml3, which can run the GPU with Metal Performance Shaders Graph and the Neural Engine (NE) with CoreML, in the release page below:
Release https://github.com/ChinChangYang/KataGo/releases/tag/v1.11.0-coreml3
OpenCL (v1.11.0) benchmark result
Search thread Visits/s NnEvals/s Avg. batch
8 151.33 125.48 3.97
10 160.28 131.54 4.94
12 160.49 136.40 5.93
OpenCL + CoreML (v1.11.0-coreml1) benchmark result
Search thread Visits/s NnEvals/s Avg. batch
10 290.46 242.10 2.62
12 300.27 250.39 3.25
16 292.09 249.47 3.66
CoreML (v1.11.0-coreml2) benchmark result
Search thread Visits/s NnEvals/s Avg. batch
2 256.63 215.04 1.00
3 261.11 214.75 1.00
4 255.97 214.92 1.25
Metal + CoreML (v1.11.0-coreml3) benchmark result
Search thread Visits/s NnEvals/s Avg. batch
3 326.09 270.14 1.00
6 325.75 270.67 1.30
9 322.06 270.51 1.70
Performance issues
Though "Metal+CoreML" performs better than "OpenCL+CoreML", the performance of Metal-only backend is significantly worse (about 120 visits/s) than the performance of OpenCL backend (about 160 visits/s). I had spent much time on evaluating different design of Metal backend, but the best performance of Metal backend I could get is about 120 visits/s.
Besides, the Neural Engine sometimes stops running when the GPU is very busy. I don't know why this happens, but it seems to me that CoreML uses the GPU as well.
Future work
- Improve the performance of the Metal backend with a different design
- Support the next version of the model (nested residual bottleneck blocks)
Benchmark result of my Apple M1 MacBook Air (2020) macOS Ventura 13.0.1:
./katago benchmark -model kata1-b40c256-s11840935168-d2898845681.bin.gz -config ../../../../../../configs/misc/coreml_example.cfg
2022-11-27 12:14:44+0800: Running with following config:
allowResignation = true
coremlDeviceToUseThread0 = 0
coremlDeviceToUseThread1 = 100
coremlDeviceToUseThread2 = 101
lagBuffer = 1.0
logAllGTPCommunication = true
logDir = gtp_logs
logSearchInfo = true
logToStderr = false
maxTimePondering = 60
maxVisits = 500
numNNServerThreadsPerModel = 3
numSearchThreads = 3
ponderingEnabled = false
resignConsecTurns = 3
resignThreshold = -0.90
rules = tromp-taylor
searchFactorAfterOnePass = 0.50
searchFactorAfterTwoPass = 0.25
searchFactorWhenWinning = 0.40
searchFactorWhenWinningThreshold = 0.95
2022-11-27 12:14:44+0800: Loading model and initializing benchmark...
2022-11-27 12:14:44+0800: Testing with default positions for board size: 19
2022-11-27 12:14:44+0800: nnRandSeed0 = 5898672391427125252
2022-11-27 12:14:44+0800: After dedups: nnModelFile0 = kata1-b40c256-s11840935168-d2898845681.bin.gz useFP16 auto useNHWC auto
2022-11-27 12:14:44+0800: Initializing neural net buffer to be size 19 * 19 exactly
2022-11-27 12:14:44.771 katago[83063:5335758] Metal backend thread -1: CoreML-#100-19x19
2022-11-27 12:14:44.774 katago[83063:5335758] INFO: Loading KataGo Model from file:///Users/ohho/Codes/AI/Katago/katago-coreml/cpp/xcode/DerivedData/KataGo/Build/Products/Release/KataGoModel19x19.mlmodel
2022-11-27 12:14:49.220 katago[83063:5335758] Loaded KataGo Model: KataGo 19x19 model version 10 converted from kata1-b40c256-s11840935168-d2898845681/saved_model/model.config.json
2022-11-27 12:14:50.272 katago[83063:5335833] Metal backend thread 1: CoreML-#100-19x19
2022-11-27 12:14:50.272 katago[83063:5335834] Metal backend thread 2: CoreML-#101-19x19
2022-11-27 12:14:50.272 katago[83063:5335834] INFO: Loading KataGo Model from file:///Users/ohho/Codes/AI/Katago/katago-coreml/cpp/xcode/DerivedData/KataGo/Build/Products/Release/KataGoModel19x19.mlmodel
2022-11-27 12:14:50.280 katago[83063:5335832] Metal backend thread 0: Apple M1 Model version 10
2022-11-27 12:14:50.280 katago[83063:5335832] Metal backend thread 0: Apple M1 Model name kata1-b40c256-s11840935168-d2898845681
2022-11-27 12:14:50.445 katago[83063:5335832] Metal backend thread 0: Apple M1 useFP16=false useNHWC=false batchSize=1
2022-11-27 12:14:54.430 katago[83063:5335834] Loaded KataGo Model: KataGo 19x19 model version 10 converted from kata1-b40c256-s11840935168-d2898845681/saved_model/model.config.json
2022-11-27 12:15:12+0800: Loaded config ../../../../../../configs/misc/coreml_example.cfg
2022-11-27 12:15:12+0800: Loaded model kata1-b40c256-s11840935168-d2898845681.bin.gz
Testing using 800 visits.
If you have a good GPU, you might increase this using "-visits N" to get more accurate results.
If you have a weak GPU and this is taking forever, you can decrease it instead to finish the benchmark faster.
You are currently using the CoreML version of KataGo.
Your GTP config is currently set to use numSearchThreads = 3
Automatically trying different numbers of threads to home in on the best (board size 19x19):
2022-11-27 12:15:12+0800: GPU 101 finishing, processed 1 rows 1 batches
2022-11-27 12:15:12+0800: GPU 100 finishing, processed 2 rows 2 batches
2022-11-27 12:15:12+0800: GPU 0 finishing, processed 2 rows 2 batches
2022-11-27 12:15:12+0800: nnRandSeed0 = 860939090915830925
2022-11-27 12:15:12+0800: After dedups: nnModelFile0 = kata1-b40c256-s11840935168-d2898845681.bin.gz useFP16 auto useNHWC auto
2022-11-27 12:15:12+0800: Initializing neural net buffer to be size 19 * 19 exactly
2022-11-27 12:15:12.461 katago[83063:5335758] Metal backend thread -1: CoreML-#100-19x19
2022-11-27 12:15:12.461 katago[83063:5335758] INFO: Loading KataGo Model from file:///Users/ohho/Codes/AI/Katago/katago-coreml/cpp/xcode/DerivedData/KataGo/Build/Products/Release/KataGoModel19x19.mlmodel
2022-11-27 12:15:16.554 katago[83063:5335758] Loaded KataGo Model: KataGo 19x19 model version 10 converted from kata1-b40c256-s11840935168-d2898845681/saved_model/model.config.json
2022-11-27 12:15:17.595 katago[83063:5336050] Metal backend thread 2: CoreML-#101-19x19
2022-11-27 12:15:17.595 katago[83063:5336049] Metal backend thread 1: CoreML-#100-19x19
2022-11-27 12:15:17.595 katago[83063:5336050] INFO: Loading KataGo Model from file:///Users/ohho/Codes/AI/Katago/katago-coreml/cpp/xcode/DerivedData/KataGo/Build/Products/Release/KataGoModel19x19.mlmodel
2022-11-27 12:15:17.595 katago[83063:5336048] Metal backend thread 0: Apple M1 Model version 10
2022-11-27 12:15:17.595 katago[83063:5336048] Metal backend thread 0: Apple M1 Model name kata1-b40c256-s11840935168-d2898845681
2022-11-27 12:15:17.651 katago[83063:5336048] Metal backend thread 0: Apple M1 useFP16=false useNHWC=false batchSize=1
2022-11-27 12:15:21.615 katago[83063:5336050] Loaded KataGo Model: KataGo 19x19 model version 10 converted from kata1-b40c256-s11840935168-d2898845681/saved_model/model.config.json
Possible numbers of threads to test: 3, 4, 5, 6, 8, 10, 12, 16, 20, 24, 32, 40, 48, 64,
numSearchThreads = 8: 10 / 10 positions, visits/s = 271.69 nnEvals/s = 234.93 nnBatches/s = 144.06 avgBatchSize = 1.63 (29.7 secs)
numSearchThreads = 24: 10 / 10 positions, visits/s = 225.70 nnEvals/s = 204.36 nnBatches/s = 70.82 avgBatchSize = 2.89 (36.5 secs)
numSearchThreads = 5: 10 / 10 positions, visits/s = 137.86 nnEvals/s = 116.75 nnBatches/s = 100.00 avgBatchSize = 1.17 (58.3 secs)
numSearchThreads = 12: 10 / 10 positions, visits/s = 124.16 nnEvals/s = 107.24 nnBatches/s = 52.16 avgBatchSize = 2.06 (65.3 secs)
numSearchThreads = 4: 10 / 10 positions, visits/s = 128.91 nnEvals/s = 106.66 nnBatches/s = 106.36 avgBatchSize = 1.00 (62.3 secs)
numSearchThreads = 6: 10 / 10 positions, visits/s = 213.99 nnEvals/s = 174.99 nnBatches/s = 131.50 avgBatchSize = 1.33 (37.6 secs)
numSearchThreads = 10: 10 / 10 positions, visits/s = 240.09 nnEvals/s = 200.68 nnBatches/s = 103.16 avgBatchSize = 1.95 (33.7 secs)
Ordered summary of results:
numSearchThreads = 4: 10 / 10 positions, visits/s = 128.91 nnEvals/s = 106.66 nnBatches/s = 106.36 avgBatchSize = 1.00 (62.3 secs) (EloDiff baseline)
numSearchThreads = 5: 10 / 10 positions, visits/s = 137.86 nnEvals/s = 116.75 nnBatches/s = 100.00 avgBatchSize = 1.17 (58.3 secs) (EloDiff +18)
numSearchThreads = 6: 10 / 10 positions, visits/s = 213.99 nnEvals/s = 174.99 nnBatches/s = 131.50 avgBatchSize = 1.33 (37.6 secs) (EloDiff +177)
numSearchThreads = 8: 10 / 10 positions, visits/s = 271.69 nnEvals/s = 234.93 nnBatches/s = 144.06 avgBatchSize = 1.63 (29.7 secs) (EloDiff +256)
numSearchThreads = 10: 10 / 10 positions, visits/s = 240.09 nnEvals/s = 200.68 nnBatches/s = 103.16 avgBatchSize = 1.95 (33.7 secs) (EloDiff +197)
numSearchThreads = 12: 10 / 10 positions, visits/s = 124.16 nnEvals/s = 107.24 nnBatches/s = 52.16 avgBatchSize = 2.06 (65.3 secs) (EloDiff -76)
numSearchThreads = 24: 10 / 10 positions, visits/s = 225.70 nnEvals/s = 204.36 nnBatches/s = 70.82 avgBatchSize = 2.89 (36.5 secs) (EloDiff +89)
Based on some test data, each speed doubling gains perhaps ~250 Elo by searching deeper.
Based on some test data, each thread costs perhaps 7 Elo if using 800 visits, and 2 Elo if using 5000 visits (by making MCTS worse).
So APPROXIMATELY based on this benchmark, if you intend to do a 5 second search:
numSearchThreads = 4: (baseline)
numSearchThreads = 5: +18 Elo
numSearchThreads = 6: +177 Elo
numSearchThreads = 8: +256 Elo (recommended)
numSearchThreads = 10: +197 Elo
numSearchThreads = 12: -76 Elo
numSearchThreads = 24: +89 Elo
If you care about performance, you may want to edit numSearchThreads in ../../../../../../configs/misc/coreml_example.cfg based on the above results!
If you intend to do much longer searches, configure the seconds per game move you expect with the '-time' flag and benchmark again.
If you intend to do short or fixed-visit searches, use lower numSearchThreads for better strength, high threads will weaken strength.
If interested see also other notes about performance and mem usage in the top of ../../../../../../configs/misc/coreml_example.cfg
2022-11-27 12:21:02+0800: GPU 0 finishing, processed 9173 rows 6288 batches
2022-11-27 12:21:02+0800: GPU 101 finishing, processed 19543 rows 12397 batches
2022-11-27 12:21:02+0800: GPU 100 finishing, processed 19518 rows 12467 batches
I released KataGo v1.11.0-coreml3, which can run the GPU with Metal Performance Shaders Graph and the Neural Engine (NE) with CoreML, in the release page below:
Release https://github.com/ChinChangYang/KataGo/releases/tag/v1.11.0-coreml3
OpenCL (v1.11.0) benchmark result
Search thread Visits/s NnEvals/s Avg. batch
8 151.33 125.48 3.97
10 160.28 131.54 4.94
12 160.49 136.40 5.93
OpenCL + CoreML (v1.11.0-coreml1) benchmark result
Search thread Visits/s NnEvals/s Avg. batch
10 290.46 242.10 2.62
12 300.27 250.39 3.25
16 292.09 249.47 3.66
CoreML (v1.11.0-coreml2) benchmark result
Search thread Visits/s NnEvals/s Avg. batch
2 256.63 215.04 1.00
3 261.11 214.75 1.00
4 255.97 214.92 1.25
Metal + CoreML (v1.11.0-coreml3) benchmark result
Search thread Visits/s NnEvals/s Avg. batch
3 326.09 270.14 1.00
6 325.75 270.67 1.30
9 322.06 270.51 1.70
Performance issues
Though "Metal+CoreML" performs better than "OpenCL+CoreML", the performance of Metal-only backend is significantly worse (about 120 visits/s) than the performance of OpenCL backend (about 160 visits/s). I had spent much time on evaluating different design of Metal backend, but the best performance of Metal backend I could get is about 120 visits/s.
Besides, the Neural Engine sometimes stops running when the GPU is very busy. I don't know why this happens, but it seems to me that CoreML uses the GPU as well.
Future work
- Improve the performance of the Metal backend with a different design
- Support the next version of the model (nested residual bottleneck blocks)
Tested on my MacBook Pro 2021 16" M1 Max 64GB. I got the results below; it's excellent work.
Possible numbers of threads to test: 3, 4, 5, 6, 8, 10, 12, 16, 20, 24, 32, 40, 48, 64,
numSearchThreads = 8: 10 / 10 positions, visits/s = 389.15 nnEvals/s = 320.05 nnBatches/s = 197.37 avgBatchSize = 1.62 (20.7 secs)
numSearchThreads = 24: 10 / 10 positions, visits/s = 365.06 nnEvals/s = 315.01 nnBatches/s = 103.12 avgBatchSize = 3.05 (22.5 secs)
numSearchThreads = 5: 10 / 10 positions, visits/s = 384.04 nnEvals/s = 320.61 nnBatches/s = 274.37 avgBatchSize = 1.17 (20.9 secs)
numSearchThreads = 12: 10 / 10 positions, visits/s = 368.07 nnEvals/s = 319.00 nnBatches/s = 156.87 avgBatchSize = 2.03 (22.0 secs)
numSearchThreads = 4: 10 / 10 positions, visits/s = 385.28 nnEvals/s = 319.40 nnBatches/s = 316.43 avgBatchSize = 1.01 (20.8 secs)
numSearchThreads = 3: 10 / 10 positions, visits/s = 382.09 nnEvals/s = 319.92 nnBatches/s = 319.49 avgBatchSize = 1.00 (21.0 secs)
Ordered summary of results:
numSearchThreads = 3: 10 / 10 positions, visits/s = 382.09 nnEvals/s = 319.92 nnBatches/s = 319.49 avgBatchSize = 1.00 (21.0 secs) (EloDiff baseline)
numSearchThreads = 4: 10 / 10 positions, visits/s = 385.28 nnEvals/s = 319.40 nnBatches/s = 316.43 avgBatchSize = 1.01 (20.8 secs) (EloDiff -1)
numSearchThreads = 5: 10 / 10 positions, visits/s = 384.04 nnEvals/s = 320.61 nnBatches/s = 274.37 avgBatchSize = 1.17 (20.9 secs) (EloDiff -7)
numSearchThreads = 8: 10 / 10 positions, visits/s = 389.15 nnEvals/s = 320.05 nnBatches/s = 197.37 avgBatchSize = 1.62 (20.7 secs) (EloDiff -15)
numSearchThreads = 12: 10 / 10 positions, visits/s = 368.07 nnEvals/s = 319.00 nnBatches/s = 156.87 avgBatchSize = 2.03 (22.0 secs) (EloDiff -55)
numSearchThreads = 24: 10 / 10 positions, visits/s = 365.06 nnEvals/s = 315.01 nnBatches/s = 103.12 avgBatchSize = 3.05 (22.5 secs) (EloDiff -113)
Based on some test data, each speed doubling gains perhaps ~250 Elo by searching deeper.
Based on some test data, each thread costs perhaps 7 Elo if using 800 visits, and 2 Elo if using 5000 visits (by making MCTS worse).
So APPROXIMATELY based on this benchmark, if you intend to do a 5 second search:
numSearchThreads = 3: (baseline) (recommended)
numSearchThreads = 4: -1 Elo
numSearchThreads = 5: -7 Elo
numSearchThreads = 8: -15 Elo
numSearchThreads = 12: -55 Elo
numSearchThreads = 24: -113 Elo
Apple M2 MacBook Air (2022) macOS Ventura 13.0.1:
./katago benchmark -model kata1-b40c256-s11840935168-d2898845681.bin.gz -config ../../../../../../configs/misc/coreml_example.cfg
2022-11-29 12:24:09+0800: Running with following config:
allowResignation = true
coremlDeviceToUseThread0 = 0
coremlDeviceToUseThread1 = 100
coremlDeviceToUseThread2 = 101
lagBuffer = 1.0
logAllGTPCommunication = true
logDir = gtp_logs
logSearchInfo = true
logToStderr = false
maxTimePondering = 60
maxVisits = 500
numNNServerThreadsPerModel = 3
numSearchThreads = 3
ponderingEnabled = false
resignConsecTurns = 3
resignThreshold = -0.90
rules = tromp-taylor
searchFactorAfterOnePass = 0.50
searchFactorAfterTwoPass = 0.25
searchFactorWhenWinning = 0.40
searchFactorWhenWinningThreshold = 0.95
2022-11-29 12:24:09+0800: Loading model and initializing benchmark...
2022-11-29 12:24:09+0800: Testing with default positions for board size: 19
2022-11-29 12:24:09+0800: nnRandSeed0 = 2934192115712296623
2022-11-29 12:24:09+0800: After dedups: nnModelFile0 = kata1-b40c256-s11840935168-d2898845681.bin.gz useFP16 auto useNHWC auto
2022-11-29 12:24:09+0800: Initializing neural net buffer to be size 19 * 19 exactly
2022-11-29 12:24:09.923 katago[76509:8606275] Metal backend thread -1: CoreML-#100-19x19
2022-11-29 12:24:09.925 katago[76509:8606275] INFO: Loading KataGo Model from file:///Users/ohho/Codes/AI/Katago/katago-coreml/cpp/xcode/DerivedData/KataGo/Build/Products/Release/KataGoModel19x19.mlmodel
2022-11-29 12:24:13.812 katago[76509:8606275] Loaded KataGo Model: KataGo 19x19 model version 10 converted from kata1-b40c256-s11840935168-d2898845681/saved_model/model.config.json
2022-11-29 12:24:14.748 katago[76509:8606345] Metal backend thread 2: CoreML-#101-19x19
2022-11-29 12:24:14.748 katago[76509:8606344] Metal backend thread 1: CoreML-#100-19x19
2022-11-29 12:24:14.749 katago[76509:8606345] INFO: Loading KataGo Model from file:///Users/ohho/Codes/AI/Katago/katago-coreml/cpp/xcode/DerivedData/KataGo/Build/Products/Release/KataGoModel19x19.mlmodel
2022-11-29 12:24:14.753 katago[76509:8606343] Metal backend thread 0: Apple M2 Model version 10
2022-11-29 12:24:14.753 katago[76509:8606343] Metal backend thread 0: Apple M2 Model name kata1-b40c256-s11840935168-d2898845681
2022-11-29 12:24:14.822 katago[76509:8606343] Metal backend thread 0: Apple M2 useFP16=false useNHWC=false batchSize=1
2022-11-29 12:24:18.353 katago[76509:8606345] Loaded KataGo Model: KataGo 19x19 model version 10 converted from kata1-b40c256-s11840935168-d2898845681/saved_model/model.config.json
2022-11-29 12:24:35+0800: Loaded config ../../../../../../configs/misc/coreml_example.cfg
2022-11-29 12:24:35+0800: Loaded model kata1-b40c256-s11840935168-d2898845681.bin.gz
Testing using 800 visits.
If you have a good GPU, you might increase this using "-visits N" to get more accurate results.
If you have a weak GPU and this is taking forever, you can decrease it instead to finish the benchmark faster.
You are currently using the CoreML version of KataGo.
Your GTP config is currently set to use numSearchThreads = 3
Automatically trying different numbers of threads to home in on the best (board size 19x19):
2022-11-29 12:24:36+0800: GPU 100 finishing, processed 1 rows 1 batches
2022-11-29 12:24:36+0800: GPU 0 finishing, processed 2 rows 2 batches
2022-11-29 12:24:36+0800: GPU 101 finishing, processed 2 rows 2 batches
2022-11-29 12:24:36+0800: nnRandSeed0 = 16354575094665057778
2022-11-29 12:24:36+0800: After dedups: nnModelFile0 = kata1-b40c256-s11840935168-d2898845681.bin.gz useFP16 auto useNHWC auto
2022-11-29 12:24:36+0800: Initializing neural net buffer to be size 19 * 19 exactly
2022-11-29 12:24:36.063 katago[76509:8606275] Metal backend thread -1: CoreML-#100-19x19
2022-11-29 12:24:36.064 katago[76509:8606275] INFO: Loading KataGo Model from file:///Users/ohho/Codes/AI/Katago/katago-coreml/cpp/xcode/DerivedData/KataGo/Build/Products/Release/KataGoModel19x19.mlmodel
2022-11-29 12:24:39.818 katago[76509:8606275] Loaded KataGo Model: KataGo 19x19 model version 10 converted from kata1-b40c256-s11840935168-d2898845681/saved_model/model.config.json
2022-11-29 12:24:40.783 katago[76509:8606701] Metal backend thread 1: CoreML-#100-19x19
2022-11-29 12:24:40.783 katago[76509:8606702] Metal backend thread 2: CoreML-#101-19x19
2022-11-29 12:24:40.783 katago[76509:8606700] Metal backend thread 0: Apple M2 Model version 10
2022-11-29 12:24:40.783 katago[76509:8606702] INFO: Loading KataGo Model from file:///Users/ohho/Codes/AI/Katago/katago-coreml/cpp/xcode/DerivedData/KataGo/Build/Products/Release/KataGoModel19x19.mlmodel
2022-11-29 12:24:40.783 katago[76509:8606700] Metal backend thread 0: Apple M2 Model name kata1-b40c256-s11840935168-d2898845681
2022-11-29 12:24:40.823 katago[76509:8606700] Metal backend thread 0: Apple M2 useFP16=false useNHWC=false batchSize=1
2022-11-29 12:24:44.260 katago[76509:8606702] Loaded KataGo Model: KataGo 19x19 model version 10 converted from kata1-b40c256-s11840935168-d2898845681/saved_model/model.config.json
Possible numbers of threads to test: 3, 4, 5, 6, 8, 10, 12, 16, 20, 24, 32, 40, 48, 64,
numSearchThreads = 8: 10 / 10 positions, visits/s = 393.44 nnEvals/s = 324.59 nnBatches/s = 194.26 avgBatchSize = 1.67 (20.5 secs)
numSearchThreads = 24: 10 / 10 positions, visits/s = 330.90 nnEvals/s = 294.67 nnBatches/s = 112.19 avgBatchSize = 2.63 (24.9 secs)
numSearchThreads = 5: 10 / 10 positions, visits/s = 360.85 nnEvals/s = 306.54 nnBatches/s = 263.32 avgBatchSize = 1.16 (22.3 secs)
numSearchThreads = 12: 10 / 10 positions, visits/s = 349.09 nnEvals/s = 299.10 nnBatches/s = 140.79 avgBatchSize = 2.12 (23.2 secs)
numSearchThreads = 4: 10 / 10 positions, visits/s = 342.61 nnEvals/s = 286.50 nnBatches/s = 284.79 avgBatchSize = 1.01 (23.4 secs)
numSearchThreads = 6: 10 / 10 positions, visits/s = 360.91 nnEvals/s = 299.80 nnBatches/s = 226.41 avgBatchSize = 1.32 (22.3 secs)
numSearchThreads = 10: 10 / 10 positions, visits/s = 332.49 nnEvals/s = 285.72 nnBatches/s = 145.65 avgBatchSize = 1.96 (24.3 secs)
Ordered summary of results:
numSearchThreads = 4: 10 / 10 positions, visits/s = 342.61 nnEvals/s = 286.50 nnBatches/s = 284.79 avgBatchSize = 1.01 (23.4 secs) (EloDiff baseline)
numSearchThreads = 5: 10 / 10 positions, visits/s = 360.85 nnEvals/s = 306.54 nnBatches/s = 263.32 avgBatchSize = 1.16 (22.3 secs) (EloDiff +15)
numSearchThreads = 6: 10 / 10 positions, visits/s = 360.91 nnEvals/s = 299.80 nnBatches/s = 226.41 avgBatchSize = 1.32 (22.3 secs) (EloDiff +10)
numSearchThreads = 8: 10 / 10 positions, visits/s = 393.44 nnEvals/s = 324.59 nnBatches/s = 194.26 avgBatchSize = 1.67 (20.5 secs) (EloDiff +34)
numSearchThreads = 10: 10 / 10 positions, visits/s = 332.49 nnEvals/s = 285.72 nnBatches/s = 145.65 avgBatchSize = 1.96 (24.3 secs) (EloDiff -40)
numSearchThreads = 12: 10 / 10 positions, visits/s = 349.09 nnEvals/s = 299.10 nnBatches/s = 140.79 avgBatchSize = 2.12 (23.2 secs) (EloDiff -31)
numSearchThreads = 24: 10 / 10 positions, visits/s = 330.90 nnEvals/s = 294.67 nnBatches/s = 112.19 avgBatchSize = 2.63 (24.9 secs) (EloDiff -110)
Based on some test data, each speed doubling gains perhaps ~250 Elo by searching deeper.
Based on some test data, each thread costs perhaps 7 Elo if using 800 visits, and 2 Elo if using 5000 visits (by making MCTS worse).
So APPROXIMATELY based on this benchmark, if you intend to do a 5 second search:
numSearchThreads = 4: (baseline)
numSearchThreads = 5: +15 Elo
numSearchThreads = 6: +10 Elo
numSearchThreads = 8: +34 Elo (recommended)
numSearchThreads = 10: -40 Elo
numSearchThreads = 12: -31 Elo
numSearchThreads = 24: -110 Elo
If you care about performance, you may want to edit numSearchThreads in ../../../../../../configs/misc/coreml_example.cfg based on the above results!
If you intend to do much longer searches, configure the seconds per game move you expect with the '-time' flag and benchmark again.
If you intend to do short or fixed-visit searches, use lower numSearchThreads for better strength, high threads will weaken strength.
If interested see also other notes about performance and mem usage in the top of ../../../../../../configs/misc/coreml_example.cfg
2022-11-29 12:27:41+0800: GPU 100 finishing, processed 20259 rows 13113 batches
2022-11-29 12:27:41+0800: GPU 101 finishing, processed 20267 rows 13168 batches
2022-11-29 12:27:41+0800: GPU 0 finishing, processed 7595 rows 4904 batches
Hi, and thank you for your work! Total macOS beginner here, but I managed to get all the versions working, and I am sharing my benchmark results. (After figuring out all the Xcode, wget, and cmake stuff...) (14" MacBook M1 Pro 8-core 16GB, Ventura 13.1) Network: kata1-b40c256-s11840935168-d2898845681.bin.gz
Stupid question: can CoreML3 be used for analyzing games somehow?
OpenCL:
Ordered summary of results:
numSearchThreads = 5: 10 / 10 positions, visits/s = 123.19 nnEvals/s = 105.37 nnBatches/s = 42.32 avgBatchSize = 2.49 (65.3 secs) (EloDiff baseline)
numSearchThreads = 8: 10 / 10 positions, visits/s = 160.73 nnEvals/s = 132.53 nnBatches/s = 33.41 avgBatchSize = 3.97 (50.2 secs) (EloDiff +79)
numSearchThreads = 10: 10 / 10 positions, visits/s = 166.39 nnEvals/s = 137.36 nnBatches/s = 27.76 avgBatchSize = 4.95 (48.6 secs) (EloDiff +78)
numSearchThreads = 12: 10 / 10 positions, visits/s = 178.45 nnEvals/s = 149.41 nnBatches/s = 25.20 avgBatchSize = 5.93 (45.4 secs) (EloDiff +92)
numSearchThreads = 16: 10 / 10 positions, visits/s = 192.48 nnEvals/s = 160.77 nnBatches/s = 20.44 avgBatchSize = 7.86 (42.3 secs) (EloDiff +97)
numSearchThreads = 20: 10 / 10 positions, visits/s = 188.24 nnEvals/s = 165.16 nnBatches/s = 16.94 avgBatchSize = 9.75 (43.5 secs) (EloDiff +61)
CoreML1:
Ordered summary of results:
numSearchThreads = 1: 10 / 10 positions, visits/s = 222.36 nnEvals/s = 182.95 nnBatches/s = 182.95 avgBatchSize = 1.00 (36.0 secs) (EloDiff baseline)
numSearchThreads = 2: 10 / 10 positions, visits/s = 223.93 nnEvals/s = 186.94 nnBatches/s = 186.91 avgBatchSize = 1.00 (35.8 secs) (EloDiff -3)
numSearchThreads = 3: 10 / 10 positions, visits/s = 225.64 nnEvals/s = 186.64 nnBatches/s = 124.58 avgBatchSize = 1.50 (35.5 secs) (EloDiff -7)
numSearchThreads = 4: 10 / 10 positions, visits/s = 228.36 nnEvals/s = 186.47 nnBatches/s = 93.50 avgBatchSize = 1.99 (35.2 secs) (EloDiff -8)
numSearchThreads = 5: 10 / 10 positions, visits/s = 225.46 nnEvals/s = 186.23 nnBatches/s = 74.79 avgBatchSize = 2.49 (35.7 secs) (EloDiff -19)
numSearchThreads = 6: 10 / 10 positions, visits/s = 222.39 nnEvals/s = 186.37 nnBatches/s = 62.44 avgBatchSize = 2.98 (36.2 secs) (EloDiff -30)
numSearchThreads = 12: 10 / 10 positions, visits/s = 223.01 nnEvals/s = 186.18 nnBatches/s = 31.38 avgBatchSize = 5.93 (36.4 secs) (EloDiff -65)
CoreML2:
Ordered summary of results:
numSearchThreads = 2: 10 / 10 positions, visits/s = 258.23 nnEvals/s = 215.19 nnBatches/s = 215.19 avgBatchSize = 1.00 (31.0 secs) (EloDiff baseline)
numSearchThreads = 3: 10 / 10 positions, visits/s = 266.04 nnEvals/s = 215.38 nnBatches/s = 215.28 avgBatchSize = 1.00 (30.1 secs) (EloDiff +5)
numSearchThreads = 4: 10 / 10 positions, visits/s = 258.99 nnEvals/s = 215.28 nnBatches/s = 172.29 avgBatchSize = 1.25 (31.0 secs) (EloDiff -10)
numSearchThreads = 6: 10 / 10 positions, visits/s = 268.33 nnEvals/s = 215.13 nnBatches/s = 140.53 avgBatchSize = 1.53 (30.0 secs) (EloDiff -8)
numSearchThreads = 10: 10 / 10 positions, visits/s = 251.59 nnEvals/s = 214.83 nnBatches/s = 91.90 avgBatchSize = 2.34 (32.2 secs) (EloDiff -55)
numSearchThreads = 20: 10 / 10 positions, visits/s = 248.13 nnEvals/s = 213.89 nnBatches/s = 52.54 avgBatchSize = 4.07 (33.0 secs) (EloDiff -117)
CoreML3:
Ordered summary of results:
numSearchThreads = 3: 10 / 10 positions, visits/s = 326.80 nnEvals/s = 269.63 nnBatches/s = 269.30 avgBatchSize = 1.00 (24.5 secs) (EloDiff baseline)
numSearchThreads = 4: 10 / 10 positions, visits/s = 329.84 nnEvals/s = 270.69 nnBatches/s = 269.82 avgBatchSize = 1.00 (24.3 secs) (EloDiff -1)
numSearchThreads = 5: 10 / 10 positions, visits/s = 328.57 nnEvals/s = 269.88 nnBatches/s = 230.94 avgBatchSize = 1.17 (24.5 secs) (EloDiff -8)
numSearchThreads = 8: 10 / 10 positions, visits/s = 313.44 nnEvals/s = 269.20 nnBatches/s = 175.90 avgBatchSize = 1.53 (25.7 secs) (EloDiff -40)
numSearchThreads = 12: 10 / 10 positions, visits/s = 313.71 nnEvals/s = 268.61 nnBatches/s = 133.38 avgBatchSize = 2.01 (25.9 secs) (EloDiff -60)
numSearchThreads = 24: 10 / 10 positions, visits/s = 304.94 nnEvals/s = 268.21 nnBatches/s = 87.48 avgBatchSize = 3.07 (27.0 secs) (EloDiff -133)
Benchmark of v1.12.4-coreml1 on Apple M2 MacBook Air (2022) macOS Ventura 13.2.1:
./katago benchmark -model kata1-b18c384nbt-uec.bin.gz -config ../../../../../../configs/misc/coreml_example.cfg
2023-03-24 17:53:45+0800: Running with following config:
allowResignation = true
coremlDeviceToUseThread0 = 0
coremlDeviceToUseThread1 = 100
coremlDeviceToUseThread2 = 101
lagBuffer = 1.0
logAllGTPCommunication = true
logDir = gtp_logs
logSearchInfo = true
logToStderr = false
maxTimePondering = 60
maxVisits = 500
numNNServerThreadsPerModel = 3
numSearchThreads = 3
ponderingEnabled = false
resignConsecTurns = 3
resignThreshold = -0.90
rules = tromp-taylor
searchFactorAfterOnePass = 0.50
searchFactorAfterTwoPass = 0.25
searchFactorWhenWinning = 0.40
searchFactorWhenWinningThreshold = 0.95
2023-03-24 17:53:45+0800: Loading model and initializing benchmark...
2023-03-24 17:53:45+0800: Testing with default positions for board size: 19
2023-03-24 17:53:45+0800: nnRandSeed0 = 17030101136292765193
2023-03-24 17:53:45+0800: After dedups: nnModelFile0 = kata1-b18c384nbt-uec.bin.gz useFP16 auto useNHWC auto
2023-03-24 17:53:45+0800: Initializing neural net buffer to be size 19 * 19 exactly
2023-03-24 17:53:46.058 katago[61239:16172718] INFO: Compiling model at file:///Users/ohho/Downloads/Katago/KataGo-1.12.4-coreml1/cpp/xcode/DerivedData/KataGo/Build/Products/Release/KataGoModel19x19fp16v11.mlpackage/
2023-03-24 17:53:46.067 katago[61239:16172717] Metal backend thread 0: Apple M2 Model version 11
2023-03-24 17:53:46.067 katago[61239:16172717] Metal backend thread 0: Apple M2 Model name b18c384nbt-uec-20221121b
2023-03-24 17:53:46.139 katago[61239:16172717] Metal backend thread 0: Apple M2 useFP16=false useNHWC=false batchSize=1
2023-03-24 17:53:46.559 katago[61239:16172718] INFO: Creating model with contents file:///var/folders/dl/2t5pq5h11nq76mz228xj4q8r0000gn/T/KataGoModel19x19fp16v11.mlmodelc
2023-03-24 17:53:58.252 katago[61239:16172718] INFO: Created model: KataGo 19x19 compute precision fp16 model version 11 converted from b18c384nbt-uec-20221121b.ckpt
2023-03-24 17:53:58.252 katago[61239:16172718] CoreML backend thread 1: #0-19x19 useFP16 1
2023-03-24 17:53:58.252 katago[61239:16172719] INFO: Compiling model at file:///Users/ohho/Downloads/Katago/KataGo-1.12.4-coreml1/cpp/xcode/DerivedData/KataGo/Build/Products/Release/KataGoModel19x19fp16v11.mlpackage/
2023-03-24 17:53:58.723 katago[61239:16172719] INFO: Creating model with contents file:///var/folders/dl/2t5pq5h11nq76mz228xj4q8r0000gn/T/KataGoModel19x19fp16v11_11C1839F-5D33-4A6C-97D4-BE518B82DBDD.mlmodelc
2023-03-24 17:54:10.574 katago[61239:16172719] INFO: Created model: KataGo 19x19 compute precision fp16 model version 11 converted from b18c384nbt-uec-20221121b.ckpt
2023-03-24 17:54:10.574 katago[61239:16172719] CoreML backend thread 2: #1-19x19 useFP16 1
2023-03-24 17:54:13+0800: Loaded config ../../../../../../configs/misc/coreml_example.cfg
2023-03-24 17:54:13+0800: Loaded model kata1-b18c384nbt-uec.bin.gz
Testing using 800 visits.
If you have a good GPU, you might increase this using "-visits N" to get more accurate results.
If you have a weak GPU and this is taking forever, you can decrease it instead to finish the benchmark faster.
You are currently using the CoreML version of KataGo.
Your GTP config is currently set to use numSearchThreads = 3
Automatically trying different numbers of threads to home in on the best (board size 19x19):
2023-03-24 17:54:13+0800: GPU 101 finishing, processed 1 rows 1 batches
2023-03-24 17:54:13+0800: GPU 0 finishing, processed 2 rows 2 batches
2023-03-24 17:54:13+0800: GPU 100 finishing, processed 2 rows 2 batches
2023-03-24 17:54:13+0800: nnRandSeed0 = 844083162673476441
2023-03-24 17:54:13+0800: After dedups: nnModelFile0 = kata1-b18c384nbt-uec.bin.gz useFP16 auto useNHWC auto
2023-03-24 17:54:13+0800: Initializing neural net buffer to be size 19 * 19 exactly
2023-03-24 17:54:14.294 katago[61239:16173407] Metal backend thread 0: Apple M2 Model version 11
2023-03-24 17:54:14.294 katago[61239:16173407] Metal backend thread 0: Apple M2 Model name b18c384nbt-uec-20221121b
2023-03-24 17:54:14.294 katago[61239:16173409] INFO: Compiling model at file:///Users/ohho/Downloads/Katago/KataGo-1.12.4-coreml1/cpp/xcode/DerivedData/KataGo/Build/Products/Release/KataGoModel19x19fp16v11.mlpackage/
2023-03-24 17:54:14.338 katago[61239:16173407] Metal backend thread 0: Apple M2 useFP16=false useNHWC=false batchSize=1
2023-03-24 17:54:14.772 katago[61239:16173409] INFO: Creating model with contents file:///var/folders/dl/2t5pq5h11nq76mz228xj4q8r0000gn/T/KataGoModel19x19fp16v11_66A774CA-D206-4BF8-97C8-A360D56E1753.mlmodelc
2023-03-24 17:54:26.395 katago[61239:16173409] INFO: Created model: KataGo 19x19 compute precision fp16 model version 11 converted from b18c384nbt-uec-20221121b.ckpt
2023-03-24 17:54:26.395 katago[61239:16173409] CoreML backend thread 2: #2-19x19 useFP16 1
2023-03-24 17:54:26.395 katago[61239:16173408] INFO: Compiling model at file:///Users/ohho/Downloads/Katago/KataGo-1.12.4-coreml1/cpp/xcode/DerivedData/KataGo/Build/Products/Release/KataGoModel19x19fp16v11.mlpackage/
2023-03-24 17:54:26.867 katago[61239:16173408] INFO: Creating model with contents file:///var/folders/dl/2t5pq5h11nq76mz228xj4q8r0000gn/T/KataGoModel19x19fp16v11_3A001FD8-6FB1-4110-9BA1-947E9A25728D.mlmodelc
2023-03-24 17:54:38.504 katago[61239:16173408] INFO: Created model: KataGo 19x19 compute precision fp16 model version 11 converted from b18c384nbt-uec-20221121b.ckpt
2023-03-24 17:54:38.504 katago[61239:16173408] CoreML backend thread 1: #3-19x19 useFP16 1
Possible numbers of threads to test: 3, 4, 5, 6, 8, 10, 12, 16, 20, 24, 32, 40, 48, 64,
numSearchThreads = 8: 10 / 10 positions, visits/s = 285.93 nnEvals/s = 243.27 nnBatches/s = 145.37 avgBatchSize = 1.67 (28.2 secs)
numSearchThreads = 24: 10 / 10 positions, visits/s = 279.33 nnEvals/s = 245.17 nnBatches/s = 76.27 avgBatchSize = 3.21 (29.4 secs)
numSearchThreads = 5: 10 / 10 positions, visits/s = 276.79 nnEvals/s = 236.24 nnBatches/s = 202.67 avgBatchSize = 1.17 (29.0 secs)
numSearchThreads = 12: 10 / 10 positions, visits/s = 263.05 nnEvals/s = 225.49 nnBatches/s = 98.88 avgBatchSize = 2.28 (30.8 secs)
numSearchThreads = 4: 10 / 10 positions, visits/s = 224.16 nnEvals/s = 191.11 nnBatches/s = 190.50 avgBatchSize = 1.00 (35.8 secs)
numSearchThreads = 6: 10 / 10 positions, visits/s = 206.34 nnEvals/s = 174.92 nnBatches/s = 130.65 avgBatchSize = 1.34 (39.0 secs)
numSearchThreads = 10: 10 / 10 positions, visits/s = 210.37 nnEvals/s = 178.87 nnBatches/s = 95.34 avgBatchSize = 1.88 (38.5 secs)
Ordered summary of results:
numSearchThreads = 4: 10 / 10 positions, visits/s = 224.16 nnEvals/s = 191.11 nnBatches/s = 190.50 avgBatchSize = 1.00 (35.8 secs) (EloDiff baseline)
numSearchThreads = 5: 10 / 10 positions, visits/s = 276.79 nnEvals/s = 236.24 nnBatches/s = 202.67 avgBatchSize = 1.17 (29.0 secs) (EloDiff +73)
numSearchThreads = 6: 10 / 10 positions, visits/s = 206.34 nnEvals/s = 174.92 nnBatches/s = 130.65 avgBatchSize = 1.34 (39.0 secs) (EloDiff -43)
numSearchThreads = 8: 10 / 10 positions, visits/s = 285.93 nnEvals/s = 243.27 nnBatches/s = 145.37 avgBatchSize = 1.67 (28.2 secs) (EloDiff +70)
numSearchThreads = 10: 10 / 10 positions, visits/s = 210.37 nnEvals/s = 178.87 nnBatches/s = 95.34 avgBatchSize = 1.88 (38.5 secs) (EloDiff -61)
numSearchThreads = 12: 10 / 10 positions, visits/s = 263.05 nnEvals/s = 225.49 nnBatches/s = 98.88 avgBatchSize = 2.28 (30.8 secs) (EloDiff +15)
numSearchThreads = 24: 10 / 10 positions, visits/s = 279.33 nnEvals/s = 245.17 nnBatches/s = 76.27 avgBatchSize = 3.21 (29.4 secs) (EloDiff -25)
Based on some test data, each speed doubling gains perhaps ~250 Elo by searching deeper.
Based on some test data, each thread costs perhaps 7 Elo if using 800 visits, and 2 Elo if using 5000 visits (by making MCTS worse).
So APPROXIMATELY based on this benchmark, if you intend to do a 5 second search:
numSearchThreads = 4: (baseline)
numSearchThreads = 5: +73 Elo (recommended)
numSearchThreads = 6: -43 Elo
numSearchThreads = 8: +70 Elo
numSearchThreads = 10: -61 Elo
numSearchThreads = 12: +15 Elo
numSearchThreads = 24: -25 Elo
If you care about performance, you may want to edit numSearchThreads in ../../../../../../configs/misc/coreml_example.cfg based on the above results!
If you intend to do much longer searches, configure the seconds per game move you expect with the '-time' flag and benchmark again.
If you intend to do short or fixed-visit searches, use lower numSearchThreads for better strength, high threads will weaken strength.
If interested see also other notes about performance and mem usage in the top of ../../../../../../configs/misc/coreml_example.cfg
2023-03-24 17:58:31+0800: GPU 0 finishing, processed 12693 rows 8157 batches
2023-03-24 17:58:31+0800: GPU 101 finishing, processed 17891 rows 11331 batches
2023-03-24 17:58:31+0800: GPU 100 finishing, processed 17868 rows 11388 batches
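As a rough sanity check on how these estimates combine (back-of-the-envelope arithmetic only, not KataGo's exact formula): going from 4 to 8 threads raised visits/s from 224.16 to 285.93, a factor of about 1.28; log2(1.28) × 250 ≈ +88 Elo from the extra speed, minus roughly 4 extra threads × 7 Elo ≈ 28 Elo of MCTS cost, which leaves about +60 Elo, in the same ballpark as the +70 Elo reported above.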
I have a problem while converting the CoreML model with v1.12.4-coreml1:
    mlmodel = ct.convert(
              ^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/coremltools/converters/_converters_entry.py", line 492, in convert
    mlmodel = mil_convert(
              ^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/coremltools/converters/mil/converter.py", line 188, in mil_convert
    return _mil_convert(model, convert_from, convert_to, ConverterRegistry, MLModel, compute_units, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/coremltools/converters/mil/converter.py", line 212, in _mil_convert
    proto, mil_program = mil_convert_to_proto(
                         ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/coremltools/converters/mil/converter.py", line 303, in mil_convert_to_proto
    out = backend_converter(prog, kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/coremltools/converters/mil/converter.py", line 130, in call
    return backend_load(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/coremltools/converters/mil/backend/mil/load.py", line 283, in load
    raise RuntimeError("BlobWriter not loaded")
To have a consistent environment of coremltools, please follow the steps below:
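A minimal sketch of such an environment, assuming conda and the versions shown in the log below (Python 3.8, torch 2.0.0, coremltools 6.3.0); adjust names and versions to your own setup:
conda create -n coremltools python=3.8
conda activate coremltools
pip install torch==2.0.0 coremltools==6.3.0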
Then, the terminal shows the following message when converting the CoreML model:
% python convert_coreml_pytorch.py -checkpoint b18c384nbt-uec-20221121b.ckpt -use-swa
torch version: 2.0.0
coremltools version: 6.3.0
Using coremlmish function: mish_torch_ne
Using coremllogsumexp
Using model: AveragedModel
Tracing model ...
Converting model ...
WARNING:coremltools:Tuple detected at graph output. This will be flattened in the converted model.
Converting PyTorch Frontend ==> MIL Ops: 100%|███████████████████████████████████████████████████████████████████████████████▊| 1538/1541 [00:00<00:00, 3829.70 ops/s]
Running MIL frontend_pytorch pipeline: 100%|██████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 155.49 passes/s]
Running MIL default pipeline: 7%|██████▋ | 4/57 [00:00<00:02, 25.36 passes/s]/Users/chinchangyang/miniconda3/envs/coremltools/lib/python3.8/site-packages/coremltools/converters/mil/mil/passes/defs/preprocess.py:234: UserWarning: Input, 'input.1', of the source model, has been renamed to 'input_1' in the Core ML model.
warnings.warn(msg.format(var.name, new_name))
/Users/chinchangyang/miniconda3/envs/coremltools/lib/python3.8/site-packages/coremltools/converters/mil/mil/passes/defs/preprocess.py:262: UserWarning: Output, '2462', of the source model, has been renamed to 'var_2462' in the Core ML model.
warnings.warn(msg.format(var.name, new_name))
/Users/chinchangyang/miniconda3/envs/coremltools/lib/python3.8/site-packages/coremltools/converters/mil/mil/passes/defs/preprocess.py:262: UserWarning: Output, '2503', of the source model, has been renamed to 'var_2503' in the Core ML model.
warnings.warn(msg.format(var.name, new_name))
/Users/chinchangyang/miniconda3/envs/coremltools/lib/python3.8/site-packages/coremltools/converters/mil/mil/passes/defs/preprocess.py:262: UserWarning: Output, '2506', of the source model, has been renamed to 'var_2506' in the Core ML model.
warnings.warn(msg.format(var.name, new_name))
/Users/chinchangyang/miniconda3/envs/coremltools/lib/python3.8/site-packages/coremltools/converters/mil/mil/passes/defs/preprocess.py:262: UserWarning: Output, '2509', of the source model, has been renamed to 'var_2509' in the Core ML model.
warnings.warn(msg.format(var.name, new_name))
/Users/chinchangyang/miniconda3/envs/coremltools/lib/python3.8/site-packages/coremltools/converters/mil/mil/passes/defs/preprocess.py:262: UserWarning: Output, '2514', of the source model, has been renamed to 'var_2514' in the Core ML model.
warnings.warn(msg.format(var.name, new_name))
Running MIL default pipeline: 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 57/57 [00:33<00:00, 1.69 passes/s]
Running MIL backend_mlprogram pipeline: 100%|███████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 146.64 passes/s]
Input names: ['input_spatial', 'input_global']
Output names: ['output_policy', 'out_value', 'out_miscvalue', 'out_moremiscvalue', 'out_ownership']
Rebuilding model with updated spec ...
Saving model ...
Saved Core ML model at KataGoModel19x19fp16v11.mlpackage
Stupid question: can CoreML3 be used for analyzing games somehow?
Yes, CoreML3 can be used for game analysis: use the analysis command, specify the binary model location, and configure it with the coreml_analysis.cfg file.
See this release page for more details.
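For example, a hypothetical invocation might look like the following (adjust the model and config paths to your own setup; -analysis-threads controls how many analysis queries run in parallel):
./katago analysis -model model.bin.gz -config coreml_analysis.cfg -analysis-threads 2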
Benchmark of v1.13.2-coreml1 on Apple M2 MacBook Air (2022) macOS Ventura 13.4:
./katago benchmark -model v13softplusfixcheckpoint_prev0.bin.gz -config ../cpp/configs/misc/coreml_example.cfg
2023-06-05 12:05:40+0800: Running with following config:
allowResignation = true
coremlDeviceToUseThread0 = 0
coremlDeviceToUseThread1 = 100
coremlDeviceToUseThread2 = 101
lagBuffer = 1.0
logAllGTPCommunication = true
logDir = gtp_logs
logSearchInfo = true
logToStderr = false
maxTimePondering = 60
maxVisits = 500
numNNServerThreadsPerModel = 3
numSearchThreads = 3
ponderingEnabled = false
resignConsecTurns = 3
resignThreshold = -0.90
rules = tromp-taylor
searchFactorAfterOnePass = 0.50
searchFactorAfterTwoPass = 0.25
searchFactorWhenWinning = 0.40
searchFactorWhenWinningThreshold = 0.95
2023-06-05 12:05:40+0800: Loading model and initializing benchmark...
2023-06-05 12:05:40+0800: Testing with default positions for board size: 19
2023-06-05 12:05:40+0800: nnRandSeed0 = 3251216980699932600
2023-06-05 12:05:40+0800: After dedups: nnModelFile0 = v13softplusfixcheckpoint_prev0.bin.gz useFP16 auto useNHWC auto
2023-06-05 12:05:40+0800: Initializing neural net buffer to be size 19 * 19 exactly
2023-06-05 12:05:41.591 katago[23224:11396667] INFO: Compiling model at file:///Users/ohho/Games/Katago/katago-coreml/bin/KataGoModel19x19fp16.mlpackage/
2023-06-05 12:05:41.596 katago[23224:11396666] Metal backend thread 0: Apple M2 Model version 13 v13softplusfixcheckpoint_prev0
2023-06-05 12:05:41.806 katago[23224:11396667] INFO: Creating model with contents file:///var/folders/dl/2t5pq5h11nq76mz228xj4q8r0000gn/T/KataGoModel19x19fp16.mlmodelc
2023-06-05 12:05:54.835 katago[23224:11396667] INFO: Created model: KataGo 19x19 compute precision fp16 model version 13 converted from v13softplusfixcheckpoint_prev0.ckpt
2023-06-05 12:05:54.835 katago[23224:11396667] CoreML backend thread 1: #0-19x19 useFP16 1
2023-06-05 12:05:54.835 katago[23224:11396668] INFO: Compiling model at file:///Users/ohho/Games/Katago/katago-coreml/bin/KataGoModel19x19fp16.mlpackage/
2023-06-05 12:05:55.022 katago[23224:11396668] INFO: Creating model with contents file:///var/folders/dl/2t5pq5h11nq76mz228xj4q8r0000gn/T/KataGoModel19x19fp16_2965B697-16C4-4ED3-B53C-BCB0ACF14AB3.mlmodelc
2023-06-05 12:06:07.539 katago[23224:11396668] INFO: Created model: KataGo 19x19 compute precision fp16 model version 13 converted from v13softplusfixcheckpoint_prev0.ckpt
2023-06-05 12:06:07.539 katago[23224:11396668] CoreML backend thread 2: #1-19x19 useFP16 1
2023-06-05 12:06:10+0800: Loaded config ../cpp/configs/misc/coreml_example.cfg
2023-06-05 12:06:10+0800: Loaded model v13softplusfixcheckpoint_prev0.bin.gz
Testing using 800 visits.
If you have a good GPU, you might increase this using "-visits N" to get more accurate results.
If you have a weak GPU and this is taking forever, you can decrease it instead to finish the benchmark faster.
You are currently using the CoreML version of KataGo.
Your GTP config is currently set to use numSearchThreads = 3
Automatically trying different numbers of threads to home in on the best (board size 19x19):
2023-06-05 12:06:10+0800: GPU 101 finishing, processed 1 rows 1 batches
2023-06-05 12:06:10+0800: GPU 0 finishing, processed 2 rows 2 batches
2023-06-05 12:06:10+0800: GPU 100 finishing, processed 2 rows 2 batches
2023-06-05 12:06:10+0800: nnRandSeed0 = 4965308741991079887
2023-06-05 12:06:10+0800: After dedups: nnModelFile0 = v13softplusfixcheckpoint_prev0.bin.gz useFP16 auto useNHWC auto
2023-06-05 12:06:10+0800: Initializing neural net buffer to be size 19 * 19 exactly
2023-06-05 12:06:11.739 katago[23224:11397134] INFO: Compiling model at file:///Users/ohho/Games/Katago/katago-coreml/bin/KataGoModel19x19fp16.mlpackage/
2023-06-05 12:06:11.739 katago[23224:11397132] Metal backend thread 0: Apple M2 Model version 13 v13softplusfixcheckpoint_prev0
2023-06-05 12:06:11.930 katago[23224:11397134] INFO: Creating model with contents file:///var/folders/dl/2t5pq5h11nq76mz228xj4q8r0000gn/T/KataGoModel19x19fp16_B48B70B3-ADCC-4580-B57B-C37FC7C938DD.mlmodelc
2023-06-05 12:06:24.384 katago[23224:11397134] INFO: Created model: KataGo 19x19 compute precision fp16 model version 13 converted from v13softplusfixcheckpoint_prev0.ckpt
2023-06-05 12:06:24.384 katago[23224:11397134] CoreML backend thread 2: #2-19x19 useFP16 1
2023-06-05 12:06:24.384 katago[23224:11397133] INFO: Compiling model at file:///Users/ohho/Games/Katago/katago-coreml/bin/KataGoModel19x19fp16.mlpackage/
2023-06-05 12:06:24.571 katago[23224:11397133] INFO: Creating model with contents file:///var/folders/dl/2t5pq5h11nq76mz228xj4q8r0000gn/T/KataGoModel19x19fp16_4F593316-5CF5-4E7B-958E-E4B94EC9DC7E.mlmodelc
2023-06-05 12:06:37.051 katago[23224:11397133] INFO: Created model: KataGo 19x19 compute precision fp16 model version 13 converted from v13softplusfixcheckpoint_prev0.ckpt
2023-06-05 12:06:37.051 katago[23224:11397133] CoreML backend thread 1: #3-19x19 useFP16 1
Possible numbers of threads to test: 3, 4, 5, 6, 8, 10, 12, 16, 20, 24, 32, 40, 48, 64,
numSearchThreads = 8: 10 / 10 positions, visits/s = 248.18 nnEvals/s = 209.99 nnBatches/s = 134.05 avgBatchSize = 1.57 (32.5 secs)
numSearchThreads = 24: 10 / 10 positions, visits/s = 243.35 nnEvals/s = 200.05 nnBatches/s = 54.90 avgBatchSize = 3.64 (33.8 secs)
numSearchThreads = 5: 10 / 10 positions, visits/s = 268.25 nnEvals/s = 221.81 nnBatches/s = 188.68 avgBatchSize = 1.18 (30.0 secs)
numSearchThreads = 12: 10 / 10 positions, visits/s = 255.48 nnEvals/s = 216.13 nnBatches/s = 98.10 avgBatchSize = 2.20 (31.7 secs)
numSearchThreads = 4: 10 / 10 positions, visits/s = 240.12 nnEvals/s = 200.59 nnBatches/s = 199.22 avgBatchSize = 1.01 (33.4 secs)
numSearchThreads = 6: 10 / 10 positions, visits/s = 243.78 nnEvals/s = 203.54 nnBatches/s = 151.87 avgBatchSize = 1.34 (33.0 secs)
Ordered summary of results:
numSearchThreads = 4: 10 / 10 positions, visits/s = 240.12 nnEvals/s = 200.59 nnBatches/s = 199.22 avgBatchSize = 1.01 (33.4 secs) (EloDiff baseline)
numSearchThreads = 5: 10 / 10 positions, visits/s = 268.25 nnEvals/s = 221.81 nnBatches/s = 188.68 avgBatchSize = 1.18 (30.0 secs) (EloDiff +36)
numSearchThreads = 6: 10 / 10 positions, visits/s = 243.78 nnEvals/s = 203.54 nnBatches/s = 151.87 avgBatchSize = 1.34 (33.0 secs) (EloDiff -6)
numSearchThreads = 8: 10 / 10 positions, visits/s = 248.18 nnEvals/s = 209.99 nnBatches/s = 134.05 avgBatchSize = 1.57 (32.5 secs) (EloDiff -10)
numSearchThreads = 12: 10 / 10 positions, visits/s = 255.48 nnEvals/s = 216.13 nnBatches/s = 98.10 avgBatchSize = 2.20 (31.7 secs) (EloDiff -22)
numSearchThreads = 24: 10 / 10 positions, visits/s = 243.35 nnEvals/s = 200.05 nnBatches/s = 54.90 avgBatchSize = 3.64 (33.8 secs) (EloDiff -110)
Based on some test data, each speed doubling gains perhaps ~250 Elo by searching deeper.
Based on some test data, each thread costs perhaps 7 Elo if using 800 visits, and 2 Elo if using 5000 visits (by making MCTS worse).
So APPROXIMATELY based on this benchmark, if you intend to do a 5 second search:
numSearchThreads = 4: (baseline)
numSearchThreads = 5: +36 Elo (recommended)
numSearchThreads = 6: -6 Elo
numSearchThreads = 8: -10 Elo
numSearchThreads = 12: -22 Elo
numSearchThreads = 24: -110 Elo
If you care about performance, you may want to edit numSearchThreads in ../cpp/configs/misc/coreml_example.cfg based on the above results!
If you intend to do much longer searches, configure the seconds per game move you expect with the '-time' flag and benchmark again.
If you intend to do short or fixed-visit searches, use lower numSearchThreads for better strength, high threads will weaken strength.
If interested see also other notes about performance and mem usage in the top of ../cpp/configs/misc/coreml_example.cfg
2023-06-05 12:09:53+0800: GPU 100 finishing, processed 13838 rows 9213 batches
2023-06-05 12:09:53+0800: GPU 0 finishing, processed 12823 rows 8327 batches
2023-06-05 12:09:53+0800: GPU 101 finishing, processed 13870 rows 9125 batches
Compile failed on macOS 14:
/Users/dx2880/workspace/KataGo/cpp/xcode/KataGoMetalTest/metalbackendtest.swift:71:49: error: 'Float16' is unavailable in macOS
let inputPointer = UnsafeMutablePointer<Float16>
Swift.Float16:4:23: note: 'Float16' has been explicitly marked unavailable here
@frozen public struct Float16 {
^
/Users/dx2880/workspace/KataGo/cpp/xcode/KataGoMetalTest/metalbackendtest.swift:73:27: error: cannot assign value of type 'Int' to subscript of type 'Float16'
inputPointer[0] = -1
^~
/Users/dx2880/workspace/KataGo/cpp/xcode/KataGoMetalTest/metalbackendtest.swift:74:27: error: cannot assign value of type 'Int' to subscript of type 'Float16'
inputPointer[1] = 0
^
/Users/dx2880/workspace/KataGo/cpp/xcode/KataGoMetalTest/metalbackendtest.swift:75:27: error: cannot assign value of type 'Int' to subscript of type 'Float16'
inputPointer[2] = 1
^
/Users/dx2880/workspace/KataGo/cpp/xcode/KataGoMetalTest/metalbackendtest.swift:76:27: error: cannot assign value of type 'Double' to subscript of type 'Float16'
inputPointer[3] = 10.38
^~~~~
/Users/dx2880/workspace/KataGo/cpp/xcode/KataGoMetalTest/metalbackendtest.swift:77:27: error: cannot assign value of type 'Double' to subscript of type 'Float16'
inputPointer[4] = 10.4
^~~~
/Users/dx2880/workspace/KataGo/cpp/xcode/KataGoMetalTest/metalbackendtest.swift:93:43: error: 'Float16' is unavailable in macOS
let buffer = UnsafeMutablePointer<Float16>
Swift.Float16:4:23: note: 'Float16' has been explicitly marked unavailable here
@frozen public struct Float16 {
^
/Users/dx2880/workspace/KataGo/cpp/xcode/KataGoMetalTest/metalbackendtest.swift:98:30: error: cannot convert value of type 'Float16' to expected argument type 'Double'
XCTAssertEqual(buffer[0], -0.30340147018432617, accuracy: 1e-4)
^
/Users/dx2880/workspace/KataGo/cpp/xcode/KataGoMetalTest/metalbackendtest.swift:99:30: error: cannot convert value of type 'Float16' to expected argument type 'Double'
XCTAssertEqual(buffer[1], 0.0, accuracy: 1e-4)
^
/Users/dx2880/workspace/KataGo/cpp/xcode/KataGoMetalTest/metalbackendtest.swift:100:30: error: cannot convert value of type 'Float16' to expected argument type 'Double'
XCTAssertEqual(buffer[2], 0.8650983572006226, accuracy: 1e-4)
^
/Users/dx2880/workspace/KataGo/cpp/xcode/KataGoMetalTest/metalbackendtest.swift:101:30: error: cannot convert value of type 'Float16' to expected argument type 'Double'
XCTAssertEqual(buffer[3], 10.380000114440918, accuracy: 1e-4)
^
/Users/dx2880/workspace/KataGo/cpp/xcode/KataGoMetalTest/metalbackendtest.swift:102:30: error: cannot convert value of type 'Float16' to expected argument type 'Double'
XCTAssertEqual(buffer[4], 10.4, accuracy: 1e-4)
The Float16 type is not supported on your macOS (Swift's Float16 is unavailable on x86_64). This problem has been fixed in the new commit: https://github.com/ChinChangYang/KataGo/commit/73814ffc07730b6b19bcd62f24572c9ae6898cb3
In that commit, the Float16 test has been removed for x86_64.
Hi, I am a beginner at coding, just trying to follow the instructions. I use an M1 Pro and ran into several problems.
1. When trying to convert the checkpoint file to a CoreML model, it shows "[W NNPACK.cpp:64] Could not initialize NNPACK! Reason: Unsupported hardware.":
(coremltools) (coremltools-venv) jameswang@Jamess-MacBook-Pro-2 KataGo-1.13.2-coreml1 % python python/convert_coreml_pytorch.py -checkpoint v13softplusfixcheckpoint_prev0.ckpt -use-swa
Torch version 2.1.1 has not been tested with coremltools. You may run into unexpected errors. Torch 2.1.0 is the most recent version that has been tested.
/Users/jameswang/opt/anaconda3/envs/coremltools/lib/python3.8/site-packages/coremltools/models/_deprecation.py:27: FutureWarning: Function _TORCH_OPS_REGISTRY.contains is deprecated and will be removed in 7.2.; Please use coremltools.converters.mil.frontend.torch.register_torch_op
warnings.warn(msg, category=FutureWarning)
/Users/jameswang/opt/anaconda3/envs/coremltools/lib/python3.8/site-packages/coremltools/models/_deprecation.py:27: FutureWarning: Function _TORCH_OPS_REGISTRY.delitem is deprecated and will be removed in 7.2.; Please use coremltools.converters.mil.frontend.torch.register_torch_op
warnings.warn(msg, category=FutureWarning)
torch version: 2.1.1
coremltools version: 7.1
Using coremlmish function: mish_torch_ne
Using coremllogsumexp
Using model: AveragedModel
Model version: 13
Tracing model ...
[W NNPACK.cpp:64] Could not initialize NNPACK! Reason: Unsupported hardware.
2. When trying to relocate the binary and CoreML model to the run directory, it shows:
(coremltools) (coremltools-venv) jameswang@Jamess-MacBook-Pro-2 xcode % mv ../../v13softplusfixcheckpoint_prev0/v13softplusfixcheckpoint_prev0.bin.gz DerivedData/KataGo/Build/Products/Release/
mv ../../KataGoModel19x19fp16.mlpackage DerivedData/KataGo/Build/Products/Release/
mv: rename ../../v13softplusfixcheckpoint_prev0/v13softplusfixcheckpoint_prev0.bin.gz to DerivedData/KataGo/Build/Products/Release/: No such file or directory
mv: rename ../../KataGoModel19x19fp16.mlpackage to DerivedData/KataGo/Build/Products/Release/: No such file or directory
3. When I try to benchmark, it shows:
(coremltools) (coremltools-venv) jameswang@Jamess-MacBook-Pro-2 xcode % ./katago benchmark -model v13softplusfixcheckpoint_prev0.bin.gz -config ../../../../../../configs/misc/coreml_example.cfg
zsh: no such file or directory: ./katago
Can someone help me figure it out?
It is an environment issue. I have resolved this problem with an easier way to set up the CoreML backend:
cd cpp/xcode
xcodebuild -derivedDataPath DerivedData -scheme katago -configuration Release build
ln -s ../../../../../configs/misc/coreml_example.cfg cpp/xcode/DerivedData/Build/Products/Release/gtp.cfg
mkdir -p models
cd models
wget https://github.com/ChinChangYang/KataGo/releases/download/v1.13.2-coreml1/kata1-b18c384nbt-s7709731328-d3715293823.bin.gz
ln -s ../../../../../../models/kata1-b18c384nbt-s7709731328-d3715293823.bin.gz ../cpp/xcode/DerivedData/Build/Products/Release/model.bin.gz
mkdir -p models
cd models
wget https://github.com/ChinChangYang/KataGo/releases/download/v1.13.2-coreml1/KataGoModel19x19fp16v14s7709731328.mlpackage.zip
unzip KataGoModel19x19fp16v14s7709731328.mlpackage.zip
ln -s ../../../../../../models/KataGoModel19x19fp16v14s7709731328.mlpackage ../cpp/xcode/DerivedData/Build/Products/Release/KataGoModel19x19fp16.mlpackage
ln -s ../../../../../tests cpp/xcode/DerivedData/Build/Products/Release/tests
cd cpp/xcode
xcodebuild -derivedDataPath DerivedData -scheme katago -configuration Release test
You'll find the katago executable in the cpp/xcode/DerivedData/Build/Products/Release directory. These steps have been verified with GitHub Actions, as shown in the attached screenshot, where all checks passed successfully.
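If everything above succeeds, a quick sanity check would be to run the benchmark from the build products directory (assuming the gtp.cfg, model.bin.gz, and mlpackage symlinks above are in place; adjust paths if your layout differs):
cd cpp/xcode/DerivedData/Build/Products/Release
./katago benchmark -model model.bin.gz -config gtp.cfg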
Thanks for answering! But there is another problem.
When I try to run:
cd cpp/xcode
xcodebuild -derivedDataPath DerivedData -scheme katago -configuration Release build
It shows:
jameswang@Jamess-MacBook-Pro ~ % cd cpp/xcode
xcodebuild -derivedDataPath DerivedData -scheme katago -configuration Release build
cd: no such file or directory: cpp/xcode
Command line invocation:
/Applications/Xcode.app/Contents/Developer/usr/bin/xcodebuild -derivedDataPath DerivedData -scheme katago -configuration Release build
User defaults from command line:
IDEDerivedDataPathOverride = /Users/jameswang/DerivedData
IDEPackageSupportUseBuiltinSCM = YES
2023-11-21 18:29:16.799 xcodebuild[5998:63804] Writing error result bundle to /var/folders/k6/9_dfyff157zfdkz5pt4_vh740000gn/T/ResultBundle_2023-21-11_18-29-0016.xcresult
xcodebuild: error: The directory /Users/jameswang does not contain an Xcode project, workspace or package.
Looks like you haven't completed the first step.
wget https://github.com/ChinChangYang/KataGo/archive/7cbff847aa94baefb10012b48209aaa08697af7b.zip
unzip 7cbff847aa94baefb10012b48209aaa08697af7b.zip
cd KataGo-7cbff847aa94baefb10012b48209aaa08697af7b
Then, follow the second step to the fifth step in my previous comment.
Build with Xcode: [See the previous comment]
Set Up Configuration: [See the previous comment]
Set Up the Network: [See the previous comment]
Set Up the CoreML Model: [See the previous comment]
For the third step, it shows:
jameswang@Jamess-MacBook-Pro models % ln -s ../../../../../../models/kata1-b18c384nbt-s7709731328-d3715293823.bin.gz ../cpp/xcode/DerivedData/Build/Products/Release/model.bin.gz
ln: ../cpp/xcode/DerivedData/Build/Products/Release/model.bin.gz: No such file or directory
I am so dumb at coding...
Sorry for the confusion. I didn't notice the path problem. There is also a problem with linking the framework in my previous comment's steps, so let's use CMake instead. I have put all of the commands together as follows:
brew install ninja
cd cpp
mv CMakeLists.txt-macos CMakeLists.txt
mkdir build
cd build
cmake -G Ninja -DNO_GIT_REVISION=1 ../
ninja
cd ../..
ln -s ../configs/misc/coreml_example.cfg cpp/build/gtp.cfg
mkdir -p models
cd models
wget https://github.com/ChinChangYang/KataGo/releases/download/v1.13.2-coreml1/kata1-b18c384nbt-s7709731328-d3715293823.bin.gz
ln -s ../../models/kata1-b18c384nbt-s7709731328-d3715293823.bin.gz ../cpp/build/model.bin.gz
cd ..
mkdir -p models
cd models
wget https://github.com/ChinChangYang/KataGo/releases/download/v1.13.2-coreml1/KataGoModel19x19fp16v14s7709731328.mlpackage.zip
unzip KataGoModel19x19fp16v14s7709731328.mlpackage.zip
ln -s ../../models/KataGoModel19x19fp16v14s7709731328.mlpackage ../cpp/build/KataGoModel19x19fp16.mlpackage
cd ..
ln -s ../tests cpp/build/tests
cd cpp/build
./katago runnnlayertests
./katago runoutputtests
./katago runnnontinyboardtest model.bin.gz false false 0 false
./katago runnnsymmetriestest model.bin.gz false false false
./katago runownershiptests gtp.cfg model.bin.gz
./katago benchmark -model model.bin.gz -config gtp.cfg
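After the benchmark, the same layout can be used to start the engine for GTP play (standard KataGo usage; this assumes the model.bin.gz and gtp.cfg symlinks created above):
./katago gtp -model model.bin.gz -config gtp.cfg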
Thanks for being so helpful, but I still cannot run it. It says:
Cannot write to ‘kata1-b18c384nbt-s7709731328-d3715293823.bin.gz’ (Success).
ln: ../cpp/build/model.bin.gz: No such file or directory
mkdir: models: Read-only file system
cd: no such file or directory: models
--2023-11-21 20:42:24-- https://github.com/ChinChangYang/KataGo/releases/download/v1.13.2-coreml1/KataGoModel19x19fp16v14s7709731328.mlpackage.zip
Resolving github.com (github.com)... 140.82.112.3
Connecting to github.com (github.com)|140.82.112.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://objects.githubusercontent.com/github-production-release-asset-2e65be/522444875/63aa44d9-b307-4128-9fbd-589e823d9a54?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20231122%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20231122T014224Z&X-Amz-Expires=300&X-Amz-Signature=43e36fc826e144d0e5ef401e1800d9720808a66c9614a6d4301a8346d292f152&X-Amz-SignedHeaders=host&actor_id=0&key_id=0&repo_id=522444875&response-content-disposition=attachment%3B%20filename%3DKataGoModel19x19fp16v14s7709731328.mlpackage.zip&response-content-type=application%2Foctet-stream [following]
--2023-11-21 20:42:24-- https://objects.githubusercontent.com/github-production-release-asset-2e65be/522444875/63aa44d9-b307-4128-9fbd-589e823d9a54?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20231122%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20231122T014224Z&X-Amz-Expires=300&X-Amz-Signature=43e36fc826e144d0e5ef401e1800d9720808a66c9614a6d4301a8346d292f152&X-Amz-SignedHeaders=host&actor_id=0&key_id=0&repo_id=522444875&response-content-disposition=attachment%3B%20filename%3DKataGoModel19x19fp16v14s7709731328.mlpackage.zip&response-content-type=application%2Foctet-stream
Resolving objects.githubusercontent.com (objects.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.108.133, ...
Connecting to objects.githubusercontent.com (objects.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 53443280 (51M) [application/octet-stream]
KataGoModel19x19fp16v14s7709731328.mlpackage.zip: Read-only file system
Cannot write to ‘KataGoModel19x19fp16v14s7709731328.mlpackage.zip’ (Success).
unzip: cannot find or open KataGoModel19x19fp16v14s7709731328.mlpackage.zip, KataGoModel19x19fp16v14s7709731328.mlpackage.zip.zip or KataGoModel19x19fp16v14s7709731328.mlpackage.zip.ZIP.
ln: ../cpp/build/KataGoModel19x19fp16.mlpackage: No such file or directory
ln: cpp/build/tests: No such file or directory
cd: no such file or directory: cpp/build
Do you have any idea why this happens?
Looks like a process is blocking the file system. Could you clean up the downloaded files/directories, terminate all programs, and restart your MacBook? After rebooting, please try again with the commands in my previous comment.
What's the status? This fork's instructions provide a huge gain on Mac over the mainline katago: https://github.com/ChinChangYang/KataGo/blob/metal-coreml-stable/docs/CoreML_Backend.md
Hi,
I'd like to run KataGo on my M1 MacBook with native support for CoreML, but sadly I don't know the framework, plus I have no experience with macOS development.
Hence I'd like to create this issue and ask the community if they are willing to financially support the development of a ~CoreML~ Metal GPU backend. @lightvector said it would be very time-consuming.
Thoughts?
P.S.: Here is the benchmark output of g170-b40c256x2-s5095420928-d1229425124.bin.gz on my MacBook Pro 14" M1: