The future of KataGo on macOS

ghost commented 4 years ago

Hello,

OpenCL is deprecated in macOS 10.15, and it will probably be removed in the near future.

Are there plans to create a CoreML backend? That is, does KataGo have a future on macOS?

I'm asking because I'm considering buying a MacBook (have not decided on the model), and it should last a few years.

As far as I know the iOS app "A Master of Go" uses a CoreML backend, so it should be possible, although I'm not sure whether CoreML can be used with C++.

Thank you.

isty2e commented 4 years ago

This leaves another question: Will @y-ich eventually release a mac version of the app?

lightvector commented 4 years ago

Maybe KataGo can support it in the future. But I can't guarantee that I'll spend all the effort to write yet another GPU backend. It's an enormous amount of work to invest for me. It's sad to see Apple dropping support for an open standard that currently works pretty well across a wide range of devices. From my perspective, all it does it create yet another burden with no benefit.

y-ich commented 4 years ago

In fact, KataGo on Core ML works well on macOS. I may release "A Master of Go" on macOS when Apple stops providing OpenCL. The prototype app is just compiled as Catalyst. No mouse-over support. Currently Lizzie with KataGo is more useful than it.

isty2e commented 4 years ago

@y-ich Asking just out of curiosity, what about the performance? Is the CoreML version faster than OpenCL one?

ghost commented 4 years ago

@y-ich That is good to hear! But I agree that KataGo with a GTP interface is more useful because it can run under Lizzie or Sabaki.

@lightvector It's unfortunate that Apple is removing support for standards and doing its own thing, but unfortunately that's how it is. Without a CoreML backend both KataGo nor Leela Zero will be completely unusable on macOS in the near future. I want to help but am neither proficient with C++ nor with CoreML…

y-ich commented 4 years ago

@isty2e san,

Maybe my Mac (Late 2012) is too old to talk about reference performance, but I think that basically optimized OpenCL is much faster than Core ML. 7 nnBatches/s by OpenCL KataGo, 3 evaluations per sec. by CoreML KataGo, on my Mac.

l1t1 commented 4 years ago

did someone test go engine on the vmware or subsystem, such as ubuntu of windows 10?

lightvector commented 4 years ago

(deleted a comment that was a duplicate of a comment posted in another issue and is off-topic for this issue).

dappstore123 commented 4 years ago

@y-ich is coreml backend opensource ？ https://github.com/y-ich/KataGo/blob/browser/cpp/neuralnet/tfjsbackend.cpp

y-ich commented 4 years ago

@dappstore123 san,

No, not yet.

PJTraill commented 4 years ago

Is there any chance that someone could create a generic OpenCL wrapper for CoreML, with at least sufficient functionality for KataGo? (This may well reveal how little I know about macOS, OpenCL & CoreML, but it is the first thing that occurs to me as a retired developer.)

ghost commented 3 years ago

The M1 processor of Apple's new MacBooks includes a Neural Engine, which should be very capable of running KataGo using Core ML.

Unfortunately I'm not familiar with either C++ or neural networks but if there's anything I can do to help… Thank you.

lian9998 commented 3 years ago

at least katago has a cpu backend. I belive apple will not remove c++ support in near future.

peepo commented 3 years ago

M1 GPU is said to be ahead of the field: Neural Engine, with 16 Cores and the capability of doing 11 trillion operations per second.

does anyone have evidence katago plays pro on M1?

tx

tbiscuit commented 2 years ago

It plays... at about 1/10th the performance as on a decent NVidia GPU, at least as I write this with the benchmark running on both. But I don't have the most recent M1s, this is the little 13 inch laptop. That said, I don't have high confidence in the larger laptops since it seems that it's just not really taking advantage of the 8-core GPU.

anotheros commented 2 years ago

I test on M1, about 400V / s, 40b weight

ChinChangYang commented 2 years ago

I tested katago on Apple M1 Pro:

% cpp/katago version
KataGo v1.11.0
Git revision: 714ad713a829e2fca465ec4113bb1afa4c6ec543
Compile Time: Jul 31 2022 11:35:36
Using OpenCL backend

I got visits/s = 196.37 from the model: g170-b40c256x2-s5095420928-d1229425124:

% cpp/katago benchmark -config cpp/configs/gtp_example.cfg 
...
2022-08-01 13:44:57+0800: Initializing neural net buffer to be size 19 * 19 exactly
2022-08-01 13:44:58+0800: Found OpenCL Platform 0: Apple (Apple) (OpenCL 1.2 (Jun 17 2022 18:58:05))
2022-08-01 13:44:58+0800: Found 2 device(s) on platform 0 with type CPU or GPU or Accelerator
2022-08-01 13:44:58+0800: Found OpenCL Device 0: Apple M1 Pro (Intel) (score 102)
2022-08-01 13:44:58+0800: Found OpenCL Device 1: Apple M1 Pro (Apple) (score 1000102)
2022-08-01 13:44:58+0800: Creating context for OpenCL Platform: Apple (Apple) (OpenCL 1.2 (Jun 17 2022 18:58:05))
2022-08-01 13:44:58+0800: Using OpenCL Device 1: Apple M1 Pro (Apple) OpenCL 1.2  (Extensions: cl_APPLE_SetMemObjectDestructor cl_APPLE_ContextLoggingFunctions cl_APPLE_clut cl_APPLE_query_kernel_names cl_APPLE_gl_sharing cl_khr_gl_event cl_khr_byte_addressable_store cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_3d_image_writes cl_khr_image2d_from_buffer cl_khr_depth_images )
2022-08-01 13:44:58+0800: Loaded tuning parameters from: /Users/chinchangyang/.katago/opencltuning/tune8_gpuAppleM1Pro_x19_y19_c256_mv8.txt
2022-08-01 13:44:58+0800: OpenCL backend thread 0: Model version 8
2022-08-01 13:44:58+0800: OpenCL backend thread 0: Model name: g170-b40c256x2-s5095420928-d1229425124
2022-08-01 13:45:01+0800: OpenCL backend thread 0: FP16Storage true FP16Compute false FP16TensorCores false

Possible numbers of threads to test: 1, 2, 3, 4, 5, 6, 8, 10, 12, 16, 20, 24, 32, 

numSearchThreads =  5: 10 / 10 positions, visits/s = 127.91 nnEvals/s = 105.35 nnBatches/s = 42.33 avgBatchSize = 2.49 (62.9 secs)
numSearchThreads = 12: 10 / 10 positions, visits/s = 178.48 nnEvals/s = 146.17 nnBatches/s = 24.69 avgBatchSize = 5.92 (45.4 secs)
numSearchThreads = 10: 10 / 10 positions, visits/s = 168.67 nnEvals/s = 138.70 nnBatches/s = 28.07 avgBatchSize = 4.94 (48.0 secs)
numSearchThreads = 20: 10 / 10 positions, visits/s = 196.37 nnEvals/s = 169.76 nnBatches/s = 17.37 avgBatchSize = 9.77 (41.7 secs)
numSearchThreads =  8: 10 / 10 positions, visits/s = 163.03 nnEvals/s = 134.23 nnBatches/s = 33.88 avgBatchSize = 3.96 (49.5 secs)
numSearchThreads = 16: 10 / 10 positions, visits/s = 195.88 nnEvals/s = 163.82 nnBatches/s = 20.83 avgBatchSize = 7.86 (41.6 secs)

Ordered summary of results: 

numSearchThreads =  5: 10 / 10 positions, visits/s = 127.91 nnEvals/s = 105.35 nnBatches/s = 42.33 avgBatchSize = 2.49 (62.9 secs) (EloDiff baseline)
numSearchThreads =  8: 10 / 10 positions, visits/s = 163.03 nnEvals/s = 134.23 nnBatches/s = 33.88 avgBatchSize = 3.96 (49.5 secs) (EloDiff +70)
numSearchThreads = 10: 10 / 10 positions, visits/s = 168.67 nnEvals/s = 138.70 nnBatches/s = 28.07 avgBatchSize = 4.94 (48.0 secs) (EloDiff +70)
numSearchThreads = 12: 10 / 10 positions, visits/s = 178.48 nnEvals/s = 146.17 nnBatches/s = 24.69 avgBatchSize = 5.92 (45.4 secs) (EloDiff +78)
numSearchThreads = 16: 10 / 10 positions, visits/s = 195.88 nnEvals/s = 163.82 nnBatches/s = 20.83 avgBatchSize = 7.86 (41.6 secs) (EloDiff +90)
numSearchThreads = 20: 10 / 10 positions, visits/s = 196.37 nnEvals/s = 169.76 nnBatches/s = 17.37 avgBatchSize = 9.77 (41.7 secs) (EloDiff +65)
...

peepo commented 2 years ago

Which Apple M1 Pro box?

score 1000102 seems high, or even out of range?

ChinChangYang commented 2 years ago

Which Apple M1 Pro box?

score 1000102 seems high, or even out of range?

It is Macbook Pro with M1 Pro chip.

Hologos commented 2 years ago

Which Apple M1 Pro box?

score 1000102 seems high, or even out of range?

I can confirm, I get the same value on my Macbook Pro 14' M1.

peepo commented 2 years ago

any ideas why the score is around 20x better than other OpenCL scores?

lightvector commented 2 years ago

The score doesn't have any numerical meaning, there is no such thing as "20x better", the same way that if some software had a version 2.2 it would not make sense to say it is "10% more recent" than a software version 2.0. It's just some hardcoded logic to pick which OpenCL platform by default if the user doesn't specify, based on the name of the platform.

peepo commented 2 years ago

OpenCL scores: https://browser.geekbench.com/opencl-benchmarks

if it is not a score, perhaps a different word could be used?

lightvector commented 2 years ago

Heh, it never even crossed my mind that anyone would associate an automatic number logged in KataGo's output to have some relation to some specific benchmark run by a major but otherwise arbitrary benchmarking website that I've possibly never looked at myself. Sure, I'll consider it, but "score" is a very common and generic word that is used by a lot of different things, so I'd be interested to hear if this confusion is actually common.

tbiscuit commented 2 years ago

"score" means a written piece of music, to mark or scratch, a number of points made in a game, or, the result of a benchmark or test. Given the context, any native speaker of English is going to presume you mean the result of a benchmark or test. It's clearly not a piece of music, it's not marking or scratching anything, and it's not interactive so there's no active game being played nor potential for it to be referring to some other game or sporting event. I'm curious why the word was picked? "Version" is usually the English word used for "arbitrary changing number that is totally not a score."

lightvector commented 2 years ago

Hehe, well "version" would be an even more inaccurate word, because this is definitely not a number that labels different varieties or kinds or... well... versions, of a given thing. :)

What word would you use to describe a numerical value, computed by adding up several component values based on the device's name, vendor, accelerator type, and how recent a version of OpenCL it supports (these are the main pieces of metadata that a device will report about itself that we have available), to guess at an overall likely best default device to choose?

Normally I would describe the above by saying that we are heuristically scoring the device based on testing the contents of these metadata fields. For example the vendor being Nvidia contributes a higher score to the device than if it's Intel, because usually if a user has an Nvidia device and an Intel device on the same machine, they probably prefer the former because the former is likely some high-powered GPU actually designed for running large parallel computations, whereas the latter is likely some cheap onboard integrated graphics chip.

The numerical magnitude of the "scores" is of course arbitrary except for the fact that they are a hardcoded guess at the likely preferability of different things, where larger is more likely preferable.

lightvector commented 2 years ago

Anyways, sounds like I should change the label in the output anyways, just to make it clearer that it doesn't have anything to do with anything external. Maybe it should be called "ordering" or something.