Add VisionKit bindings - Githubissues

Is your feature request related to a problem? Please describe.
The VisionKit APIs seem to be more actively maintained than Vision. For example, starting in Sonoma, VisionKit's text recognition supports vertical text for CJK languages (Chinese, Japanese, Korean), which Vision does not yet support.

Describe the solution you'd like
VisionKit bindings available for use on macOS 13.0+.

Describe alternatives you've considered
There is no real alternative right now other than using the less frequently updated Vision API or invoking external command-line tools.

Additional context
The docs state that VisionKit "is only available in Catalyst", but that doesn't seem to be the case (anymore?) from Ventura onwards. There are apps using the new APIs on macOS (e.g. https://github.com/Shakshi3104/LiTeX; TextSniper also seems to use them, according to a friend's reverse engineering). Apple's API docs also claim it is available on macOS: https://developer.apple.com/documentation/visionkit/imageanalyzer

The documentation claims that it is available on macOS 13, but:

The headers included in the SDK (Xcode 15.3 beta 2) are empty.
The Objective-C classes documented on the website are only available on iOS or through Mac Catalyst.
The class you link to is a Swift class that cannot be used from Objective-C.

Currently PyObjC can only be used with interfaces that can be used in Objective-C code. It might be possible to expose Swift frameworks as well, but this likely requires significant engineering to design and implement.

@ronaldoussoren right, sorry for not realizing and wasting your time!

On another note (this is probably not something to officially support or implement, I guess): I noticed there is an underlying Objective-C implementation of the functionality I need in VisionKit. It is not documented, but WebKit uses it directly. I think I got most of the way there: this code seems to work up to the processRequest call, where I'm not sure how to write the registerMetaDataForSelector call properly (I don't know much about Objective-C):

import Cocoa
import objc

ns_image = Cocoa.NSImage.alloc().initWithContentsOfFile_("/Users/aurora/Downloads/tg_image_1633323779.jpeg")

objc.loadBundle('VisionKit', globals(), '/System/Library/Frameworks/VisionKit.framework')
req=VKImageAnalyzerRequest.alloc().initWithImage_requestType_(ns_image, 1)
req.setLocales_('ja-JA')
objc.registerMetaDataForSelector(
        b"VKImageAnalyzer",
        b"processRequest:updateHandler:completionHandler:",
        {
            "arguments": {
                4: {
                    "callable": {
                        "retval": {"type": b"v"},
                        "arguments": {
                            0: {"type": b"^v"},
                            1: {"type": b"@"},
                            2: {"type": b"@"},
                            3: {"type": b"@"},
                        },
                    }
                }
            }
        },
)
analyzer=VKImageAnalyzer.alloc().init()

def update(self, progress:float):
    pass

def process(self, analysis:VKImageAnalysis):
    pass

analyzer.processRequest_updateHandler_completionHandler_(req, update, process)

According to the WebKit source, processRequest is declared like this:

(VKImageAnalysisRequestID)processRequest:(VKImageAnalyzerRequest *)request progressHandler:(void (^_Nullable)(double progress))progressHandler completionHandler:(void (^)(VKImageAnalysis *_Nullable analysis, NSError *_Nullable error))completionHandler;

How should I define it in registerMetaDataForSelector?

No need to apologise, it wouldn't be the first time that I missed a new API.

You got it almost right, but the method has two arguments that are blocks. Both return "void"; the first takes a single argument of type double, and the second takes two arguments, both of which are Objective-C objects:

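objc.registerMetaDataForSelector(
        b"VKImageAnalyzer",
        # Selector spelled as in the WebKit declaration quoted above
        b"processRequest:progressHandler:completionHandler:",
        {
            "arguments": {
                # progressHandler: void (^)(double progress)
                3: {
                    "callable": {
                        "retval": {"type": b"v"},
                        "arguments": {
                            0: {"type": b"^v"},
                            1: {"type": b"d"},
                        },
                    }
                },
                # completionHandler: void (^)(VKImageAnalysis *analysis, NSError *error)
                4: {
                    "callable": {
                        "retval": {"type": b"v"},
                        "arguments": {
                            0: {"type": b"^v"},
                            1: {"type": b"@"},
                            2: {"type": b"@"},
                        },
                    }
                },
            }
        },
)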
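
The indices in that dictionary follow PyObjC's convention of counting the two implicit Objective-C arguments: index 0 is self and index 1 is the selector, so the first explicit argument is index 2 and the two blocks land at 3 and 4. Inside each "callable" entry, argument 0 is the block itself, which is why it is encoded as b"^v". A minimal sketch of a helper that builds the same structure follows; block_metadata and selector_metadata are hypothetical names for illustration and are not part of PyObjC.

```python
def block_metadata(retval, *arg_types):
    """Describe a block signature in the shape PyObjC metadata uses.

    Argument 0 of every block is the block itself (b"^v"), so the
    caller only lists the arguments visible in the declaration.
    """
    arguments = {0: {"type": b"^v"}}
    for position, encoding in enumerate(arg_types, start=1):
        arguments[position] = {"type": encoding}
    return {"callable": {"retval": {"type": retval}, "arguments": arguments}}


def selector_metadata(selector, blocks):
    """Build the metadata dict for the block arguments of `selector`.

    `blocks` maps the 0-based position of an explicit argument to its
    block metadata. Indices are shifted by 2 to skip self and _cmd.
    """
    explicit_count = selector.count(b":")
    metadata = {"arguments": {}}
    for position, block in blocks.items():
        if not 0 <= position < explicit_count:
            raise ValueError(f"{selector!r} has no explicit argument {position}")
        metadata["arguments"][position + 2] = block
    return metadata


SELECTOR = b"processRequest:progressHandler:completionHandler:"
METADATA = selector_metadata(
    SELECTOR,
    {
        1: block_metadata(b"v", b"d"),         # void (^)(double progress)
        2: block_metadata(b"v", b"@", b"@"),   # void (^)(analysis, error)
    },
)
```

The resulting METADATA value has the same shape as the dictionary passed as the third argument of objc.registerMetaDataForSelector above.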

I haven't used the Vision framework myself yet, but it does seem to have some options for recognizing text, see https://developer.apple.com/documentation/vision/vnrecognizetextrequest?language=objc and https://developer.apple.com/documentation/vision/recognizing_text_in_images?language=objc (both have sample code in Swift, but hopefully that has enough context to be clear how to reproduce this in Python)

Thanks, that did the trick! For what it's worth, I do have a Vision framework option in my OCR program, but since it is for Japanese, where vertical text support is really helpful, I wanted to get the VisionKit route working too. It seems Apple updated VisionKit with vertical text support in Sonoma, while Vision still doesn't support it. In fact, in Ventura Vision tried to read vertical text horizontally, and in Sonoma it returns an empty array for the results.

This is the working VisionKit code:

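import Cocoa
import objc
from PyObjCTools.AppHelper import runConsoleEventLoop, stopEventLoop

ns_image = Cocoa.NSImage.alloc().initWithContentsOfFile_("/Users/aurora/Downloads/Untitled.jpg")

# Loading the bundle makes the VKC* classes available in globals()
objc.loadBundle('VisionKit', globals(), '/System/Library/Frameworks/VisionKit.framework')

# 1 is the request type used throughout this thread (presumably text analysis)
req = VKCImageAnalyzerRequest.alloc().initWithImage_requestType_(ns_image, 1)
req.setLocales_(['ja', 'en'])
analyzer = VKCImageAnalyzer.alloc().init()

# Describe both block arguments so that Python callables can be passed for them
objc.registerMetaDataForSelector(
    b"VKCImageAnalyzer",
    b"processRequest:progressHandler:completionHandler:",
    {
        "arguments": {
            # progressHandler: void (^)(double progress)
            3: {
                "callable": {
                    "retval": {"type": b"v"},
                    "arguments": {
                        0: {"type": b"^v"},
                        1: {"type": b"d"},
                    },
                }
            },
            # completionHandler: void (^)(analysis, error)
            4: {
                "callable": {
                    "retval": {"type": b"v"},
                    "arguments": {
                        0: {"type": b"^v"},
                        1: {"type": b"@"},
                        2: {"type": b"@"},
                    },
                }
            },
        }
    },
)

def update(progress: float):
    pass

def process(analysis: VKCImageAnalysis, error: NSError):
    # Both arguments are declared _Nullable, so analysis may be None on failure
    try:
        if analysis is None:
            print("Analysis failed:", error)
            return
        for line in analysis.allLines():
            print(line.string())
    finally:
        # Always stop the event loop, otherwise the script never exits
        stopEventLoop()

analyzer.processRequest_progressHandler_completionHandler_(req, update, process)
# Keep the process alive until the completion handler has run
runConsoleEventLoop()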

The only drawback is that objc.loadBundle() takes a couple of seconds, but I assume not much can be done about that.
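
If part of that delay comes from PyObjC scanning the runtime for classes after the bundle is loaded (an assumption; the time may equally be spent loading the framework itself), one thing that might be worth trying is passing scan_classes=False to objc.loadBundle and fetching only the needed classes with objc.lookUpClass. A minimal sketch; load_visionkit_classes is a hypothetical helper name, and the class names are the ones used in the working code:

```python
try:
    import objc  # PyObjC, only available on macOS
except ImportError:
    objc = None

VISIONKIT_PATH = "/System/Library/Frameworks/VisionKit.framework"


def load_visionkit_classes(names):
    """Load VisionKit without a class scan and look up only `names`."""
    if objc is None:
        raise RuntimeError("PyObjC is not installed")
    # scan_classes=False skips adding classes to the globals dictionary
    objc.loadBundle("VisionKit", {}, bundle_path=VISIONKIT_PATH, scan_classes=False)
    # lookUpClass raises objc.nosuchclass_error for unknown class names
    return {name: objc.lookUpClass(name) for name in names}
```

On a Mac this would be called as load_visionkit_classes(["VKCImageAnalyzer", "VKCImageAnalyzerRequest"]). Whether it is measurably faster has not been verified here.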

The WebKit SPI header for this: https://github.com/WebKit/WebKit/blob/main/Source/WebCore/PAL/pal/spi/cocoa/VisionKitCoreSPI.h

That appears to use a private framework, see https://github.com/WebKit/WebKit/blob/7cd082919192095d0b017c6e5f7a36a47135bb8c/Source/WebCore/PAL/pal/cocoa/VisionKitCoreSoftLink.mm#L36

Exposing this through PyObjC shouldn't be too hard, but I don't know yet if I'll do so because I don't like exporting private APIs (mostly because those might break between releases of the OS).

The Swift interface for the framework also doesn't look too complicated; with some luck it is possible to expose that to Python. But as said, this does require some engineering because PyObjC currently doesn't interface with Swift frameworks. I don't know when I'll get around to this.

ronaldoussoren / pyobjc

Add VisionKit bindings #592