ronaldoussoren / pyobjc

The Python <-> Objective-C Bridge with bindings for macOS frameworks
https://pyobjc.readthedocs.io

Metal library: fail to serialize on new Macos Sonoma 14.1 #580

Open ductm104 opened 9 months ago

ductm104 commented 9 months ago

Describe the bug

I'm using a MacBook Air M1 (base 8 GB) with macOS Sonoma 14.1.1 and Python 3.11 (conda). I have been trying to compile a Metal library from source and serialize it for later use, but I get this error: MTLLibrary is not formatted as a MetalLib file.

To Reproduce

import Metal, Cocoa, libdispatch
prg = """ 
#include <metal_stdlib>
using namespace metal;
kernel void E_(device float* data,
               uint3 gid [[threadgroup_position_in_grid]],
               uint3 lid [[thread_position_in_threadgroup]])
{
    *(data) = 10;
}
"""
device = Metal.MTLCreateSystemDefaultDevice()
options = Metal.MTLCompileOptions.new()
lib, err = device.newLibraryWithSource_options_error_(prg, options, None)
print(lib)
print(lib.functionNames())
ret = lib.serializeToURL_error_('./mylib.metallib', None)
print(ret)

Expected behavior

<_MTLLibrary: 0x600003b9af80>
    label = <none> 
    device = <AGXG13GDevice: 0x11e819e00>
        name = Apple M1 
    functionNames: E_
(
    "E_"
)
(False, Error Domain=MTLLibraryErrorDomain Code=1 "MTLLibrary is not formatted as a MetalLib file." UserInfo={NSLocalizedDescription=MTLLibrary is not formatted as a MetalLib file.})

Additional context

The same program compiles successfully with the Xcode command-line tools:

xcrun -sdk macosx metal -x metal -c src.c -o lib.air
xcrun -sdk macosx metallib -o lib.metallib lib.air

It seems that the new macOS version on the MacBook Air causes the serialized library to have a different format from the one Xcode produces: I can load an Xcode-compiled metallib, but not one serialized through PyObjC's Metal bindings.
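For anyone who wants to script the two xcrun steps above from Python, here is a minimal sketch. The helper names `metallib_build_commands` and `build_metallib` are made up for illustration; the command lines are copied from the commands above, and actually running them assumes Xcode is installed.

```python
import subprocess


def metallib_build_commands(source, air="lib.air", metallib="lib.metallib"):
    """Return the two xcrun command lines from this issue as argument lists.

    Hypothetical helper: it only builds the commands, it does not run them.
    """
    return [
        # Step 1: compile the Metal source into an AIR object file.
        ["xcrun", "-sdk", "macosx", "metal", "-x", "metal", "-c", source, "-o", air],
        # Step 2: package the AIR file into a .metallib.
        ["xcrun", "-sdk", "macosx", "metallib", "-o", metallib, air],
    ]


def build_metallib(source, air="lib.air", metallib="lib.metallib"):
    """Run both steps and stop at the first failing command."""
    for command in metallib_build_commands(source, air, metallib):
        subprocess.run(command, check=True)
    return metallib
```

Calling `build_metallib("src.c")` should leave `lib.metallib` in the current directory if both commands succeed.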

The code below ran successfully with the 'lib.metallib' built above:

fname = 'lib.metallib'
xlib, _ = device.newLibraryWithURL_error_(fname, None)
content = xlib.libraryDataContents().bytes().tobytes()
data = libdispatch.dispatch_data_create(content, len(content), None, None)
newlib, _ = device.newLibraryWithData_error_(data, None)
fxn = newlib.newFunctionWithName_('E_')
print(fxn)

The expected output should be:

<_MTLFunctionInternal: 0x14463f1b0>
    name = E_ 
    device = <AGXG13GDevice: 0x124808a00>
        name = Apple M1 
    functionType = MTLFunctionTypeKernel 
    attributes: <none>

tomtom-95 commented 9 months ago

I have the same problem. @ductm104 Did you find a solution?

tyoc213 commented 8 months ago

@ductm104 how did you fix it? Or did you end up using

xcrun -sdk macosx metal -x metal -c src.c -o lib.air
xcrun -sdk macosx metallib -o lib.metallib lib.air

???

ductm104 commented 8 months ago

@tyoc213 I didn't. The issue hasn't been fixed.

ronaldoussoren commented 7 months ago

I don't get these errors with the code below, which combines the code above, adds more printing, and uses URLs where necessary.

import Metal, Cocoa, libdispatch
prg = """
#include <metal_stdlib>
using namespace metal;
kernel void E_(device float* data,
               uint3 gid [[threadgroup_position_in_grid]],
               uint3 lid [[thread_position_in_threadgroup]])
{
    *(data) = 10;
}
"""
device = Metal.MTLCreateSystemDefaultDevice()
options = Metal.MTLCompileOptions.new()
lib, err = device.newLibraryWithSource_options_error_(prg, options, None)
print(f"newLibrary {lib=} {err=}")
print(lib.functionNames())
ret, error = lib.serializeToURL_error_(Cocoa.NSURL.fileURLWithPath_('./mylib.metallib'), None)
print(f"serialize {ret=} {error=}")

fname = Cocoa.NSURL.fileURLWithPath_('mylib.metallib')
xlib, error = device.newLibraryWithURL_error_(fname, None)
print(f"newLibrary {xlib=}, {error=}")
content = xlib.libraryDataContents().bytes().tobytes()
data = libdispatch.dispatch_data_create(content, len(content), None, None)
newlib, error = device.newLibraryWithData_error_(data, None)
print(f"newWithData {newlib=}, {error=}")
fxn = newlib.newFunctionWithName_('E_')
print(f"{fxn=}")

Running this prints:

newLibrary lib=<_MTLLibrary: 0x600001eaa1c0>
    label = <none> 
    device = <AGXG13GDevice: 0x14b040000>
        name = Apple M1 
    functionNames: E_ err=None
(
    "E_"
)
serialize ret=True error=None
newLibrary xlib=<_MTLLibrary: 0x600001eaa540>
    label = <none> 
    device = <AGXG13GDevice: 0x14b040000>
        name = Apple M1 
    functionNames: E_, error=None
newWithData newlib=<_MTLLibrary: 0x600001eaa600>
    label = <none> 
    device = <AGXG13GDevice: 0x14b040000>
        name = Apple M1 
    functionNames: E_, error=None
fxn=<_MTLFunctionInternal: 0x14b80b9a0>
    name = E_ 
    device = <AGXG13GDevice: 0x14b040000>
        name = Apple M1 
    functionType = MTLFunctionTypeKernel 
    attributes: <none>

I don't use Metal myself, if this isn't expected behaviour I need some more information on what the expected behaviour is.

Output is on an M1 system running macOS 14.3.1 (23D60), using Python 3.12 and PyObjC from the repo (the latter shouldn't matter, there have been some updates in the repo but none that are relevant for this issue).
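The script above calls methods on each result without checking it, so if one step fails the problem only surfaces later as an AttributeError on None. A minimal sketch of a guard, assuming the (result, error) return shape seen in the output above; `require` is a hypothetical helper, not part of PyObjC:

```python
def require(result, error, step):
    """Return `result`, or raise if the call reported a failure.

    Hypothetical helper (not a PyObjC API). It treats None (no object
    returned) and False (a BOOL method reporting failure) as errors.
    """
    if result is None or result is False:
        raise RuntimeError(f"{step} failed: {error}")
    return result


# Intended use with the calls from the script above (needs PyObjC and Metal):
#   lib = require(*device.newLibraryWithSource_options_error_(prg, options, None),
#                 "compile")
#   require(*lib.serializeToURL_error_(url, None), "serialize")
```

With this, a failed serialize stops the script at that step with the NSError in the message, instead of continuing to the load step.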

pushpendre commented 7 months ago

Weirdly enough, when I run this same code I get the error "MTLLibrary is not formatted as a MetalLib file":

(tg) [20:12:22]Mac:~/w/(02-22_06:16)
$ python -c 'import objc; print(objc.__version__)'
10.1
(tg) [20:13:26]Mac:~/w/(02-22_06:16)
$ sw_vers
ProductName:            macOS
ProductVersion:         14.3.1
BuildVersion:           23D60
(tg) [20:10:29]Mac:~/w/(02-22_06:16)
$ python ../tg/try_tensor.py 
newLibrary lib=<_MTLLibrary: 0x12f646b10>
    label = <none> 
    device = <AGXG13XDevice: 0x12f821a00>
        name = Apple M1 Pro 
    functionNames: E_ err=None
(
    "E_"
)
serialize ret=False error=Error Domain=MTLLibraryErrorDomain Code=1 "MTLLibrary is not formatted as a MetalLib file." UserInfo={NSLocalizedDescription=MTLLibrary is not formatted as a MetalLib file.}
newLibrary xlib=None, error=Error Domain=MTLLibraryErrorDomain Code=6 "library not found" UserInfo={NSLocalizedDescription=library not found}
Traceback (most recent call last):
  File "/Users/pushpen/w/tinygrad/../tg/try_tensor.py", line 23, in <module>
    content = xlib.libraryDataContents().bytes().tobytes()
              ^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'libraryDataContents'

Leikoe commented 6 months ago

this bug is related to conda, it only happens when you use conda's python. though I have no idea why. @ronaldoussoren if you want to reproduce it.

pushpendre commented 6 months ago

Can confirm. I had to make some more changes to get this to work, but the bug disappeared with the system python3. Specifically, here are the steps:

  1. Install Xcode. If only the command line tools are installed, we might get an error while running xcrun.

  2. Confirm that xcrun can compile the source:

    cat > src.c <<EOF
    #include <metal_stdlib>
    using namespace metal;
    kernel void E_(device float* data,
               uint3 gid [[threadgroup_position_in_grid]],
               uint3 lid [[thread_position_in_threadgroup]])
    {
    *(data) = 10;
    }
    EOF
    xcrun -sdk macosx metal -x metal -c src.c -o lib.air && xcrun -sdk macosx metallib -o lib.metallib lib.air

  3. If the above is successful, make sure pyobjc is installed and the following commands run:

/usr/bin/python3  -m pip install pyobjc
/usr/bin/python3  -m pip install pyobjc-framework-Metal pyobjc-framework-libdispatch
/usr/bin/python3  -c 'from Foundation import NSURL'

  4. If we pass a raw string where an NSURL is expected (as in the original serializeToURL_error_ call), we get the error ValueError: NSInvalidArgumentException - -[OC_BuiltinPythonUnicode isFileURL]: unrecognized selector sent to instance 0x600002181e80. So instead of passing a string, pass an NSURL. I changed the code as follows:

cat > try_tensor.py <<EOF
import Metal, Cocoa, libdispatch
from Foundation import NSURL

prg = """ 
#include <metal_stdlib>
using namespace metal;
kernel void E_(device float* data,
               uint3 gid [[threadgroup_position_in_grid]],
               uint3 lid [[thread_position_in_threadgroup]])
{
    *(data) = 10;
}
"""
device = Metal.MTLCreateSystemDefaultDevice()
options = Metal.MTLCompileOptions.new()
lib, err = device.newLibraryWithSource_options_error_(prg, options, None)
print(lib)
print(lib.functionNames())
file_url = NSURL.fileURLWithPath_('./mylib.metallib')
ret = lib.serializeToURL_error_(file_url, None)
print(ret)
EOF
/usr/bin/python3 try_tensor.py 
head -c 4 mylib.metallib 

The above shows a compiled file with the right MTLB header.
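The same check as the `head -c 4` command can be done from Python. This is a minimal sketch that only looks at the first four bytes; it assumes, as above, that a valid file starts with MTLB, and `has_metallib_magic` is a hypothetical helper name.

```python
from pathlib import Path

# The 4-byte header mentioned above; only this prefix is checked.
METALLIB_MAGIC = b"MTLB"


def has_metallib_magic(path):
    """Return True if the file at `path` starts with the MTLB magic bytes."""
    with open(path, "rb") as fh:
        return fh.read(len(METALLIB_MAGIC)) == METALLIB_MAGIC


if __name__ == "__main__":
    target = Path("mylib.metallib")
    if target.exists():
        print(target, "starts with MTLB:", has_metallib_magic(target))
    else:
        print(target, "does not exist")
```

Running it in both the conda and the system Python setups makes it easy to compare the two serialized files.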

ronaldoussoren commented 6 months ago

this bug is related to conda, it only happens when you use conda's python. though I have no idea why. @ronaldoussoren if you want to reproduce it.

Do you also use the version of PyObjC packaged by Conda?

Leikoe commented 6 months ago

this bug is related to conda, it only happens when you use conda's python. though I have no idea why. @ronaldoussoren if you want to reproduce it.

Do you also use the version of PyObjC packaged by Conda?

No, I used the pip one. I even tried manually replacing it in site-packages with the pip build from a normal Python, and it still gives the same problem.

Leikoe commented 6 months ago

You can track my debugging work at https://github.com/tinygrad/tinygrad/issues/2226

nullhook commented 6 months ago

The Metal driver and PyObjC don't seem to be the issue here, as the output is correct in non-virtual environments.

The main question is: why are the header and footer of the .metallib binary data swallowed in a conda environment?
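When comparing reports, it may help to record which interpreter produced each result. A minimal sketch using only the standard library; it assumes the usual conda layout, where an environment has a `conda-meta` directory under its prefix, and both function names are hypothetical.

```python
import os
import sys


def is_conda_prefix(prefix):
    """Heuristic: conda environments keep a `conda-meta` directory in the prefix."""
    return os.path.isdir(os.path.join(prefix, "conda-meta"))


def describe_interpreter():
    """Collect details worth pasting next to a bug report."""
    return {
        "executable": sys.executable,
        "prefix": sys.prefix,
        "version": sys.version.split()[0],
        "conda": is_conda_prefix(sys.prefix),
        # Set by `conda activate`; None when no conda environment is active.
        "CONDA_PREFIX": os.environ.get("CONDA_PREFIX"),
    }


if __name__ == "__main__":
    for key, value in describe_interpreter().items():
        print(f"{key}: {value}")
```

This only identifies the interpreter; it does not explain why the serialized data differs between the two.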