philipturner / MoltenCL

The missing OpenCL 3.0 driver for macOS
MIT License

Native FP64 support on AMD GPUs #1

Open philipturner opened 2 years ago

philipturner commented 2 years ago

On an OpenMM thread, I talked with some people about how AMD GPUs support cl_khr_fp64 with Apple's driver. I did not know whether that driver compiles OpenCL C -> AIR -> AMDGPU; if it does, that's good news for MoltenCL. I don't have an AMD-powered Mac, but someone with such a machine could test the theory. Use the following code for this exercise:

```opencl
#if USE_DOUBLE_PRECISION
typedef double FLOAT_TYPE;
#else
typedef float FLOAT_TYPE;
#endif

__kernel void vecAdd(  __global FLOAT_TYPE *a,
                       __global FLOAT_TYPE *b,
                       __global FLOAT_TYPE *c,
                       const unsigned int n)
{
    // Get our global thread ID
    int id = 5; // get_global_id(0);

    // Make sure we do not go out of bounds
    if (id < n)
        c[id] = a[id] + b[id];
}
```

In a new directory, create a file called vecAdd.cl and paste the source code into it. Run the following commands. If the last two proceed without error, zip both vecAdd.air and vecAdd.metallib and attach them to a GitHub comment. Then I can investigate further.

```
xcrun metal -x cl -c vecAdd.cl -DUSE_DOUBLE_PRECISION=0
xcrun metallib vecAdd.air -o vecAdd.metallib
xcrun metal -x cl -c vecAdd.cl -DUSE_DOUBLE_PRECISION=1
xcrun metallib vecAdd.air -o vecAdd.metallib
```

cc: @theschles

theschles commented 2 years ago

Hi @philipturner

I wasn't sure whether it was OK to run the above on macOS 12.6 or not, so I ran it anyway. If it needs macOS 13, I'll probably give it a week after launch...and then, as long as Macs aren't still melting down around the world at that point, I'll upgrade.

So here's what my macOS 12.6 machine did with the code above:

MoltenCL/Foo on  main [?] 
➜ xcrun metal -x cl -c vecAdd.cl -DUSE_DOUBLE_PRECISION=0

MoltenCL/Foo on  main [?] took 10s 
➜ xcrun metallib vecAdd.air -o vecAdd.metallib

MoltenCL/Foo on  main [?] took 2s 
➜ xcrun metal -x cl -c vecAdd.cl -DUSE_DOUBLE_PRECISION=1
vecAdd.cl:2:9: error: use of type 'double' requires cl_khr_fp64 extension to be enabled
typedef double FLOAT_TYPE;
        ^
vecAdd.cl:7:33: error: use of type 'FLOAT_TYPE' (aka 'double') requires cl_khr_fp64 extension to be enabled
__kernel void vecAdd(  __global FLOAT_TYPE *a,
                                ^
vecAdd.cl:8:33: error: use of type 'FLOAT_TYPE' (aka 'double') requires cl_khr_fp64 extension to be enabled
                       __global FLOAT_TYPE *b,
                                ^
vecAdd.cl:9:33: error: use of type 'FLOAT_TYPE' (aka 'double') requires cl_khr_fp64 extension to be enabled
                       __global FLOAT_TYPE *c,
                                ^
4 errors generated.

MoltenCL/Foo on  main [?] 
➜ xcrun metallib vecAdd.air -o vecAdd.metallib
LLVM ERROR: Error opening 'vecAdd.air': No such file or directory!

MoltenCL/Foo on  main [?] 
➜ 
philipturner commented 2 years ago

Being on macOS 12.6 isn't an issue here; macOS 13 mostly matters later on, for GPU virtual addresses. No pressure to upgrade the OS.

It looks like we can't compile FP64 directly to AIR with the Metal command-line tools. However, there's still a chance we could modify an AIR file to manually utilize FP64. I'll get back to you once I've written the OpenCL SPIR-V -> AIR transpiler; then we can test whether the AIR -> AMDGPU backend supports FP64.

theschles commented 2 years ago

Complete n00b here writing...

A bit of googling on the error message came up with the following:

> To do so in the kernel code, one would normally add a line
> "#pragma OPENCL EXTENSION cl_khr_fp64 : enable"

If I put that at the top of the code, it throws no error when I enable double precision on the command line.

Again, complete n00b here. Why no explicit call to enable the cl_khr_fp64 extension?

UPDATE: I also found this snippet, which explicitly checks for the Khronos cl_khr_fp64 extension and falls back to an AMD FP64 extension:

```opencl
#ifdef cl_khr_fp64
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
#elif defined(cl_amd_fp64)
#pragma OPENCL EXTENSION cl_amd_fp64 : enable
#else
#error "Double precision floating point not supported by OpenCL implementation."
#endif
```

Again, no error thrown if I enable double precision on the command line like you had me do.

philipturner commented 2 years ago

No way! I'll have a mechanism soon to test that .metallib in the Metal runtime. If this works, we can access native AMD double precision from MoltenCL.

philipturner commented 2 years ago

The tests are ready. Overwrite the vecAdd.cl file with the code below, and create two new files, vecAdd2.metal and Test.swift (also below). Then run the following commands and report back the results.

```
xcrun metal -c vecAdd2.metal
xcrun metallib vecAdd2.air -o vecAdd2.metallib
swift Test.swift vecAdd2.metallib
# should show 3 and 0

xcrun metal -x cl -c vecAdd.cl -DUSE_DOUBLE_PRECISION=0
xcrun metallib vecAdd.air -o vecAdd.metallib
swift Test.swift vecAdd.metallib
# should show 3 and 0

xcrun metal -x cl -c vecAdd.cl -DUSE_DOUBLE_PRECISION=1
xcrun metallib vecAdd.air -o vecAdd.metallib
swift Test.swift vecAdd.metallib
# fails on an Apple M1 Max, meaning FP64 not supported
# hopefully it works on AMD; if not, double-check that
# the shown device is an AMD GPU
```
vecAdd.cl:

```opencl
#if USE_DOUBLE_PRECISION
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
typedef double FLOAT_TYPE;
#else
typedef float FLOAT_TYPE;
#endif

__kernel void vecAdd(  __global FLOAT_TYPE * restrict a,
                       __global FLOAT_TYPE * restrict b,
                       __global FLOAT_TYPE * restrict c,
                       __constant unsigned int * restrict n)
{
    // Get our global thread ID
    unsigned int id = 5; // get_global_id(0);

    // Make sure we do not go out of bounds
    if (id < n[0])
        c[id] = a[id] + b[id];
}
```
vecAdd2.metal:

```metal
kernel void vecAdd(device float *a,
                   device float *b,
                   device float *c,
                   constant uint *n)
{
    // Get our global thread ID
    uint id = 5; // get_global_id(0);

    // Make sure we do not go out of bounds
    if (id < n[0])
        c[id] = a[id] + b[id];
}
```
Test.swift:

```swift
import Metal

guard CommandLine.arguments.count >= 2 else {
  print("Need to specify '.metallib' file.")
  exit(-1)
}

let device = MTLCreateSystemDefaultDevice()!
print("Device:", device.name)

let currentPath = FileManager.default.currentDirectoryPath
let libraryPath = currentPath + "/" + CommandLine.arguments[1]
let libraryURL = URL(fileURLWithPath: libraryPath)

var library: MTLLibrary
do {
  library = try device.makeLibrary(URL: libraryURL)
} catch {
  print("Failed to load library file:", error.localizedDescription)
  exit(-1)
}

guard let functionName = library.functionNames.first,
      let function = library.makeFunction(name: functionName) else {
  let name = library.functionNames.first ?? "nil"
  print("Could not load Metal function '\(name)'.")
  exit(-1)
}

var pipeline: MTLComputePipelineState!
let semaphore = DispatchSemaphore(value: 0)
device.makeComputePipelineState(
  function: function,
  options: [.argumentInfo, .bufferTypeInfo],
  completionHandler: { _pipeline, reflection, error in
    if let error = error {
      print("Failed to generate pipeline:", error.localizedDescription)
      exit(-1)
    }
    // print("Reflection:")
    // print(reflection)
    pipeline = _pipeline
    semaphore.signal()
  })
semaphore.wait()

let bufferA = device.makeBuffer(length: 1024, options: .storageModeShared)!
let bufferB = device.makeBuffer(length: 1024, options: .storageModeShared)!
let bufferC = device.makeBuffer(length: 1024, options: .storageModeShared)!
let bufferN = device.makeBuffer(length: 1024, options: .storageModeShared)!

for type in [Float.self, Double.self] as [Any.Type] {
  let commandQueue = device.makeCommandQueue()!
  let commandBuffer = commandQueue.makeCommandBuffer()!
  let computeEncoder = commandBuffer.makeComputeCommandEncoder()!

  let id: Int = 5
  if type == Float.self {
    print("Testing Float")
    bufferA.contents().assumingMemoryBound(to: Float.self)[id] = 1
    bufferB.contents().assumingMemoryBound(to: Float.self)[id] = 2
  } else if type == Double.self {
    print("Testing Double")
    bufferA.contents().assumingMemoryBound(to: Double.self)[id] = 1
    bufferB.contents().assumingMemoryBound(to: Double.self)[id] = 2
  }
  bufferN.contents().assumingMemoryBound(to: UInt32.self)[0] = 90

  computeEncoder.setComputePipelineState(pipeline)
  computeEncoder.setBuffer(bufferA, offset: 0, index: 0)
  computeEncoder.setBuffer(bufferB, offset: 0, index: 1)
  computeEncoder.setBuffer(bufferC, offset: 0, index: 2)
  computeEncoder.setBuffer(bufferN, offset: 0, index: 3)
  computeEncoder.dispatchThreads(
    MTLSizeMake(1, 1, 1), threadsPerThreadgroup: MTLSizeMake(1, 1, 1))
  computeEncoder.endEncoding()

  commandBuffer.commit()
  commandBuffer.waitUntilCompleted()

  if type == Float.self {
    print("Result:", bufferC.contents().assumingMemoryBound(to: Float.self)[id])
  } else if type == Double.self {
    print("Result:", bufferC.contents().assumingMemoryBound(to: Double.self)[id])
  }
}
```
philipturner commented 2 years ago

After running some calculations on Apple silicon FP64 emulation, multiplication looks 32-48x slower than native FP32 (27x when only counting the mantissa). If AMD has similar throughput for integer multiply instructions, that won't be bad compared to native FP64 at a 1:16 rate. Considering that 1/50 of ops are FP64 in OpenMM mixed precision, and FP64 emulation is 40x slower than FP32:
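Filling in the arithmetic behind that estimate (the 1/50 fraction and the 40x factor are the assumptions stated above, not measurements):

```swift
// Estimated overall slowdown when a fraction `f` of operations runs `s` times
// slower than native FP32 and the rest run at full speed.
let f = 1.0 / 50.0  // fraction of FP64 ops in OpenMM mixed precision
let s = 40.0        // assumed cost of one emulated FP64 op, in FP32 ops
let overallSlowdown = (1 - f) + f * s
print(overallSlowdown)  // 1.78 -> roughly the "2x slower" figure quoted below
```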

theschles commented 2 years ago
Results of code execution on MacBook Pro 2017 w/ Radeon 560
MoltenCL/Foo on  main [?] 
➜ xcrun metal -c vecAdd2.metal
➜ xcrun metallib vecAdd2.air -o vecAdd2.metallib
➜ swift Test.swift vecAdd2.metallib             
Device: AMD Radeon Pro 560
Testing Float
Result: 3.0
Testing Double
Result: 0.0
`# should show 3 and 0`

➜ xcrun metal -x cl -c vecAdd.cl -DUSE_DOUBLE_PRECISION=0
➜ xcrun metallib vecAdd.air -o vecAdd.metallib         
➜ swift Test.swift vecAdd.metallib                      
Device: AMD Radeon Pro 560
Testing Float
Result: 3.0
Testing Double
Result: 0.0
`# should show 3 and 0`

➜ xcrun metal -x cl -c vecAdd.cl -DUSE_DOUBLE_PRECISION=1
➜ xcrun metallib vecAdd.air -o vecAdd.metallib           
➜ swift Test.swift vecAdd.metallib                       
Device: AMD Radeon Pro 560
Testing Float
Result: 0.0
Testing Double
Result: 2.0000000000654836
`# fails on an Apple M1 Max, meaning FP64 not supported`
`# hopefully it works on AMD; if not, double-check that`
`# the shown device is an AMD GPU`
philipturner commented 2 years ago

That's awesome! We can access FP64 through the AIR -> AMDGPU backend. However, the results are incorrect - the answer should be 3, not 2.000000...65. Here's some preliminary investigation:

> swift repl
1> let mydouble: Double = 2.0000000000654836
mydouble: Double = 2.0000000000654836
2> print(UnsafeRawPointer(bitPattern: Int(mydouble.bitPattern)))
Optional(0x4000000000024000)
3> print(UnsafeRawPointer(bitPattern: Int(Double(2).bitPattern))) 
Optional(0x4000000000000000)
4> print(UnsafeRawPointer(bitPattern: Int(Double(1).bitPattern))) 
Optional(0x3ff0000000000000)
5> print(UnsafeRawPointer(bitPattern: Int(Double(3).bitPattern))) 
Optional(0x4008000000000000)
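To make those bit patterns easier to compare, here is a small sketch that splits a Double into its IEEE 754 fields (the fields helper is an illustrative addition, not code from the thread):

```swift
// Splits an IEEE 754 double into sign, exponent, and mantissa fields.
func fields(_ x: Double) -> String {
  let bits = x.bitPattern
  let sign = bits >> 63
  let exponent = (bits >> 52) & 0x7FF
  let mantissa = bits & 0xF_FFFF_FFFF_FFFF
  return "sign \(sign), exponent 0x\(String(exponent, radix: 16)), "
    + "mantissa 0x\(String(mantissa, radix: 16))"
}

print(fields(2.0000000000654836)) // sign 0, exponent 0x400, mantissa 0x24000
print(fields(2.0))                // sign 0, exponent 0x400, mantissa 0x0
print(fields(3.0))                // sign 0, exponent 0x400, mantissa 0x8000000000000
```

The bad result keeps 2.0's exponent and picks up garbage in the low mantissa bits, rather than setting 3.0's leading mantissa bit; that matches an addition that never actually executed.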
philipturner commented 2 years ago

Try this:

```
xcrun metal -x cl -c vecAdd.cl -DUSE_DOUBLE_PRECISION=0
xcrun metallib vecAdd.air -o vecAdd.metallib
swift Test.swift vecAdd.metallib

xcrun metal -x cl -c vecAdd.cl -DUSE_DOUBLE_PRECISION=1
xcrun metallib vecAdd.air -o vecAdd.metallib
swift Test.swift vecAdd.metallib

xcrun metal -x cl -c vecAdd.cl -DUSE_DOUBLE_PRECISION=1
xcrun metallib vecAdd.air -o vecAdd.metallib
swift Test.swift vecAdd.metallib

xcrun metal -x cl -c vecAdd.cl -DUSE_DOUBLE_PRECISION=1
xcrun metallib vecAdd.air -o vecAdd.metallib
swift Test.swift vecAdd.metallib
```
vecAdd.cl:

```opencl
#if USE_DOUBLE_PRECISION
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
typedef double FLOAT_TYPE;
#else
typedef float FLOAT_TYPE;
#endif

__kernel void vecAdd(  __global FLOAT_TYPE * restrict a,
                       __global FLOAT_TYPE * restrict b,
                       __global FLOAT_TYPE * restrict c,
                       __constant unsigned int * restrict n)
{
    // Get our global thread ID
    unsigned int id = 5; // get_global_id(0);

    // Make sure we do not go out of bounds
    if (id < n[0]) {
        c[0] = a[0] + b[0];
        c[1] = a[1] + b[1];
        c[2] = a[2] + b[2];
        c[3] = a[3] + b[3];
        c[4] = a[4] + b[4];
    }
}
```
Test.swift:

```swift
import Metal

guard CommandLine.arguments.count >= 2 else {
  print("Need to specify '.metallib' file.")
  exit(-1)
}

// NOTE: as posted, this line ended with a stray 'e' (fixed later in the thread):
// let device = MTLCopyAllDevices().first(where: { !$0.isLowPower })!e
let device = MTLCopyAllDevices().first(where: { !$0.isLowPower })!
print("Device:", device.name)

let currentPath = FileManager.default.currentDirectoryPath
let libraryPath = currentPath + "/" + CommandLine.arguments[1]
let libraryURL = URL(fileURLWithPath: libraryPath)

var library: MTLLibrary
do {
  library = try device.makeLibrary(URL: libraryURL)
} catch {
  print("Failed to load library file:", error.localizedDescription)
  exit(-1)
}

guard let functionName = library.functionNames.first,
      let function = library.makeFunction(name: functionName) else {
  let name = library.functionNames.first ?? "nil"
  print("Could not load Metal function '\(name)'.")
  exit(-1)
}

var pipeline: MTLComputePipelineState!
let semaphore = DispatchSemaphore(value: 0)
device.makeComputePipelineState(
  function: function,
  options: [.argumentInfo, .bufferTypeInfo],
  completionHandler: { _pipeline, reflection, error in
    if let error = error {
      print("Failed to generate pipeline:", error.localizedDescription)
      exit(-1)
    }
    // print("Reflection:")
    // print(reflection)
    pipeline = _pipeline
    semaphore.signal()
  })
semaphore.wait()

let bufferA = device.makeBuffer(length: 1024, options: .storageModeShared)!
let bufferB = device.makeBuffer(length: 1024, options: .storageModeShared)!
let bufferC = device.makeBuffer(length: 1024, options: .storageModeShared)!
let bufferN = device.makeBuffer(length: 1024, options: .storageModeShared)!

for type in [Float.self, Double.self] as [Any.Type] {
  let commandQueue = device.makeCommandQueue()!
  let commandBuffer = commandQueue.makeCommandBuffer()!
  let computeEncoder = commandBuffer.makeComputeCommandEncoder()!

  if type == Float.self {
    print("Testing Float")
    for i in 0..<5 {
      bufferA.contents().assumingMemoryBound(to: Float.self)[i] = Float(i) + 1
      bufferB.contents().assumingMemoryBound(to: Float.self)[i] = Float(i) + 2
    }
  } else if type == Double.self {
    print("Testing Double")
    for i in 0..<5 {
      bufferA.contents().assumingMemoryBound(to: Double.self)[i] = Double(i) + 1
      bufferB.contents().assumingMemoryBound(to: Double.self)[i] = Double(i) + 2
    }
  }
  bufferN.contents().assumingMemoryBound(to: UInt32.self)[0] = 90

  computeEncoder.setComputePipelineState(pipeline)
  computeEncoder.setBuffer(bufferA, offset: 0, index: 0)
  computeEncoder.setBuffer(bufferB, offset: 0, index: 1)
  computeEncoder.setBuffer(bufferC, offset: 0, index: 2)
  computeEncoder.setBuffer(bufferN, offset: 0, index: 3)
  computeEncoder.dispatchThreads(
    MTLSizeMake(1, 1, 1), threadsPerThreadgroup: MTLSizeMake(1, 1, 1))
  computeEncoder.endEncoding()

  commandBuffer.commit()
  commandBuffer.waitUntilCompleted()

  if type == Float.self {
    let output = (0..<5).map { i in
      bufferC.contents().assumingMemoryBound(to: Float.self)[i]
    }
    print("Result:", output)
  } else if type == Double.self {
    let output = (0..<5).map { i in
      bufferC.contents().assumingMemoryBound(to: Double.self)[i]
    }
    print("Result:", output)
  }
}
```

P.S. Stage Manager in macOS Ventura is extremely nice! Also, Ventura changed how you acquire the default Metal device in Swift scripts: you can't use MTLCreateSystemDefaultDevice() anymore; instead you must use MTLCopyAllDevices().

The script should still work on Monterey.

theschles commented 2 years ago

Sorry had work stuff that took priority

I've also updated to MacOS Ventura

Will tackle the above shortly

theschles commented 2 years ago

On macOS 13 Ventura, on a MacBook Pro with AMD:

MoltenCL/Foo on  main [⇣?] 
➜ swift repl
Welcome to Apple Swift version 5.7 (swiftlang-5.7.0.127.4 clang-1400.0.29.50).
Type :help for assistance.
  1> let mydouble: Double = 2.0000000000654836
mydouble: Double = 2.0000000000654836
  2> print(UnsafeRawPointer(bitPattern: Int(mydouble.bitPattern)))
Optional(0x4000000000024000)
  3> print(UnsafeRawPointer(bitPattern: Int(Double(2).bitPattern))) 
Optional(0x4000000000000000)
  4> print(UnsafeRawPointer(bitPattern: Int(Double(1).bitPattern))) 
Optional(0x3ff0000000000000)
  5> print(UnsafeRawPointer(bitPattern: Int(Double(3).bitPattern))) 
Optional(0x4008000000000000)
  6>  
theschles commented 2 years ago

Hi @philipturner, something's not working with the code you gave me in https://github.com/philipturner/MoltenCL/issues/1#issuecomment-1291902685

MoltenCL/Foo on  main [⇣?] 
➜ xcrun metal -x cl -c vecAdd.cl -DUSE_DOUBLE_PRECISION=0

MoltenCL/Foo on  main [⇣?] 
➜ xcrun metallib vecAdd.air -o vecAdd.metallib         

MoltenCL/Foo on  main [⇣?] 
➜ swift Test.swift vecAdd.metallib
Test.swift:8:67: error: consecutive statements on a line must be separated by ';'
let device = MTLCopyAllDevices().first(where: { !$0.isLowPower })!e
                                                                  ^
                                                                  ;
Test.swift:8:67: error: cannot find 'e' in scope
let device = MTLCopyAllDevices().first(where: { !$0.isLowPower })!e
                                                                  ^
philipturner commented 2 years ago

Replace the line with:

let device = MTLCopyAllDevices().first(where: { !$0.isLowPower })!

macOS Ventura disabled fetching the GPU through MTLCreateSystemDefaultDevice() in command-line scripts; the !$0.isLowPower filter skips the integrated GPU, so the discrete AMD GPU gets selected. You also have an extraneous e at the end of the line, so remove that.

philipturner commented 2 years ago

My bad - I had a typo with that e.

theschles commented 2 years ago
MoltenCL/Foo on  main [⇣?] 
➜ xcrun metal -x cl -c vecAdd.cl -DUSE_DOUBLE_PRECISION=0

MoltenCL/Foo on  main [⇣?] 
➜ xcrun metallib vecAdd.air -o vecAdd.metallib           

MoltenCL/Foo on  main [⇣?] 
➜ swift Test.swift vecAdd.metallib                       
Device: AMD Radeon Pro 560
Testing Float
Result: [3.0, 5.0, 7.0, 9.0, 11.0]
Testing Double
Result: [384.0, 640.0, 0.0, 0.0, 0.0]

MoltenCL/Foo on  main [⇣?] 
➜ xcrun metal -x cl -c vecAdd.cl -DUSE_DOUBLE_PRECISION=1

MoltenCL/Foo on  main [⇣?] 
➜ xcrun metallib vecAdd.air -o vecAdd.metallib         

MoltenCL/Foo on  main [⇣?] 
➜ swift Test.swift vecAdd.metallib
Device: AMD Radeon Pro 560
Testing Float
Result: [2.0663e-40, 3.0, 3.0, 4.0, 5.0]
Testing Double
Result: [2.0000000000654836, 2.0, 3.0, 5.0, 5.0]

MoltenCL/Foo on  main [⇣?] 
➜ xcrun metal -x cl -c vecAdd.cl -DUSE_DOUBLE_PRECISION=1

MoltenCL/Foo on  main [⇣?] 
➜ xcrun metallib vecAdd.air -o vecAdd.metallib         

MoltenCL/Foo on  main [⇣?] 
➜ swift Test.swift vecAdd.metallib
Device: AMD Radeon Pro 560
Testing Float
Result: [2.0663e-40, 3.0, 3.0, 4.0, 5.0]
Testing Double
Result: [2.0000000000654836, 2.0, 3.0, 5.0, 5.0]

MoltenCL/Foo on  main [⇣?] 
➜ xcrun metal -x cl -c vecAdd.cl -DUSE_DOUBLE_PRECISION=1

MoltenCL/Foo on  main [⇣?] 
➜ xcrun metallib vecAdd.air -o vecAdd.metallib         

MoltenCL/Foo on  main [⇣?] 
➜ swift Test.swift vecAdd.metallib
Device: AMD Radeon Pro 560
Testing Float
Result: [2.0663e-40, 3.0, 3.0, 4.0, 5.0]
Testing Double
Result: [2.0000000000654836, 2.0, 3.0, 5.0, 5.0]
philipturner commented 2 years ago

The results are deterministic, but they will be difficult to investigate. The kernel seems to copy one operand instead of adding anything, sometimes mutating its value. I think Apple's AIR -> AMDGPU compiler was never programmed to harness FP64 on AMD GPUs. These are backend bugs at the assembly level, which I don't think I have the resources to fix.

MoltenCL's emulation might not be all that bad (2x slower), but you might be better off using Apple's OpenCL 1.2 driver. It has decent performance on AMD, permitting native FP64 and properly enqueueing commands in a cl_command_queue. Although it doesn't permit subgroup reductions, AMD's threadgroup memory is very fast. The bigger problem is Apple GPUs, where none of the above statements apply.

Perhaps you could install Boot Camp with a non-licensed copy of Windows 10. Then run clinfo and investigate the Windows OpenCL driver. It might be version 2.0 or above, and support modern features like subgroup reductions. However, testing this is time-consuming.

theschles commented 2 years ago

> Perhaps you could install Boot Camp with a non-licensed copy of Windows 10. Then run clinfo and investigate the Windows OpenCL driver. It might be version 2.0 or above, and support modern features like subgroup reductions. However, testing this is time-consuming.

I did that at the height of the pandemic; however, I'd rather not dual-boot. I paid all this money for a MacBook Pro...I'd like to use MacOS -- and when it's idle, have it try to find the next COVID-19 cure...

So where does that leave us? Are we not going to be able to get OpenMM / Folding@Home to work on macOS with an AMD GPU?

philipturner commented 2 years ago

> So where does that leave us? Are we not going to be able to get OpenMM / Folding@Home to work on macOS with an AMD GPU?

It leaves us with three options.

(1) Use Apple's current OpenCL driver for AMD, and MoltenCL for M1.

(2) Use MoltenCL for AMD, and suffer the performance drop for FP64.

(3) A hybrid approach that switches OpenCL drivers based on what's fastest.

Either way, we should be able to run Folding@Home on AMD. In fact, option (1) should already be possible on your computer. Even so, there's still good reason to make MoltenCL compatible with AMD: MoltenCL will back the hipSYCL Metal backend, making it possible to optimize other code bases like GROMACS for Intel Macs.

theschles commented 2 years ago

Which way do the OpenMM people want to go?

philipturner commented 2 years ago

> Which way do the OpenMM people want to go?

I have no idea at the moment, but the easiest way might be supporting AMD GPUs with the current OpenCL driver. MoltenCL interacts with a lot of low-level assembly compilation, creating several opportunities for troublesome bugs. It might save time to minimize how many platforms they deploy MoltenCL on. The only big reason to use MoltenCL on AMD would be subgroup shuffles/reductions, which OpenMM might not use extensively.

philipturner commented 2 years ago

Would you mind running this inside the Swift REPL? I'm trying to support OpenCL profiling through Metal; profiling is a mandatory feature in OpenCL 2.0 and 3.0. x86 and Apple silicon devices will require two slightly different profiling methods.

```swift
import Metal
MTLCopyAllDevices().forEach { device in
    print(device.supportsCounterSampling(.atDispatchBoundary))
    print(device.supportsCounterSampling(.atBlitBoundary))
    print(device.supportsCounterSampling(.atDrawBoundary))
    print(device.supportsCounterSampling(.atStageBoundary))
    print(device.supportsCounterSampling(.atTileDispatchBoundary))
}
```
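
For context, here is a minimal sketch of the coarser fallback that works regardless of counter-sampling support: per-command-buffer timing via MTLCommandBuffer's gpuStartTime/gpuEndTime. It illustrates the general approach, not MoltenCL's actual implementation:

```swift
import Metal

// Times an entire command buffer on the GPU timeline. Finer-grained profiling
// would attach counter sample buffers at whichever boundaries the five
// booleans above report as supported.
let device = MTLCopyAllDevices().first!
let queue = device.makeCommandQueue()!
let commandBuffer = queue.makeCommandBuffer()!
// ... encode compute work here ...
commandBuffer.commit()
commandBuffer.waitUntilCompleted()
print("GPU time:", commandBuffer.gpuEndTime - commandBuffer.gpuStartTime, "seconds")
```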
philipturner commented 2 years ago

> Which way do the OpenMM people want to go?

I recently had a talk with Mr. Chodera and Mr. Eastman. Although I do not speak for their interests or opinions, we might put most effort toward improving performance on Apple GPUs. That means Mac AMD GPUs could remain on the Apple OpenCL driver, while Apple GPUs use something new. Creating a custom Metal backend is now a possibility again. As for FP64 emulation, we either create a better summation algorithm using FP32, or use metal-float64 as a Metal library.
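For reference, one candidate for "a better summation algorithm using FP32" is compensated (Kahan) summation; the sketch below illustrates the idea, and is not OpenMM's actual plan:

```swift
// Kahan compensated summation: tracks the rounding error of each addition in
// a separate FP32 variable, recovering most of the precision a naive FP32
// accumulator loses.
func kahanSum(_ values: [Float]) -> Float {
  var sum: Float = 0
  var compensation: Float = 0  // running low-order bits lost to rounding
  for value in values {
    let y = value - compensation
    let t = sum + y
    compensation = (t - sum) - y  // recovers what rounding discarded from y
    sum = t
  }
  return sum
}

let values = [Float](repeating: 0.1, count: 1_000_000)
print(values.reduce(0, +))  // drifts visibly away from 100000
print(kahanSum(values))     // stays at ~100000, the correctly rounded sum
```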

If OpenMM does use Metal directly, there's little motivation for me to finish MoltenCL. The other work I'm planning should go straight into hipSYCL. Even in that case, your investigation helped me learn a lot of new things. Thanks for helping me out!

theschles commented 2 years ago

@philipturner

MoltenCL/Foo on  main [⇣?] 
➜ swift repl
Welcome to Apple Swift version 5.7.1 (swiftlang-5.7.1.135.3 clang-1400.0.29.51).
Type :help for assistance.
  1> import Metal 
  2. MTLCopyAllDevices().forEach { device in 
  3.     print(device.supportsCounterSampling(.atDispatchBoundary)) 
  4.     print(device.supportsCounterSampling(.atBlitBoundary))  
  5.     print(device.supportsCounterSampling(.atDrawBoundary))  
  6.     print(device.supportsCounterSampling(.atStageBoundary))   
  7.     print(device.supportsCounterSampling(.atTileDispatchBoundary))   
  8. }
true
true
true
false
false
true
true
true
false
false
  9>  
philipturner commented 1 year ago

I just got some exciting news regarding FP64 emulation performance. The overhead of function calls will not be the bottleneck, at least with 4-wide vectorization. I tested it on the Apple architecture, but would you be open to helping me test on the AMD architecture? This might decide whether a new Metal-based backend can be used for AMD, instead of OpenCL.

Random benchmark with two int32 adds per operation (not FP64 emulation).
```
// - Theoretical maximum speed: 10.4 TFLOPS
// - Fastest speed without a function call: 3.53 tera-ops x 2 adds (1:1.47)
// - Fastest speed with function call, 1-wide scalar: 183 giga-ops (1:56.8)
// - Fastest speed with function call, 2-wide vector: 360 giga-ops (1:28.9)
// - Fastest speed with function call, 4-wide vector: 701 giga-ops (1:14.8)
// - Given proper vectorization, function call overhead will not be the primary
//   bottleneck.
```
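
As a quick sanity check on those ratios (the peak rate divided by each achieved rate, rounded):

```swift
import Foundation

// Reproduces the 1:N ratios quoted above from the raw rates.
let peak = 10.4e12  // theoretical FP32 ops/s
let rates = [
  ("no function call", 3.53e12 * 2),  // two int32 adds per operation
  ("1-wide scalar call", 183e9),
  ("2-wide vector call", 360e9),
  ("4-wide vector call", 701e9),
]
for (label, rate) in rates {
  print(label, String(format: "1:%.1f", peak / rate))
}
// no call 1:1.5, 1-wide 1:56.8, 2-wide 1:28.9, 4-wide 1:14.8
```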

This would also help Intel Mac users running other computational chemistry libraries (like INQ), which use double precision for all calculations and don't use OpenCL. Metal would avoid needing to boot Linux for ROCm.

theschles commented 1 year ago

> I just got some exciting news regarding FP64 emulation performance. The overhead of function calls will not be the bottleneck, at least with 4-wide vectorization. I tested it on the Apple architecture, but would you be open to helping me test on the AMD architecture? This might decide whether a new Metal-based backend can be used for AMD, instead of OpenCL.
>
> […]

Here to help, @philipturner! Please make sure to tag @theschles so my GitHub notification icon lights up.

philipturner commented 1 year ago

I'll remember to do that. I'll be off for winter break very soon, and I hope to finally complete metal-float64. I'll let you know when it's time to test it out. Once the project's complete, I can get to work on the OpenMM Metal backend.

philipturner commented 1 year ago

@theschles this isn't entirely related to OpenMM, but I've been trying to figure out something strange about Unreal Engine 5 and Apple. Apple supposedly added a certain hardware instruction to the M2 GPU just to run Nanite. I'm wondering whether Apple also exposed this instruction on AMD GPUs, through Metal's atomic_max_explicit. clinfo says your GPU's hardware supports such atomics (cl_khr_int64_extended_atomics).
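
For background, this is the packing trick that makes a single 64-bit atomic max useful for Nanite-style rasterization, as commonly described; the sketch is an illustration, not code from Unreal or from this thread:

```swift
// Pack depth into the high 32 bits and a visibility payload into the low 32
// bits; one 64-bit atomic max then keeps the winning fragment and its payload
// together. The bit pattern of a non-negative float preserves ordering, which
// is what lets a plain integer max compare depths.
func packVisibility(depth: Float, payload: UInt32) -> UInt64 {
  precondition(depth >= 0, "the float-ordering trick needs non-negative depth")
  return UInt64(depth.bitPattern) << 32 | UInt64(payload)
}

let near = packVisibility(depth: 0.25, payload: 7)
let far = packVisibility(depth: 0.75, payload: 9)
print(max(near, far) == far) // true: the larger depth value wins the max
```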

Are you open to checking out this directory and testing the script there? The README should give instructions. The boolean you have to flip is here, and I'd like to know the behavior with both true and false. On an unrelated note, I've recently emulated in-place 64-bit atomics on the M1 GPU, so the Nanite workaround might be unnecessary. I'll have you test it on AMD once metal-float64 is complete and its benchmarks are fully automated.

theschles commented 1 year ago

Hi @philipturner, just saw your message (crazy at work). I’ll try that out sometime in the next few days…

philipturner commented 1 year ago

Also, would you mind running the following in the Swift REPL? Copy what it prints into a comment. Repeat that ~5 times, and say whether you get devices in a different order.

import Metal
print(MTLCopyAllDevices().map { $0.name })
theschles commented 1 year ago

> Also, would you mind running the following in the Swift REPL? Copy what it prints into a comment. Repeat that ~5 times, and say whether you get devices in a different order.
>
> ```swift
> import Metal
> print(MTLCopyAllDevices().map { $0.name })
> ```

Hi @philipturner

Same every time:

Output

```swift
MoltenCL on  main [⇣?]
➜ swift repl
Welcome to Apple Swift version 5.7.2 (swiftlang-5.7.2.135.5 clang-1400.0.29.51).
Type :help for assistance.
  1> import Metal
  2> print(MTLCopyAllDevices().map { $0.name })
["AMD Radeon Pro 560", "Intel(R) HD Graphics 630"]
  3> print(MTLCopyAllDevices().map { $0.name })
["AMD Radeon Pro 560", "Intel(R) HD Graphics 630"]
  4> print(MTLCopyAllDevices().map { $0.name })
["AMD Radeon Pro 560", "Intel(R) HD Graphics 630"]
  5> print(MTLCopyAllDevices().map { $0.name })
["AMD Radeon Pro 560", "Intel(R) HD Graphics 630"]
  6> print(MTLCopyAllDevices().map { $0.name })
["AMD Radeon Pro 560", "Intel(R) HD Graphics 630"]
  7> print(MTLCopyAllDevices().map { $0.name })
["AMD Radeon Pro 560", "Intel(R) HD Graphics 630"]
  8> print(MTLCopyAllDevices().map { $0.name })
["AMD Radeon Pro 560", "Intel(R) HD Graphics 630"]
  9>
```

philipturner commented 1 year ago

I only ask because I'm designing metal-float64 and the OpenMM Metal backend to support AMD GPUs. Even if they end up using OpenCL, it's not too hard to add a little extra logic just in case.

theschles commented 1 year ago

> @theschles this isn't entirely related to OpenMM, but I've been trying to figure out something strange about Unreal Engine 5 and Apple. […]
>
> Are you open to checking out this directory and testing the script there? […] I'd like to know the behavior with both true and false.

Hi @philipturner

emulating64BitAtomics = false:

2023-01-11 08:08:41.854304-0800 foo[79021:1885378] Metal GPU Frame Capture Enabled
2023-01-11 08:08:41.856261-0800 foo[79021:1885378] Metal API Validation Enabled
validateNewTexture:79: failed assertion `BytesPerRow of a buffer-backed texture with pixelFormat(MTLPixelFormatRG32Uint) must be aligned to 512 bytes, found bytesPerRow(16)'
(lldb)
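
For what it's worth, the false-branch failure above looks like a texture-alignment issue rather than an atomics issue. A hedged sketch of the usual fix, since the test repository itself isn't shown in this thread:

```swift
import Metal

// Buffer-backed textures require bytesPerRow rounded up to the device's
// linear-texture alignment; the assertion reports that alignment as 512 bytes.
let device = MTLCopyAllDevices().first!
let alignment = device.minimumLinearTextureAlignment(for: .rg32Uint)
let unalignedBytesPerRow = 16  // what the failing run used
let bytesPerRow = (unalignedBytesPerRow + alignment - 1) / alignment * alignment
print(bytesPerRow)  // 512 on this GPU, satisfying the assertion
```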

===

emulating64BitAtomics = true:

2023-01-11 08:11:10.875466-0800 foo[79181:1890911] Metal GPU Frame Capture Enabled
2023-01-11 08:11:10.876907-0800 foo[79181:1890911] Metal API Validation Enabled
2023-01-11 08:11:11.773086-0800 foo[79181:1890911] Compiler failed with XPC_ERROR_CONNECTION_INTERRUPTED
2023-01-11 08:11:11.773787-0800 foo[79181:1890911] MTLCompiler: Compilation failed with XPC_ERROR_CONNECTION_INTERRUPTED on 1 try
2023-01-11 08:11:21.864531-0800 foo[79181:1890911] Compiler failed with XPC_ERROR_CONNECTION_INTERRUPTED
2023-01-11 08:11:21.864654-0800 foo[79181:1890911] MTLCompiler: Compilation failed with XPC_ERROR_CONNECTION_INTERRUPTED on 2 try
2023-01-11 08:11:31.930745-0800 foo[79181:1890911] Compiler failed with XPC_ERROR_CONNECTION_INTERRUPTED
2023-01-11 08:11:31.930867-0800 foo[79181:1890911] MTLCompiler: Compilation failed with XPC_ERROR_CONNECTION_INTERRUPTED on 3 try
foo/main.swift:40: Fatal error: 'try!' expression unexpectedly raised an error: Error Domain=CompilerError Code=2 "Compiler encountered an internal error" UserInfo={NSLocalizedDescription=Compiler encountered an internal error}
2023-01-11 08:11:31.940098-0800 foo[79181:1890911] foo/main.swift:40: Fatal error: 'try!' expression unexpectedly raised an error: Error Domain=CompilerError Code=2 "Compiler encountered an internal error" UserInfo={NSLocalizedDescription=Compiler encountered an internal error}
(lldb) 
philipturner commented 1 year ago

That's all I needed. Thanks!

philipturner commented 1 year ago

@theschles would you mind running the Nanite atomics test again? I added new patches.

theschles commented 1 year ago

> @theschles would you mind running the Nanite atomics test again? I added new patches.

Roger that. Apologies, have been out sick for the last week.

Also, another bit of fun just before illness struck: the screen backlight of my Intel-based 2017 MBP with the AMD Radeon died. I'm still running it, though now it's connected to an external monitor; it's basically acting as a Mac mini, since the cost to repair doesn't make sense. I can thus still run AMD tests on macOS.

Meanwhile, so I can have portability, I've purchased a refurbished 2021 MBP M1 Max 14" with 32 GB RAM. I'm ready to try out GPU processing on it :)

philipturner commented 1 year ago

Now that I've found the issues, I don't think we need to test Boot Camp anymore. When you get the 2021 MBP, ProMotion will be a game changer. The old M1 Max is fine; you don't need the M2 Max to experience it. Did you get the 24-core or 32-core version?

philipturner commented 1 year ago

@theschles I'd like to archive this repository. Would you mind moving relevant discussion to LinkedIn DMs or OpenMM threads? I got someone else with an AMD GPU to perform the relevant testing for Nanite.