kleinerm / Psychtoolbox-3

This is kleinerm's git repository for development of Psychtoolbox-3. Regular end users should stay away from it, unless instructed by him otherwise, and use the official Psychtoolbox-3 GitHub page or distribution system for production releases.

Pull in experimental Vulkan support for macOS 10.15.4+ #203

Closed kleinerm closed 3 years ago

kleinerm commented 3 years ago

This pull adds experimental basic Vulkan display backend support for macOS.

It requires at least macOS 10.15.4 Catalina to function, but has only been tested on macOS 10.15.7 Catalina final. It requires Khronos MoltenVK 1.1.1 to be installed globally on the system, i.e. in /usr/local/lib and /usr/local/include.

MoltenVK is an open-source implementation of the Vulkan 1.1 portability subset for macOS, built on top of the Metal graphics rendering API.

In principle this works as on Linux and Windows, i.e. OpenGL-Vulkan interop is used to render content in OpenGL, then transfer it to Vulkan for display. Details of the interop differ due to deficiencies of Apple's OpenGL implementation and current limitations of Vulkan for macOS:

  1. We do not export the VkDeviceMemory backing the VkImage and import it into OpenGL textures as GL_memory_objects, because Apple's prehistoric OpenGL implementation does not support the required extensions. Instead we use MoltenVK-specific extensions to create backing IOSurfaces for the VkImage interop image. Then we get a handle to the IOSurface and use Apple macOS OpenGL specific extensions to import that IOSurface as a rectangle texture. This has some limitations: MSAA textures are not possible for interop, so MSAA implies our own MSAA resolve on the OpenGL side.

  2. Apple's OpenGL does not support semaphores, so we only use a glFlush() for producer-consumer sync.

  3. The only way to block on present completion and to get timestamps of stimulus onset is via the VK_GOOGLE_display_timing extension, which MoltenVK implements on top of Metal. Our standard double-buffer trick or vkAcquireNextImageKHR() method does not work at all.
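A minimal sketch of how blocking on present completion could be built on display timing records. It is written in Python for illustration only; the real code path is C against the Vulkan API. `query_past_timing` is a hypothetical stand-in for `vkGetPastPresentationTimingGOOGLE()`, and `FakeTimingSource` is a test double, not part of PTB or MoltenVK. The record field names mirror `VkPastPresentationTimingGOOGLE`.

```python
import time
from typing import Callable, Dict, List, Optional

# Field names mirror VkPastPresentationTimingGOOGLE. The record source is a
# hypothetical stand-in for vkGetPastPresentationTimingGOOGLE().
TimingRecord = Dict[str, int]


def wait_for_present(query_past_timing: Callable[[], List[TimingRecord]],
                     present_id: int,
                     timeout_s: float = 1.0,
                     poll_interval_s: float = 0.001) -> Optional[int]:
    """Poll until a timing record for present_id shows up, or time out.

    Returns actualPresentTime in nanoseconds, or None on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        # Match on presentID instead of assuming the newest record is ours.
        for record in query_past_timing():
            if record["presentID"] == present_id:
                return record["actualPresentTime"]
        if time.monotonic() >= deadline:
            return None
        time.sleep(poll_interval_s)


class FakeTimingSource:
    """Test double: returns no records for the first delay_calls queries."""

    def __init__(self, record: TimingRecord, delay_calls: int) -> None:
        self.record = record
        self.delay_calls = delay_calls
        self.calls = 0

    def __call__(self) -> List[TimingRecord]:
        self.calls += 1
        return [self.record] if self.calls > self.delay_calls else []
```

Matching on `presentID` rather than taking the first returned record keeps the sketch usable even if the query returns records for older presents as well.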

Some serious limitations of this implementation:

  1. MoltenVK as of 1.1.2 has a buggy implementation of display timing, which never dequeues presentation timing records. We need to work around that spec violation until I manage to upstream proper fixes to MoltenVK, or somebody else does.

  2. At least on macOS 10.15.7, Direct-to-display (DtD) mode is seriously buggy and causes random malfunctions, so we disable it. This is not a big loss, as DtD also does not remove the 1 frame of extra compositor latency, contrary to what Apple PR promises. Even in DtD, the DisplayServer/Compositor is still involved in present scheduling. Due to deficient scheduling in the DisplayServer (one service pass starting about 2-3 msecs into each video scanout cycle), it defers the present of each present-ready frame by one extra video refresh cycle. This means a Screen('Flip') / Vulkan present can happen at most every 2nd video refresh cycle, limiting the framerate to half the video refresh rate whenever an actual wait for stimulus onset + timestamping is needed and PTB is thereby throttled on present completion. In other words, in 99% of all useful visual stimulation paradigms, performance will be severely limited.

This was tested on a MacBookPro 2017 with AMD Radeon Pro 560 under macOS 10.15.7. The problem does not just affect PTB / MoltenVK, but was also verified with Apple's original "best practices" Metal sample code. In other words, this is a macOS Metal bug, unfixable by us. Using Apple's "Instruments" Metal trace tools, I could confirm this behaviour at the lowest level.
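The frame-rate halving described above can be illustrated with a toy timing model. This is a sketch under assumptions, not measured behaviour: it assumes the compositor latches at most one present-ready frame per cycle at a fixed offset after vblank (2.5 msecs here, picked from the 2-3 msecs range above), that a latched frame reaches the display at the following vblank, and that a flip loop which waits for onset only has its next frame ready after that cycle's service pass (`submit_delay_ms`, a hypothetical parameter).

```python
import math
from typing import List


def simulate_onsets(refresh_hz: float, n_frames: int,
                    service_offset_ms: float = 2.5,
                    submit_delay_ms: float = 4.0) -> List[float]:
    """Return onset times (ms) for a flip loop that waits for each onset.

    Toy model: vblanks occur at multiples of the refresh period, the
    compositor runs one service pass service_offset_ms after each vblank,
    and a frame latched by a pass is shown at the next vblank.
    """
    period = 1000.0 / refresh_hz
    onsets = []
    t = 0.0  # time of the previous onset; the loop starts at a vblank
    for _ in range(n_frames):
        submit = t + submit_delay_ms  # frame becomes present-ready here
        # Find the first service pass at or after the submit time.
        cycle = math.floor(submit / period)
        service = cycle * period + service_offset_ms
        if service < submit:
            service += period
        # The frame is displayed at the vblank following that pass.
        onset = (math.floor(service / period) + 1) * period
        onsets.append(onset)
        t = onset
    return onsets


def effective_rate_hz(onsets_ms: List[float]) -> float:
    """Mean presentation rate over a list of onset times in ms."""
    return 1000.0 * (len(onsets_ms) - 1) / (onsets_ms[-1] - onsets_ms[0])
```

Under these assumptions, `effective_rate_hz(simulate_onsets(60.0, 10))` comes out at half the 60 Hz refresh rate, because every frame misses the service pass of the cycle it was submitted in.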

  3. Timing and timestamping are unreliable! Sporadically, timestamps are reported as 0, for no obvious reason. This happens a lot during the first few presents, or after a long pause between successive presents, i.e. more than a few hundred msecs. Again, by modifying Apple's "best practices" Metal sample code, and also by tracing with "Instruments", this has been tracked down to be not a PTB or MoltenVK/Vulkan bug, but bugs in the macOS Metal implementation. In other words: not our fault, not fixable by us, only by Apple. Things that were verified to not matter:
    • Whether rendering happens on the application main thread (GUI/Event thread) or a secondary thread. Apple's sample allowed us to test and refute this hypothesis. Mods to our own code also showed no difference.
    • Peculiarities of event handling, although adding printf statements in PsychVulkanCore/PsychVulkan/Matlab script/MoltenVK seems to help a bit in PTB's case, but not in the standalone Apple sample Metal app.
    • DtD vs. standard fullscreen mode, although DtD seems to work better for Apple's Metal sample code, whereas in windowed mode we observe a lot more timestamping failures.
    • Time intervals between presents, at least not consistently. They seem to matter somewhat, sometimes, but results are inconsistent across runs and machine reboots. Sometimes present intervals < 33 msecs or > 250 msecs cause failure. Other times the limits are 17 msecs and 500 msecs, and sometimes we can go to multiple seconds before trouble happens.

All I know is that Metal is severely buggy, and I could not find any meaningful workarounds for the problems in multiple dozen hours of experimentation. This was on macOS 10.15.7, on an MBP 2017 with AMD Radeon Pro 560 GPU.

That said, in the cases where timestamps are reported as non-zero, most of the time they seem to roughly agree with reality, as measured with our FlipTimingWithRTPhotoDiodeTest.m script and Videoswitcher+RtBox hardware reference timestamping. Timestamps are not as precise and high quality as those from our PTB kernel driver, but they are ok enough for many applications. They seem to be hardware vblank interrupt handler invocation timestamps without special high precision corrections: amateurish, but ok for basic use.
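Given the sporadic zero timestamps, calling code would presumably want to treat 0 as "no timestamp" rather than as a real onset time. The following is a minimal sketch of such a check, assuming a hypothetical list `timestamps_s` of reported onset times in seconds; the function name and the returned dictionary keys are illustrative only and not part of PTB.

```python
from typing import Dict, List


def summarize_onsets(timestamps_s: List[float]) -> Dict[str, object]:
    """Separate failed (zero) onset timestamps from usable ones."""
    failed = [i for i, t in enumerate(timestamps_s) if t <= 0.0]
    # Only difference neighbours that are both valid, so a failed
    # timestamp never produces a bogus inter-frame interval.
    intervals = [b - a
                 for a, b in zip(timestamps_s, timestamps_s[1:])
                 if a > 0.0 and b > 0.0]
    rate = len(failed) / len(timestamps_s) if timestamps_s else 0.0
    return {
        "failed_indices": failed,
        "failure_rate": rate,
        "intervals_s": intervals,
    }
```

Reporting the failure rate alongside the intervals makes it easier to decide whether a given session's timing data is usable at all.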

The implemented support also provides a choice of 10 bpc and fp16 framebuffers. ColorCal-II measurements suggest the 10 bpc mode achieves 10 bpc precision and the fp16 mode achieves 11-12 bpc linear precision, however only by using an Apple proprietary dithering algorithm, not by using the GPU's fp16 scanout mode.

The implementation also provides very basic HDR display support. It is more limited than on Windows-10, and much more limited than on Linux. But at least the basics are covered: A scRGB linear colorspace, using the same methods as on MS-Windows in windowed mode. macOS does not allow querying the connected display's HDR properties. This requires at least macOS 10.15.7. Standard HDR-10 / BT2020 / PQ is not supported, so our encoding method is different from the high quality operating systems. Precision is worse than on Linux or Windows, as macOS applies its own proprietary tone-mapping over which we do not have significant control.

So in terms of the intended main purpose - getting trustworthy visual stimulation timing and timestamping without the need for our typical hacks - this current implementation is mostly a failure. But given the failure is due to Apple macOS Metal bugs, there is the faint hope that it may work better on other Apple hardware + macOS combos, or future macOS versions if Apple get their act together.

Would I trust this implementation on macOS for proper stimulation? Hell no! But maybe it is good enough for people with no or almost no need for reliable timing, or for people who need rather limited HDR support and don't want to switch to real operating systems for that purpose.

Work time spent so far: More than 150 hours.