iree-org / iree

A retargetable MLIR-based machine learning compiler and runtime toolkit.
http://iree.dev/
Apache License 2.0
2.82k stars 608 forks source link

Migrate to Vulkan 1.2 (or 1.1 with timeline semaphores) on Android #4494

Open benvanik opened 3 years ago

benvanik commented 3 years ago

Now that Android is slowly starting to get timeline semaphores (via the extension or the rumored 1.2 core update) we can investigate cutting the support for 1.1 (or at least timeline semaphore emulation). Most of this is a projected product decision instead of a technical one: what consumers of IREE on Android would want to ship with Vulkan GPU support on systems without timeline semaphores. This was much less clear at the start of last year but now thanks to the existence proof of Mali's getting the extension we may be able to reevaluate, especially in consideration of who would want GPU support (vs CPU support) and when they would ship. For example, if we target end-of-2021 for shipping apps using the GPU on Android we may be safe to perform this cut sooner rather than later. Given that we will likely always need to ship a CPU path on Android for any app to be usable it's likely fine to progressively widen the device support by time (vs. by additional fallbacks/emulation). We still have desktops that support 1.2 today and a lot of the code generation and runtime work done there will directly benefit Android.

Bumping to 1.2 would get us assurances for features like descriptor indexing (reducing descriptor set overhead) and SPIR-V 1.5 as well as additional data types (8-bit/16-bit storage for shader storage buffers, better 64-bit support, etc). Most of these are present as extensions in 1.1 that are available on many devices anyway but not having to handle fallback paths would keep the entire stack (compiler + runtime) simpler.

No hard timeline here just would be good to do some legwork and figure out what we should be planning for this year.

antiagainst commented 3 years ago

While it would be nice to have a single version to rule all platforms, in reality, I wonder how likely that will be the case. Android is just the huge elephant. :D It would be fantastic if we can see v1.2 on Android but I don't know when that will happen or whether it will happen at all.

IMO, we should investigate on a per extension/feature basis instead of saying we blanketly require v1.2. As far as the implementation supports all the features we'd like (be it via a bunch of extensions or v1.2), we should be fine. Compared to relying on v1.2, the overhead is probing these extensions, which should be okay. Ultimately, I guess we need to push standardizing a bunch of compute features as profiles we'd like to see for IREE in either the Android OS or the Vulkan spec.

But yeah, agreed this is mostly a product question and we can decide when the product story is more full fledged.

For timeline semaphore specifically, we should certainly #ifdef 0 it away if the underlying device already natively supports it.

benvanik commented 3 years ago

The reason to have core spec versions - even if they are mostly rollups of prior extensions - is that probing is not as easy as you make it out to be: each new conditional bit of code dependent on the presence of an extension is a new uncovered codepath that needs testing and maintenance. Given our current situation with CI where we don't have coverage of the two switches we already have I'm making the prediction we'll add more new switches faster than we can bring up tests for them. I am tempted to say that no new switches get added unless a CI gets brought up with VK_LAYER_LUNARG_device_simulation to test them :)

Vulkan 1.3 (and matching SPIR-V version) is the rollup with all the OpenCL features required for clspv so I highly suspect that Android will be moving to that eventually no matter how much they want to keep dragging their feet.

TLDR: it's extremely expensive maintenance-wise and code-size-wise to maintain alternate infrequently-used device/environment-dependent codepaths so if we can at all avoid them it's worth doing so even at the cost of platform compatibility. Maybe when we have the GPU path far exceeding the performance of the CPU path on systems that haven't had a driver update since Jan 2020 we can re-evaluate how much we want to spend on it, but right now and over the next few quarters we have nothing shipping so it's pure tax :)

sjfricke commented 2 years ago

but not having to handle fallback paths would keep the entire stack (compiler + runtime) simpler.

So something I haven't seen explicitly spelled out anywhere is what is the "if these aren't supported, the device isn't supported and falls back to CPU"

If there is going to be a fallback regardless, why not just always aim for Vulkan 1.1 and that way the setup logic is always the same (check extension, enable, etc) regardless. If there is a 1.3 promoted extension desired the current situation gets worse while the "always 1.1" method would work the same

benvanik commented 2 years ago

I'm not sure I follow what you're getting at?

ScottTodd commented 2 years ago

So something I haven't seen explicitly spelled out anywhere is what is the "if these aren't supported, the device isn't supported and falls back to CPU"

Falling back from GPU to CPU is currently left up to applications to handle. For example: if you try to create a GPU device and encounter errors, run out of GPU memory, or have other GPU workloads come through and want to rebalance compute resources, then you could have your application create a CPU device and route execution to it instead.

I don't think the low level IREE library itself should take opinions on when/how to fall back from one device to another. A higher level library built on top of IREE could, though.

sjfricke commented 2 years ago

Falling back from GPU to CPU is currently left up to applications to handle.

If I have a device with only 1.0 Vulkan, IREE will fail to create a Vulkan instance, which yes the applications then decide to fall back to the CPU.

My question is "when" IREE will fail to create the instance, I guess IREE_HAL_VULKAN_EXTENSIBILITY_*_EXTENSIONS_REQUIRED is what I was looking for

benvanik commented 2 years ago

Correct; for when we create the devices. We also support wrapping devices from the hosting application (iree_hal_vulkan_wrap_device) which is welcome to require more than what we do but must at least satisfy the required extensions/features as returned by iree_hal_vulkan_query_extensibility_set: https://github.com/iree-org/iree/blob/7fa5674492cc00b8dbe23626f78c08e2ee09c809/runtime/src/iree/hal/drivers/vulkan/vulkan_device.cc#L700-L714 (enabled_extensions is what the needs to be passed in VkDeviceCreateInfo::ppEnabledExtensionNames, in addition to whatever the application itself may want)

sjfricke commented 2 years ago

I'm not sure I follow what you're getting at?

So if there are the required extensions, and 1.1 and 1.2 only difference is just wrapping some ext to core (most are still optionally supported extensions)... What is the reason/argument to not just make IREE "Vulkan 1.1 with required extensions" instead of having this 1.1 vs 1.2 per platform stuff at all?

benvanik commented 2 years ago

Ah, gotcha. It's about maintenance burden: it's not good to have a lot of difficult to test code around extension management if it's not required and if the minimum platform level can be raised to allow for simpler code with fewer failure modes that's a good thing. So I'd flip the question around: why have a bunch of additional code for handling things that no one needs? The answer may be the same ("someone needs Vulkan 1.1 support") but the implication is different: we can't afford to keep around every code path for all backends for all platforms for all versions and if something like Vulkan roll-up releases allows us to cull that combinatorial explosion then it's worth taking advantage of when we can.

We have the virtue at this point in time of being new code without legacy users: at some point we're going to lose that and be stuck with what we have for certain compatibility windows and the less we need to bring into that process the better.

benvanik commented 2 years ago

Oh but concretely I don't like that we have some #if android version switch thing - that's weird to me. I believe the reason was swiftshader not being 1.2. It looks like it is now and we can remove that #if.

sjfricke commented 2 years ago

it's not good to have a lot of difficult to test code around extension management

So for timelineSemaphore sure that can be assumed on 1.2, but if you want storageBuffer8BitAccess that is still optional on 1.2 so your "branch logic" just went from

if ( VK_KHR_8bit_storage == supported) {
     add_support()
} else {
     no_support();
}

to

if ( VkPhysicalDeviceVulkan12Features.storageBuffer8BitAccess == supported) {
     add_support()
} else {
     no_support();
}

so there is still an equal amount of permutation to maintain

benvanik commented 2 years ago

The above is oversimplifying: extensions are much heavier weight to deal with than a single boolean field in a composite structure that is queried anyway (they require all the code to query, track, and propagate that throughout the program). Are you arguing that we should support Vulkan 1.0 in perpetuity and have several thousand lines of code for managing extensions? There's definite tradeoffs here (if we need to support 1.1 for some time then that means we have to do the extra work, but if we don't then we shouldn't) but I'm not sure I know what you're expecting? :)

sjfricke commented 2 years ago

Are you arguing that we should support Vulkan 1.0

That is a fair counter argument

here's definite tradeoffs here / I'm not sure I know what you're expecting?

I am painfully aware of how much of a monster the Vulkan versioning/extension system became (and Android being at 1.1)... guess playing devil's advocate on what is actually important or not here

benvanik commented 2 years ago

guess playing devil's advocate on what is actually important or not here

It's useful! This issue has been lingering for awhile because we haven't been able to decide if we can cut off Vulkan 1.1. I had forgot the #if android existed - it looks like it was a transient thing that never got cleaned up and it's as confusing to the intent as you say. Given that we obviously support running on 1.1 with current code I'd say we just make it 1.1 everywhere and re-evaluate our min target when we have enough information to do so - or we age enough that there's no practically targetable devices without 1.2 :P