Question : Device handling, multi-device optimisations #178

elliotwoods commented 10 years ago

If I have 2 renderers outputting to 2 separate devices, is there a way to optimise rendering (e.g. by parallelising draw calls), or is this already handled by the SwapChain?

Alternatively, is there a way to control CrossFire / SLI features at all? My current understanding is that CrossFire only occurs on the final render pass, and only on fullscreen renderers (i.e. it is unavailable in the current fullscreen modes, since as I understand it they are not true fullscreen modes).

elliotwoods commented 10 years ago

P.S. Apologies for posting here rather than in the forum, but this seemed to be a code/tech related question which may be outside the forum's scope.

mrvux commented 9 years ago

Missed this one a bit; there are quite a few things in there:

First, there is already multiple device support. Making it parallel could speed things up (or slow them down) depending on the use case. Resources would need to be created twice, so in cases where we are heavily CPU bound that could still help; in cases where we are GPU bound it's more likely going to be a slowdown.

On this note, quite some work needs to be done in the runtime (the DX11Resource type needs a proper rethink and thread-safe treatment).

Also, for most nodes' update->render calls, we need to be aware that methods can be called concurrently. Locking the whole method is not such a good idea, and adding small locks where required would make the whole thing much more complex, so I'd say more thought is needed to ease programming on this (I would really not want too much lock soup, plus that would slow down rendering in single-device scenarios).
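As a rough illustration of the "resources created twice" and thread-safety points above, here is a minimal Python sketch of a per-device resource holder guarded by one small lock. The class name `PerDeviceResource`, the `factory` callback and the integer device ids are all hypothetical; this is not the actual DX11Resource type, just a toy showing the shape of the problem.

```python
import threading


class PerDeviceResource:
    """Toy container holding one copy of a resource per device (hypothetical)."""

    def __init__(self, factory):
        self._factory = factory          # called as factory(device_id) to build a copy
        self._by_device = {}             # device_id -> resource instance
        self._lock = threading.Lock()    # one small lock around the dictionary only

    def get(self, device_id):
        """Return the resource for this device, creating it on first use."""
        with self._lock:
            if device_id not in self._by_device:
                self._by_device[device_id] = self._factory(device_id)
            return self._by_device[device_id]


created = []

def make_texture(device_id):
    # Stand-in for an expensive GPU allocation; records each creation.
    created.append(device_id)
    return f"texture@device{device_id}"

resource = PerDeviceResource(make_texture)

# Eight threads ask for the resource, four per device.
threads = [threading.Thread(target=resource.get, args=(i % 2,)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

With two devices the factory runs once per device, however many threads ask. The cost is that every lookup takes the lock, including in the single-device case, which is the slowdown concern raised above.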

For the render window, the issue for now is that the renderer is a UserControl, and fullscreen is not allowed on such controls. Also, only one fullscreen swapchain is allowed per device (a feature which was available in Vista but does not work anymore ;)

Proper fullscreen is often also required to remove tearing (funnily, I noticed that at times real fullscreen without vsync did not have tearing either).

I suppose multi window would need separate message pumps as well, so maybe some form of shared resource would work. I did try multiple message loops, and that rather works (we also need to sync window events to the vvvv UI thread, so it's not trivial either ;)

From some of my experiments, to allow better multithreading, moving from an "immediate renderer" to some form of "retained renderer" could help. I did some tests on this and it can be promising, but I guess I will experiment on that in the SharpDX version ;)

elliotwoods commented 9 years ago

I think a simple example use case might be having a very heavy pixel shader (e.g. TextureFX) running fullscreen on 2 devices to their respective video outputs.

Currently (I presume) we call draw on each of them, Renderer A first and then Renderer B, which means we're getting half the fps we would get if we issued the render calls on both GPUs simultaneously.

So in theory it may be faster to call the render on both in parallel (e.g. to run 2 instances of VVVV, one for each device).

Perhaps it's best to consider the most likely cases where multiple devices would best work independently, and design patterns for those specific cases.

However, I may be misunderstanding the current situation with draw calls (e.g. if the draws are performed asynchronously to their CPU functions then perhaps this isn't an issue at all anyway).
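The back-of-envelope reasoning above can be written out as a small calculation. This is only a sketch of the pessimistic model described in this comment: it assumes each renderer fully blocks until its GPU has finished, which may not match how draw calls are actually issued. The function names and the 20 ms figure are made up for illustration.

```python
def fps_sequential(gpu_ms_a, gpu_ms_b):
    """Frame rate if renderer B only starts once renderer A has finished."""
    return 1000.0 / (gpu_ms_a + gpu_ms_b)


def fps_parallel(gpu_ms_a, gpu_ms_b):
    """Frame rate if both GPUs work at the same time; the slower one sets the pace."""
    return 1000.0 / max(gpu_ms_a, gpu_ms_b)


# Hypothetical heavy pixel shader costing 20 ms per frame on each GPU.
sequential = fps_sequential(20.0, 20.0)   # 1000 / 40
parallel = fps_parallel(20.0, 20.0)       # 1000 / 20
```

Under these assumptions the sequential case runs at 25 fps and the parallel case at 50 fps, which is the "1/2 fps" figure mentioned above.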

mrvux commented 9 years ago

Yeah there's a lot of different cases where you could benefit or not from it.

As a side note, draw and pipeline calls are (almost) all async and sent to a command buffer (same for GL actually), but if you use the same device, commands are still serialized at the GPU level (actually, if you use the same card they are also serialized even if you have two processes).
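A toy model of that behaviour, as a hedged sketch: `ToyDevice` below is a hypothetical stand-in, not a real graphics API. `submit` returns immediately, like an async draw call, while a single worker drains the queue in order, so commands from several threads still execute one after another on the same device.

```python
import queue
import threading


class ToyDevice:
    """Hypothetical device: async submission, strictly serial execution."""

    def __init__(self, name):
        self.name = name
        self.executed = []                     # commands in the order they actually ran
        self._commands = queue.Queue()         # stands in for the command buffer
        worker = threading.Thread(target=self._drain, daemon=True)
        worker.start()

    def submit(self, command):
        """Queue a command and return at once, without waiting for it to run."""
        self._commands.put(command)

    def _drain(self):
        # Single consumer: whatever the number of submitting threads,
        # commands are executed one at a time.
        while True:
            command = self._commands.get()
            self.executed.append(command)
            self._commands.task_done()

    def flush(self):
        """Block until every queued command has been executed."""
        self._commands.join()


device = ToyDevice("gpu0")

def renderer(tag):
    for i in range(50):
        device.submit((tag, i))

threads = [threading.Thread(target=renderer, args=(tag,)) for tag in ("A", "B")]
for t in threads:
    t.start()
for t in threads:
    t.join()
device.flush()
```

In this model, two renderer threads sharing one device gain nothing on the GPU side. Only a second `ToyDevice`, with its own queue and worker, would let the two command streams progress at the same time.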

Having multiple threads, as said, is doable, but requires a decent amount of changes at node level (thread-safe dictionaries, and forbidding access to any IPluginIO for connection tests, which throws an exception).

Afaik the PluginIO modification is done in the SharpDX build (using a cached .NET property instead), the resource pool in there is thread safe (not needed for a per-device thread, but needed for deferred contexts), and there are some other bits which are better designed as "multicore friendly", so I guess it's a better starting point ;)

mrvux commented 9 years ago

Actually, back on this one lately, since I started to experiment with multi device and multi core handling (plus finding a reasonable fullscreen model, this part is a mess ;)

From what I checked now, it's possible in the options to have several render devices (different dx11modes are actually already implemented).

I also added a new mode which creates a device for every existing adapter right away (doesn't try to create / destroy those when moving renderers around). It actually makes it really easy to test on laptops as we can use both intel/nv gpus at the same time ;)

As a quick test I added a new windows node that also takes a device id as input, to force a window to use a specific adapter, which actually works pretty well (a couple of nodes don't implement multi device internally, like quad layer, but that's quick to add again; most other nodes seem to work fine, though of course it would need a decent roundup to check them all).

From my tests, I'm also allowed to call present in several threads and wait for them all (no idea if it would give a reasonable performance boost, need to profile that more).
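The "call present in several threads and wait for them all" pattern can be sketched with a thread pool. Everything here is an assumption for illustration: `present` is a placeholder for a blocking present call on one swapchain, and the window names are invented.

```python
from concurrent.futures import ThreadPoolExecutor, wait


def present(swapchain_name):
    """Placeholder for a blocking present call on one swapchain (hypothetical)."""
    return f"{swapchain_name}: presented"


def present_all(swapchain_names):
    """Present every swapchain on its own thread, then wait for all of them."""
    workers = max(1, len(swapchain_names))   # ThreadPoolExecutor needs at least 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(present, name) for name in swapchain_names]
        wait(futures)                        # returns once every present has finished
        return [f.result() for f in futures]


results = present_all(["window0", "window1"])
```

Whether this gives any speedup depends on how long each present really blocks, which is exactly what would need profiling.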

I also tried to have one thread per device for rendering, but the Update method on the resource provider is throwing me an error due to the COM apartment. It would be reasonably easy to remove it from the call (as I don't think any node even uses it anyway), but it would be a breaking change for anyone using custom plugins. Maybe a new interface, and moving native nodes towards it, would be a better starting point.

I will test that one still on a test build, quite curious what type of performance boost/loss we would get from that.

elliotwoods commented 9 years ago

I did a little reading on the multi-GPU support in DX12, which suggests that the developer will have quite a bit of free rein over making pipelines across multiple adapters. Being able to transmit a buffer from one adapter to another is especially interesting (not sure how that works with async).

How are you splitting your time between DX12 and DX11 right now?

Back to the case above... Yes, definitely interested in multi-(thread*adapter) present, and what the bottlenecks might be. Do you foresee being able to chain together pipelines between multiple adapters (e.g. some compute on GPU 1 and some compute on GPU 2, both feeding into a graphic on GPU 3)?

mrvux commented 9 years ago

About multi GPU, you can indeed share heaps (e.g. memory) across several adapters (with some restrictions, and of course you might need a round trip back to the motherboard). You need to set fences and barriers yourself for the sync, so that's up to you as well.
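To give a feel for what "setting fences yourself" means, here is a CPU-only toy in Python. It is not the DX12 API: `ToyFence` is a hypothetical class that only mimics the general idea of a fence as a counter that one side signals and the other side waits on before touching shared memory.

```python
import threading


class ToyFence:
    """Hypothetical fence: a monotonically increasing value with signal/wait."""

    def __init__(self):
        self._value = 0
        self._cond = threading.Condition()

    def signal(self, value):
        """Mark all work up to `value` as complete and wake any waiters."""
        with self._cond:
            self._value = max(self._value, value)
            self._cond.notify_all()

    def wait(self, value, timeout=None):
        """Block until the fence reaches `value`; returns False on timeout."""
        with self._cond:
            return self._cond.wait_for(lambda: self._value >= value, timeout)


shared_heap = {}          # stands in for memory visible to both adapters
fence = ToyFence()
consumed = []

def consumer_adapter():
    # The reading side must not touch the shared memory before the fence is reached.
    if fence.wait(1, timeout=5):
        consumed.append(shared_heap["buffer"])

reader = threading.Thread(target=consumer_adapter)
reader.start()

shared_heap["buffer"] = [1, 2, 3]   # producing side writes first...
fence.signal(1)                     # ...then signals that the write is complete
reader.join()
```

Nothing enforces the ordering automatically here: if the producer signalled before writing, the consumer could read stale or missing data, which is why the synchronisation is "up to you".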

For now I'm doing more tech tests on dx12, just a bunch of small standalones testing various features. I got most working, but the coding model is different, so I have not decided on any API design yet. I also might wait for Vulkan to pop around in order to see if an "abstracted" render path is a viable option (both APIs are really similar, and now there's also byte code for Vulkan, so if a decent hlsl->spir-v compiler pops up then that's definitely a path to look at).

For now multi adapter (without a render thread per adapter) works pretty well. I need to fix a couple of issues in the render graph, but that can be interesting, especially in the case where you are heavily GPU bound. I'm also adding some optimizations (caching interfaces and "render roots") to avoid calling 4v COM, which crashes under the threading model. I was also looking at having some nodes block parts of the render graph depending on which adapter you are currently on (some form of validators for adapter/window id at layer level). As for feeding data across adapters in dx11, no plans for that right now.

The main issue I need to look at is fullscreen. I got it working properly with several screens and proper modes, but some actions still release it, which is not too bad but annoying, and I don't know if Windows allows it in a multi-adapter setup; I need to set desktop comp again to test :)

mrvux commented 9 years ago

As a side note, work to move towards better multithreaded/multi device is ongoing.

I did some tests and results are pretty promising.

Sadly, for now the interfaces are using IPluginIO, and trying to access that object on another core throws an exception.

So those have been deprecated (i.e. marked obsolete): see https://github.com/mrvux/dx11-vvvv/issues/229

The next release (imminent) will keep the old interfaces, but already mark them as obsolete. Once this is out they will be removed so we can work towards that.

elliotwoods commented 9 years ago

Looking forward to a new release!

So when you 'share heaps on several adapters', you don't explicitly copy assets between the adapters? You just access the resource directly from the other adapter and the DMA operation happens transparently? (You just have to be careful about locking resources? Is that what you mean by 'fences and barriers'?)

Interested about Vulkan. Are there any simple examples of how you might make a graphics + compute workflow? What's the 'shader language' there?

Totally understand about the threading issues. You mention that you can't use exceptions any more. Does that mean you can't use unhandled exceptions, or you can't use exceptions at all?

Great to hear about these developments!

mrvux commented 9 years ago

In Vulkan they use GLSL, but now also use an intermediate language (SPIR-V) that is sent to the card. Since then there hasn't been much news, so I hope it's not going too much into "politics" and some progress gets made ;)

Doing a multi workflow is likely the same as in dx12 (separate command queues).

The problem with exceptions is that they are nasty: an unhandled exception which occurs in a separate thread = application goes poof, so I'd prefer to make sure that the whole API is solid and avoid that at all costs ;) In any case, a node that throws an exception is not too good, since it takes time to trace back as well and can unbalance rendering.
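One common way to keep a worker thread from taking the application down is to catch everything at the top of the thread body and hand the error back to the main thread. The Python sketch below only illustrates that pattern; `run_guarded` and `faulty_node` are hypothetical names, and note that Python itself only prints a traceback for an uncaught thread exception rather than ending the process, so the stakes are lower here than in the case described above.

```python
import threading


def run_guarded(work, errors):
    """Run `work` on its own thread; collect any exception instead of letting it escape."""

    def body():
        try:
            work()
        except Exception as exc:     # top-level guard for the whole thread
            errors.append(exc)       # reported back to whoever owns `errors`

    thread = threading.Thread(target=body)
    thread.start()
    return thread


def faulty_node():
    # Stand-in for a node whose render call fails.
    return 1 / 0


errors = []
worker = run_guarded(faulty_node, errors)
worker.join()
# The main thread can now inspect `errors` and decide what to do.
```

This keeps the process alive, but it does not address the other concerns raised above: the failure still has to be traced back to a node, and the frame that failed is still incomplete.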

I did not try cross-adapter heap sharing yet, so I can't tell much about it for now. I guess the transfer is done via the CrossFire link if available, and goes back through main RAM if not (the Windows display manager already does that sometimes, even on dx11).

This week I'll finish polishing the build; I guess I will release tomorrow or Wednesday, then I'll do that move (also likely some changes in window handling, but I'm the only one who did that afaik, so I won't break anything).

elliotwoods commented 9 years ago

But does that mean you can't use exceptions at all locally? I.e. can you have your own try/catch inside the plugin functions, or are all exceptions prohibited?
