Research implementing a task based threading implementation

peterclemenko commented 12 years ago

A task based threading implementation would allow for better performance on modern hardware. Recent tests I have run indicate that a lack of threading is causing some serious performance issues, most evident while running the Physics examples.

A possible implementation for threading could be based on a work stealing task queue implemented in the cpptask library: http://code.google.com/p/cpptask/
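For readers unfamiliar with the term, here is a minimal sketch of a work stealing task queue using only the C++ standard library. It does not reproduce cpptask's API: `WorkStealingPool`, `Submit`, `Run` and `RunCounted` are hypothetical names chosen for illustration. It also assumes all tasks are queued before `Run()` is called and that tasks do not spawn further tasks. Each worker takes work from the back of its own deque and, when that is empty, steals from the front of another worker's deque.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

using Task = std::function<void()>;

// Hypothetical illustration only; not the cpptask or Tundra API.
class WorkStealingPool {
public:
    explicit WorkStealingPool(std::size_t workers) : queues_(workers) {
        assert(workers > 0);
    }

    // Queue a task on one worker's deque. Assumed to be called before Run().
    void Submit(std::size_t worker, Task task) {
        assert(worker < queues_.size());
        std::lock_guard<std::mutex> lock(queues_[worker].mutex);
        queues_[worker].tasks.push_back(std::move(task));
    }

    // Start one thread per deque and return once every deque is empty.
    void Run() {
        std::vector<std::thread> threads;
        for (std::size_t i = 0; i < queues_.size(); ++i)
            threads.emplace_back([this, i] { WorkerLoop(i); });
        for (std::thread& t : threads) t.join();
    }

private:
    struct Queue {
        std::mutex mutex;
        std::deque<Task> tasks;
    };

    // The owner takes the most recently queued task from the back.
    bool PopOwn(std::size_t self, Task& out) {
        Queue& q = queues_[self];
        std::lock_guard<std::mutex> lock(q.mutex);
        if (q.tasks.empty()) return false;
        out = std::move(q.tasks.back());
        q.tasks.pop_back();
        return true;
    }

    // A thief takes the oldest task from the front of another worker's deque.
    bool Steal(std::size_t thief, Task& out) {
        for (std::size_t offset = 1; offset < queues_.size(); ++offset) {
            Queue& victim = queues_[(thief + offset) % queues_.size()];
            std::lock_guard<std::mutex> lock(victim.mutex);
            if (victim.tasks.empty()) continue;
            out = std::move(victim.tasks.front());
            victim.tasks.pop_front();
            return true;
        }
        return false;
    }

    void WorkerLoop(std::size_t self) {
        Task task;
        // Deques only shrink during Run(), so "nothing found" means this worker is done.
        while (PopOwn(self, task) || Steal(self, task))
            task();
    }

    std::vector<Queue> queues_;
};

// Puts every task on worker 0 so the other workers can only make progress by stealing.
std::size_t RunCounted(std::size_t workers, std::size_t tasks) {
    WorkStealingPool pool(workers);
    std::atomic<std::size_t> executed{0};
    for (std::size_t i = 0; i < tasks; ++i)
        pool.Submit(0, [&executed] { ++executed; });
    pool.Run();
    return executed.load();
}
```

Calling `RunCounted(4, 1000)` queues 1000 tasks on a single deque and lets four workers drain it. A production implementation would typically use lock-free deques and let tasks spawn subtasks; the mutex per deque here is a simplification.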

antont commented 12 years ago

Running well on multicore devices was actually on the table already 3.5 years ago, when Naali (which later became Tundra) was still on the drawing board. Ryan was leading the planning then and pointed us to a then-new demo by an Intel research guy about a parallel game engine architecture.

We referred to that briefly in an early investigation in http://playsign.net/engine/ViewerArchitectures#a-design-for-a-parallel-game-engine-aka-the-intel-paper-smoke-demo . I think there were more writeups about the matter in the 'realXtend next generation' planning docs in the old wiki back then, but that's not online anymore and is probably outdated and irrelevant now anyway as time has passed. That Intel thing is however at http://software.intel.com/en-us/articles/smoke-game-technology-demo/ and perhaps interesting to check still. The basic conclusion there was that 1 thread per core is good, and that subsystem division (modules/plugins) and execution division are good to have separately.

Later in 2010 Ryan also made a 'framework2' experiment to overcome this problem: a rewritten core that could replace the initial one, which is still current. That's called Scaffold, and there everything is made in Tasks and the architecture is interestingly unified overall (for example the entity system is more inherent, e.g. the main application itself is also an entity, anything can subscribe to know about changes anywhere etc.): https://github.com/sempuki/scaffold

In the talks back then Jukka figured that the Smoke system is basically a Software Transactional Memory (STM) implementation, which was also I think a bit cool and new then. It was then seen as too complex to start writing the whole project around, so the current simple singlethreaded mainloop etc. was made instead.

During the work later on, regarding threads and parallelization, it seemed ok that it can be dealt with inside modules. That is, the core system does not force everyone to write everything with e.g. Tasks that the core could then schedule and run in parallel etc. Instead, any module/plugin is free to do threading etc. how it wants, whatever is best for the job there.

There have been quite recent talks about dealing with the physics load. One idea that the Chiru folks I think looked into a bit was simply running physics in a separate Tundra instance, which is a different way to parallelize / offload the scene server mainloop. I'm not sure if they actually got to test that.

What is your issue / setup exactly? Were you running standalone, with both graphics and physics in the same process? In the typical client-server setup with realXtend services the servers are headless, so they don't do drawing but mostly just physics and some logic depending on the service. And clients don't do physics but graphics only. So the load is already shared and this problem doesn't manifest the same way as in standalone. A problem that we have had is when both scripting & physics load are high -- there one route would be also to separate script engines to new threads or processes / computers even.

Anyhow, thanks for the feedback and the pointer -- I haven't looked into this recently, will check that cpptask lib out, dunno if Jukka knows it already. Perhaps time is ripe for some nice new architecture here, or we can go with the idea of just solving this specifically in the Physics, Scripting etc. plugins. Some things already are threaded, like asset loading etc.

peterclemenko commented 12 years ago

Yeah, I've been running standalone. The idea I'm testing using RealXtend for is not an MMO/SL-style viewer so much as a system for a singleplayer/multiplayer platform. While on one hand a headless server may be able to get away with that, when it comes to singleplayer it's not exactly ideal to run a separate server for physics on the same system. The other thing to note is that if each module handles its own threading, that can cause race conditions if, for example, you have 8 modules that each run on their own thread, and a dual or quad core system with 2 or 4 threads.

jonnenauha commented 12 years ago

@th3flyboy As Toni noted there, plugins can implement threading in their logic if they want to. We already give multiple ways (deps) to work with, so you can even pick and choose what you like :)

First one is boost threads (this is what Ogre uses for example). Second one is Qt threads. Both are multiplatform and probably more feature rich than your cpptask library. So I don't think we need to introduce a new dependency at least for this.

I think there is not much point in the framework providing some TundraTask class that takes in whatever and signals a result when done etc. Modules can do threading or task workers themselves too. I mean we can do some kind of worker thing, we already had that in Naali but it got removed. I just wonder, if we make a generic thing, will it even suit most needs and be used at all.

Here is for example how I threaded mumble network and audio, both to their separate threads. It was some work to get going but once it's done the whole voip thing has a very small impact on the mainloop even with lots of traffic in voice.

Entry: https://github.com/realXtend/naali/blob/tundra2/src/Application/MumblePlugin/MumblePlugin.cpp#L210

Network: https://github.com/realXtend/naali/blob/tundra2/src/Application/MumblePlugin/MumbleNetworkHandler.cpp#L94

Audio: https://github.com/realXtend/naali/blob/tundra2/src/Application/MumblePlugin/AudioProcessor.cpp#L99

You also have to remember that you cannot access Framework or its APIs or other modules for that matter from other threads. This is not safe. If you need things from the framework in your worker thread, you need to send an event to the main thread and push your results back to the worker. I don't think you will have any race conditions etc. even with many modules doing threads, if they do it correctly.
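The round trip described above (worker asks the main thread, main thread pushes the answer back) can be sketched with the standard library alone. This is not Tundra's or Qt's event API: `MainThreadQueue`, `Post`, `Drain` and `WorkerReadsSceneValue` are hypothetical names, and `sceneValue` merely stands in for state that only the main thread may touch.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <future>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

// Hypothetical stand-in for "send an event to the main thread"; not a Tundra API.
class MainThreadQueue {
public:
    // Safe to call from any thread.
    void Post(std::function<void()> fn) {
        assert(fn != nullptr);
        std::lock_guard<std::mutex> lock(mutex_);
        pending_.push(std::move(fn));
    }

    // Call only from the main thread, e.g. once per frame. Returns how many ran.
    std::size_t Drain() {
        std::queue<std::function<void()>> batch;
        {
            std::lock_guard<std::mutex> lock(mutex_);
            std::swap(batch, pending_);
        }
        std::size_t ran = 0;
        for (; !batch.empty(); batch.pop(), ++ran) batch.front()();
        return ran;
    }

private:
    std::mutex mutex_;
    std::queue<std::function<void()>> pending_;
};

// The worker never reads sceneValue itself; it asks the main thread and waits for the reply.
int WorkerReadsSceneValue(int sceneValue) {
    MainThreadQueue queue;
    std::atomic<bool> done{false};
    int seenByWorker = 0;

    std::thread worker([&queue, &done, &seenByWorker, &sceneValue] {
        auto reply = std::make_shared<std::promise<int>>();
        std::future<int> answer = reply->get_future();
        // This closure runs later on the main thread, inside Drain().
        queue.Post([reply, &sceneValue] { reply->set_value(sceneValue); });
        seenByWorker = answer.get();  // blocks until the main thread has answered
        done = true;
    });

    // Stand-in for the main loop: keep servicing posted requests until the worker finishes.
    while (!done) {
        queue.Drain();
        std::this_thread::yield();
    }
    worker.join();
    return seenByWorker;
}
```

A real main loop would call `Drain()` once per frame instead of spinning. With Qt, a queued signal/slot connection plays the role of `Post`.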

So I wonder if this ticket should be called "Thread physics module as much as we can" instead? That is what you are having issues with, right? :)

peterclemenko commented 12 years ago

Physics is one aspect, however I have a feeling that just putting it in a separate thread for now would be enough. It looks like Ogre may implement their own tasking/threading system in the next release or two, so it may be better to wait and see what Ogre does. As for Bullet, one thing to consider would be moving it to another thread and using OpenCL if possible. It needs to be noted, however, that the future of computing (desktop and mobile) is looking like more cores, not faster cores. If you don't adopt threading with open arms, you are going to have the same problems Flightgear and FSX had: you won't be able to run much on the compute side (which kills more complex stuff), because by aiming at one CPU rather than n CPUs you hit an upper limit on the amount of power that can be used.

antont commented 12 years ago

Yes, the fact that the future of computing seemed to be multiple cores, not faster CPUs anymore, is the very reason why task based and other parallel solutions were on the table right in the beginning.

I figure in practice it can still be quite good if we have e.g. voice, video and physics things in separate threads or processes. And perhaps scripting etc.

So yes, separating just physics now might help a lot. I'm afraid it is not trivial, though, due to the interactions between the main Ogre scene and the physics scene. Bullet does already, AFAIK, have a copy of the scene (its own physics scene), so it might be quite straightforward too (the Intel Smoke system works so that all the parts have their own separate copy of the whole scene state, so they can work on it without locking etc.).
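The "own copy of the scene state" idea can be illustrated with a small sketch. It is not how Smoke, Bullet or Tundra are actually implemented: `Body`, `SceneState`, `StepPhysics` and `SimulateFrame` are hypothetical names, and the physics is reduced to gravity on one axis. The point is only that the worker thread steps a private snapshot, so no locks are needed while it runs, and the main thread publishes the result afterwards.

```cpp
#include <cassert>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical minimal scene state; real engine types are far richer.
struct Body {
    float y;   // height
    float vy;  // vertical velocity
};
using SceneState = std::vector<Body>;

// Steps its own copy of the state, so it never touches the caller's scene.
SceneState StepPhysics(SceneState state, float dt) {
    assert(dt > 0.0f);
    for (Body& body : state) {
        body.vy -= 9.81f * dt;   // gravity
        body.y += body.vy * dt;  // integrate position
    }
    return state;
}

// One frame: snapshot on the main thread, step on a worker, publish on the main thread.
void SimulateFrame(SceneState& mainScene, float dt) {
    SceneState snapshot = mainScene;  // private copy handed to the physics thread
    SceneState result;
    std::thread physics([&result, &snapshot, dt] {
        // Only this thread uses snapshot and result until join(), so no locking is needed.
        result = StepPhysics(snapshot, dt);
    });
    // The main thread could render mainScene here; the worker never reads or writes it.
    physics.join();
    mainScene = std::move(result);  // publish the new state from the main thread
}
```

The cost of this design is the per-frame copy and one frame of latency between the physics result and what the main scene shows. Spawning a thread per frame is also a simplification; a long-lived worker would be used in practice.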

peterclemenko commented 12 years ago

A few resources that may be helpful that I've been reading over include: http://bitsquid.blogspot.com/2010/03/task-management-practical-example.html http://bitsquid.blogspot.com/2009/10/parallel-rendering.html http://bitsquid.blogspot.com/2012/04/inheriting-velocity-in-ragdolls.html

jonnenauha commented 12 years ago

Ogre is not threading its main logic and rendering. It seems to me they are just going back on threading asset loading properly. Ogre will still be meant to be run inside the main thread and this has been said multiple times by the Ogre devs, they are not targeting to thread the whole thing. In Ogre's case it seems we would need to replace it with our own if we really want to do this efficiently. However, currently I don't think this should be the reason to jump away from Ogre; the more important thing for us now is its rendering speed in general. You can't optimize things just by putting them on another thread and saying it's "done", you will get a big list of new problems once you start threading things.

For physics, as Toni said, it will probably not be trivial. It's not just that we keep the Ogre scene and Bullet scene pretty much synced (for EC_RigidBody ents at least), but as I said earlier, any Framework or Core API access is not guaranteed to be safe from the worker thread. So you still need to do some back and forth there (push the latest data from the main thread to the physics worker). I think additional problems will arise with Qt signaling; it can be done right with threads or it can be done wrong. Qt signals can be made to trigger in the worker thread, and the worker thread's physics signals can trigger slots in the main thread. This is automatic from Qt as long as you use it correctly.

I think we will look into threading Bullet if bottlenecks are really found that show us "right, threading this will clean this stuff out", but I think we have lots of other optimizations to be done before "just thread it" becomes the best thing to do. There are already some tickets on what those things are and we kind of know the big spenders.

realXtend / tundra

Research implementing a task based threading implementation #501