intelvwi / DSG

Distributed Scene Graph

Sort of got it working #21

Status: Open. socialconcept-dev opened this issue 11 years ago.

socialconcept-dev commented 11 years ago

Server: Dual Xeon with 24 GB
OS: CentOS 64-bit
OpenSim version: Latest from OSgrid
Platform: Mono
OpenSim mode: Grid
Asset server: SRAS

Well... with very few instructions, I did get DSG up and running. I have absolutely no idea if it’s correctly configured or not. A lot of really strange behavior is going on.

I am completely lost here and have no idea where to start. Assuming my guesstimates are correct, I should have three copies of my test region running:

Scene Persistence Simulator
Physics Engine Simulator
Client Management Simulator

All of the above simulators have their own DBs, and all of them show up as additional regions on my grid. Is this correct? So we run duplicates of every region we want to enable DSG on?

I log in to the Client Management Simulator to visit the region. Is this correct? I’ll try to post a list of the events that occurred:

-- The first thing that happens is that all assets are stripped from the Client Management Simulator on startup. All that remains is Pimple Island. I have absolutely no idea why this is happening, but all you see is a bazillion of the following error messages flooding the console at the end of the startup process.

ERROR - OpenSim.Region.Framework.Scenes.EventManager [EVENT MANAGER]: Delegate for TriggerObjectBeingRemovedFromScene failed - continuing. Object reference not set to an instance of an object
  at DSG.RegionSync.RegionSyncModule.SendSpecialUpdateToRelevantSyncConnectors (System.String init_actorID, System.String logReason, UUID sendingUUID, DSG.RegionSync.SymmetricSyncMessage syncMsg) [0x00000] in :0
  at DSG.RegionSync.RegionSyncModule.OnObjectBeingRemovedFromScene (OpenSim.Region.Framework.Scenes.SceneObjectGroup sog) [0x00000] in :0
  at OpenSim.Region.Framework.Scenes.EventManager.TriggerObjectBeingRemovedFromScene (OpenSim.Region.Framework.Scenes.SceneObjectGroup obj) [0x00000] in :0

There are no issues with the asset server… If I turn off [RegionSyncModule], everything functions normally.

--- The Scene Persistence Simulator and Physics Engine Simulator seem to be set up correctly, or at least I guess so. Here is the startup message from the Scene Persistence Simulator:

WARN - DSG.RegionSync.RegionSyncModule [REGION SYNC MODULE]persist/dsg SyncStart - Sync listener is local
WARN - DSG.RegionSync.RegionSyncModule [REGION SYNC MODULE]persist/dsg: listener addr: 198.91.176.178, port: 7005
WARN - DSG.RegionSync.RegionSyncModule [REGION SYNC MODULE]persist/dsg: Starting SyncListener

--- When I log in to the Client Management Simulator, you can see all sorts of activity on both the Scene Persistence Simulator and Physics Engine Simulator consoles. Again, without documentation, I have no idea what I’m supposed to be looking for, but it must be working, as communication is occurring between all the simulators.

--- I did manage to solve, in a crude way, the deletion of all my assets from the Client Management Simulator on startup… Here’s how: I started up all the DSG simulators, then loaded an OAR of the region on the Client Management Simulator. This seems to work to some degree. At least now I can restart the Client Management Simulator and it does not strip all the assets.

Regardless… I am still hit with a barrage of these messages on startup:

ERROR - OpenSim.Region.Framework.Scenes.EventManager [EVENT MANAGER]: Delegate for TriggerObjectBeingRemovedFromScene failed - continuing. Object reference not set to an instance of an object at blah blah blah….

OK, well, at least the region is intact, even if DSG seems to believe it’s deleted all the assets.

--- I’m also seeing a bunch of these messages:

WARN - DSG.RegionSync.RegionSyncModule [REGION SYNC MODULE]client1/test SyncOutUpdates(): An update thread is already running.
WARN - DSG.RegionSync.RegionSyncModule [REGION SYNC MODULE]client1/test SyncOutUpdates(): An update thread is already running.

And when I log out of the Client Management Simulator, hundreds of these messages flood the console for almost 3 minutes:

[SYNC INFO MANAGER]: UpdateSyncInfoBySync SyncInfo for 153ab763-158f-4e5c-b041-16355623200c NOT FOUND.

--- Performance:

Not so good. When I log in to the Client Management Simulator, it takes 5 minutes for the entire scene to finish loading. Physics is laggy for the first 10 minutes, and the entire scene and physics are jittery. Once it settles down, it seems to be fine.

Both the Scene Persistence Simulator and Physics Engine Simulator processes are consuming 45% of a single CPU core each. They will eventually settle down if I sit on the region long enough. But if I start moving around, they can spike to as much as 100% of a single core.

I logged in with a second AV… The scene would not even render for this one. In some cases, upon logout, the Scene Persistence Simulator pins a single CPU core at 99% until I shut it down.

I don’t understand what role the ActorID = setting plays. Is it simply an ID for the process, or does it play some other significant role?

Very promising development for OpenSim and one we’re more than happy to help with, but we do need to verify it’s actually configured correctly. Any help would be very much appreciated.

radams1 commented 11 years ago

Your basic configuration is correct: each of the 'actors' (the DSG term for the processes that act on the scene, such as physics) is hosted in its own simulator. So, yes, you do end up with a region for each of the actors. Your configuration is correct in having three regions (physics, persistence and client manager), each with its own X and Y location and unique UUID.

The synchronization concept is that the regions for each of the actors synchronize with each other so if something changes in one region (physics moves a ball) that property update is forwarded to the other regions. The synchronization includes object creation as well as movement.
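To make the synchronization idea concrete, here is a toy Python model. It is not DSG code; `Actor`, `connect`, `local_change` and `receive` are hypothetical names. It assumes a star layout in which the physics and client manager actors connect to the persistence actor (consistent with the persistence log above, which starts a local SyncListener on port 7005), and it tags each update with the ID of the actor that originated it so the update is never echoed back to its source. The `init_actorID` parameter in the stack trace above hints that DSG tracks an originating actor too, but exactly how DSG uses ActorID is an assumption here.

```python
from dataclasses import dataclass, field


@dataclass
class Actor:
    """Toy stand-in for one DSG actor (hypothetical; not the DSG API)."""
    actor_id: str
    scene: dict = field(default_factory=dict)  # object id -> {property: value}
    peers: list = field(default_factory=list, repr=False, compare=False)

    def connect(self, other: "Actor") -> None:
        # Symmetric connection: each side forwards updates to the other.
        self.peers.append(other)
        other.peers.append(self)

    def local_change(self, obj_id: str, prop: str, value) -> None:
        # A change made by this actor's own engine, e.g. physics moves a ball.
        self._apply(obj_id, prop, value)
        self._forward(obj_id, prop, value, init_actor_id=self.actor_id, came_from=None)

    def receive(self, obj_id, prop, value, init_actor_id, sender) -> None:
        if init_actor_id == self.actor_id:
            return  # our own update came back around; ignore it
        if self.scene.get(obj_id, {}).get(prop) == value:
            return  # already up to date, so stop propagating
        self._apply(obj_id, prop, value)
        self._forward(obj_id, prop, value, init_actor_id, came_from=sender)

    def _apply(self, obj_id, prop, value) -> None:
        self.scene.setdefault(obj_id, {})[prop] = value

    def _forward(self, obj_id, prop, value, init_actor_id, came_from) -> None:
        # Relay to every peer except the one the update arrived from.
        for peer in self.peers:
            if peer is not came_from:
                peer.receive(obj_id, prop, value, init_actor_id, sender=self)


# Usage: physics and the client manager both connect to persistence.
persist = Actor("persist")
physics = Actor("physics")
client = Actor("client1")
physics.connect(persist)
client.connect(persist)

physics.local_change("ball-1", "position", (128.0, 128.0, 25.0))
print(client.scene)  # {'ball-1': {'position': (128.0, 128.0, 25.0)}}
```

The persistence actor here both stores the update and relays it, which is why a client manager with an empty scene still ends up with the ball's new position.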

The job of the 'persistence' agent is to store/persist the content of the world. The way I usually run DSG is to have the object database on the persistence server and null storage for all the other agents. This makes it so that when a physics agent, for instance, connects to the persistence agent, the empty physics agent will load everything from the persistence agent. You set up null storage (in-memory only storage of region contents) by changing

OpenSim.ini:
    [Startup]
    storage_plugin="OpenSim.Data.Null.dll"
    storage_connection_string=""

and config-include/GridCommon.ini:
    [DatabaseService]
    StorageProvider="OpenSim.Data.Null.dll"
    ConnectionString=""
    EstateConnectionString=""

The configuration should have a database (MySQL, ...) on the persistence server, and the physics, script and client manager actors would have the null storage. Then, when you load the OAR file on the persistence server, it will be distributed to the other actors. That should fix the things-getting-deleted problem.
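With several actors running, it is easy for one of them to keep a real database by accident. A minimal Python sketch for checking the layout described above (a real database only on the persistence actor, OpenSim.Data.Null.dll everywhere else) follows. It assumes Python's `configparser` can read your files, which is not guaranteed: OpenSim reads them with Nini and they use include directives, so treat this as a convenience check, not a validator. The actor names and sample INI text are hypothetical.

```python
import configparser

NULL_PLUGIN = "OpenSim.Data.Null.dll"

# The two settings named above: OpenSim.ini [Startup] and GridCommon.ini [DatabaseService].
SETTINGS = [("Startup", "storage_plugin"), ("DatabaseService", "StorageProvider")]


def check_actor(actor: str, ini_text: str, is_persistence: bool) -> list:
    """Return a list of problems with one actor's storage settings (empty list = OK)."""
    # strict=False tolerates repeated sections/keys; interpolation=None leaves
    # values containing ${...} or % untouched. Nini-only syntax may still fail.
    cfg = configparser.ConfigParser(strict=False, interpolation=None)
    cfg.read_string(ini_text)
    problems = []
    for section, key in SETTINGS:
        value = cfg.get(section, key, fallback="").strip('"')
        uses_null = value == NULL_PLUGIN
        if is_persistence and uses_null:
            problems.append(f"{actor}: [{section}] {key} should point at a real database plugin")
        elif not is_persistence and not uses_null:
            problems.append(f"{actor}: [{section}] {key} = {value!r}, expected {NULL_PLUGIN!r}")
    return problems


# Hypothetical physics actor: OpenSim.ini and GridCommon.ini settings concatenated.
physics_ini = """
[Startup]
storage_plugin = "OpenSim.Data.Null.dll"
storage_connection_string = ""

[DatabaseService]
StorageProvider = "OpenSim.Data.Null.dll"
ConnectionString = ""
EstateConnectionString = ""
"""

# Hypothetical client manager that still has its own SQLite database configured.
client_ini = """
[Startup]
storage_plugin = "OpenSim.Data.SQLite.dll"
"""

print(check_actor("physics", physics_ini, is_persistence=False))  # []
for problem in check_actor("client1", client_ini, is_persistence=False):
    print(problem)
```

A missing setting is reported the same way as a wrong one, since a non-persistence actor that silently falls back to a local database is exactly the case this check is meant to catch.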

The "an update thread is already running" is a common occurrence when starting up. It says that the synchronization update thread was already busy sending updates when the next simulator heartbeat happened. It is non-fatal and we are, at this very moment, working on speeding that processing up.
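A rough Python model of the situation described above, assuming a simple non-blocking lock guards the update sender (the class, method and attribute names are hypothetical; only the warning text comes from the log): each heartbeat tries to start a send, and if the previous send is still running, it logs the warning and skips that tick rather than starting a second sender. In this model the warning is noisy but harmless, because the skipped updates are picked up by a later heartbeat.

```python
import logging
import threading

log = logging.getLogger("REGION SYNC MODULE")


class UpdateSender:
    """Hypothetical model of the guard behind 'An update thread is already running'."""

    def __init__(self, send_batch):
        self._busy = threading.Lock()
        self._send_batch = send_batch  # callable that pushes queued updates to peers
        self.worker = None             # most recently started send thread

    def sync_out_updates(self) -> bool:
        """Called once per heartbeat; returns False when this tick is skipped."""
        if not self._busy.acquire(blocking=False):
            # Previous send is still in progress: warn and let a later heartbeat retry.
            log.warning("SyncOutUpdates(): An update thread is already running.")
            return False
        self.worker = threading.Thread(target=self._run, daemon=True)
        self.worker.start()
        return True

    def _run(self) -> None:
        try:
            self._send_batch()
        finally:
            self._busy.release()  # a plain Lock may be released by another thread


# Usage: a send that blocks until we allow it, so the timing is deterministic.
release = threading.Event()
sender = UpdateSender(release.wait)
ticks = [sender.sync_out_updates() for _ in range(3)]  # three heartbeats during one send
print(ticks)  # [True, False, False]

release.set()         # let the slow send finish
sender.worker.join()
again = sender.sync_out_updates()
print(again)  # True: the next heartbeat can send again
```

Skipping instead of queueing keeps at most one sender busy per actor; the cost is that a slow send delays everyone's updates, which matches the startup jitter reported earlier in the thread.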

I have not seen the 90% and 100% CPUs with just one or two avatars. I am most interested in hearing if that is still happening when you try the null storage configuration mentioned above.

Hope that helps a little.

socialconcept-dev commented 11 years ago

It helped a great deal :)

It most certainly filled in a lot of blanks, and I do appreciate it. I committed most of the weekend to DSG and kept a running journal of everything I encountered. Some of it may be of help to you, while other stuff you may already be aware of. Yeah, it’s a little verbose, but that’s better than ambiguous when reporting tech issues.

Here we go…

First some more info that might be helpful:

Got everything up and running according to your instructions. Started doing a few tests...

-- Everything hangs if you log in to a DSG region as your default location. The Scene Persistence and Physics agent processes are each pinned at 65% to 80% of a single core until I log out.

Correction… If you wait 4 to 5 minutes, you will eventually render, and so will the region. It consumes a whack of CPU in the process, but it will ‘eventually’ render. Not sure what would happen if a bunch of AVs logged in at one time.

Workaround... Region loading time is BLAZING fast if you TP into the Client Manager Simulator from a non-DSG region.

--- The Scene Persistence agent, or possibly the Physics agent, appears to fall asleep periodically. You try walking and you just slide across the region. During this time, the Scene Persistence agent shows as idle at the console. Eventually it wakes up and you begin walking again. It seems to be more of an issue when more than one AV is on the region. Both agents should ‘always’ show some activity if there is movement of any type on the region.

Correction… Once AV2 has been on the region for 5 minutes, and/or TPs between a DSG and a non-DSG region a few times, this problem disappears.

--- Terrain textures are lost when loading an OAR into the Client Management Simulator, so now you have the default terrain colors. Terrain light setting defaults to daytime. So much for my moonlight builds ):

--- With the other DSG agents set to null storage, these simulators require the name of the master AV (the owner of the region) to be entered manually each time they’re restarted. They do not save it, probably because no DB is present. I wonder if there is a way around this.

--- The region name is intact, but the parcel description is lost on DSG regions. Parcel options are lost as well:

About Land --> Options --> All settings are lost from the original region. Edit Terrain… Fly… Create Objects… Object entry… And Run scripts are all enabled now. Teleport routing is defaulting to ‘Blocked’.

Possible Workaround… You need to start the Client Manager Simulators with a DB and an original copy/OAR of the region. This is the only way you’ll retain ALL of your settings from the original region. Yeah, you need to deal with a barrage of error messages on startup, but at least the region is fully intact. No idea if this results in a performance hit or not.

Emmm…. Actually that barrage of messages IS a problem. All scripts are wiped out because of this issue again:

ERROR - OpenSim.Region.Framework.Scenes.EventManager [EVENT MANAGER]: Delegate for TriggerObjectBeingRemovedFromScene failed - continuing. Object reference not set to an instance of an object
  at DSG.RegionSync.RegionSyncModule.SendSpecialUpdateToRelevantSyncConnectors (System.String init_actorID, System.String logReason, UUID sendingUUID, DSG.RegionSync.SymmetricSyncMessage syncMsg) [0x00000] in :0
  at DSG.RegionSync.RegionSyncModule.OnObjectBeingRemovedFromScene (OpenSim.Region.Framework.Scenes.SceneObjectGroup sog) [0x00000] in :0
  at OpenSim.Region.Framework.Scenes.EventManager.TriggerObjectBeingRemovedFromScene (OpenSim.Region.Framework.Scenes.SceneObjectGroup obj) [0x00000] in :0

Yep… It blasted all the scripts ):

If you try to use a poseball, you see these messages:

[SYNC INFO BASE]: UpdatePropertiesBySync: Error in updating property IsColliding: Object reference not set to an instance of an object

I am utterly LOST here… If we set all agents except the Scene Persistence agent to OpenSim.Data.Null.dll, AND we disable XEngine on the Persistence Simulator, then how do we actually get the scripts loaded into the client management agent in the first place? How would the XEngine agent actually receive them?

I re-enabled the XEngine process in a final attempt to get scripts working, but no luck as of yet.

--- If you shut other agents down, or attempt to restart them, the Scene Persistence Simulator will pin at 100% of a single core. Actually… if you try to restart any of the other agents ‘without’ restarting the Scene Persistence Simulator as well, the Scene Persistence Simulator will drive CPU load up to anywhere from 100% to 400%.

Here’s how… As I shut down each agent, the Scene Persistence Simulator consumes another core. So… Shut down the Client Management Simulator… 100%. Shut down Physics… 200%. Shut down XEngine… 300%. Shut down the Client Management Simulator… 400%, or 4 cores fully utilized until the Scene Persistence Simulator is shut down. Obviously, you can’t restart a single agent; you must restart them all. And you must ‘manually’ log in to all of them with the Master AV name again.

Beware… If you shut down one agent, you commit to shutting them all down and fast! The CPU load will quickly rise to a point where the entire VPS becomes unresponsive.

One obvious concern here is the unpredictable amount and duration of CPU used for the login process. The Scene Persistence, Physics, and XEngine agents can each consume 50% to 150% (1.5 CPU cores) for the first 1 to 5 minutes after an avatar logs in. No idea why or how it fluctuates this badly. These tests were carried out with near-Ruth-style AVs.

On a side note… Too bad we could not cache the entire scene for each AV. The only thing the viewers should be updating is new avatars upon arrival to a region, not the entire region. If they’re ‘returning’ guests, why call for a reload of the entire scene each time they return? Insane amounts of unnecessary overhead… OpenSim actually used to cache it this way. You’d return to a region and it would appear in a flash, with no reloading. Why this was changed is anyone’s guess.

The real problem is if 10, 20, 30 or more people began entering the region simultaneously. At this rate, I'd need 60 CPU cores to ensure I have enough resources for the scene loading of multiple inbound guests, lol

--- Attachments, such as hair, randomly disappear from AVs on the DSG region, and not at the moment you TP in. It’s like suddenly… the hair just vanishes. In other cases, the AVs are missing their necks: head and torso fully intact, but no neck.

TP’ing back and forth between a DSG and a non-DSG region will sometimes fix this, but I observed nothing consistent here.

--- 18 hours later….

Tried logging into the DSG region. Scene loading was very fast. Unfortunately, the AV froze again. It’s trying to walk, but can’t. Right foot goes forward. And as expected, the hair flew off about 1 minute later. TP’ing out of the region is no problem… TP back to the DSG region and you can walk again.

--- Multiple AV test on a DSG region.

Worked well until the 4th AV hit the region. At this point, the Persistence Agent spiked the CPU to 180% or almost 2 cores. It took 10 minutes for it to settle down. All seemed well, so I started moving the avatars around. After about 30 seconds, all AV’s were frozen. Again, it looks as if Physics has gone to sleep. Both the Physics and Persistence agent are idle at this point –no CPU activity.

Waited an hour… Finally logged all 4 of them out. Persistence Console is flooded with the following messages with no end in sight:

[SYNC INFO MANAGER]: UpdateSyncInfoBySync SyncInfo for 6df30b30-0034-11e2-8f61-aa00f1792c01 NOT FOUND.

Hmm... CPU now reaching 150% and rapidly climbing. Had to manually kill the persistence process.

Most everything I’ve mentioned here I was able to duplicate at least twice. Thing is… A lot of the problem behaviors are exceptionally randomized, thus making it difficult to determine the difference between a problem, versus some odd behavior, which will go away if the wind changes directions.

I am committed to this project. Just let me know what else you want me to do :)

Much thanks again.

socialconcept-dev commented 11 years ago

I’ve tried a bunch more variations of the XEngine agent. I can confirm that most interactive scripts are not being sent to the Client Management Simulator. For example, scripts that do not require user interaction, such as snow and fire, work. Doors, dance balls, etc. do not work. All of them function as expected on the Scene Persistence Simulator.

kittyfly commented 11 years ago

In a DSG setup, users are only supposed to log in to client managers. They are called client managers because they deal with client communications. The script engine takes care of script execution, the physics engine carries out physics simulation, and the underlying DSG sync protocol (more appropriately, the update propagation protocol) propagates the state updates made by each engine and glues them together for consistent state across the engines. The updates are then sent by client managers to the viewers.

As we said, we should have better documentation to explain the big picture. Will do so soon.

Interactive scripts work, as we have tested internally and publicly. There might be some configuration issues in your setup. The persistence engine (Scene Persistence Simulator) is not supposed to run scripts, unless the script engine is configured to be co-located with persistence, in which case there should not be another script engine. Not sure what you really meant by "scripts not sent to client managers": no scripts should be sent to client managers. Rather, when the script engine executes the scripts, any state changes made by the scripts, such as the changed position, scale, or color of an object, will be sent to client managers, which then forward them on to viewers.
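A toy Python sketch of the division of labor described above, assuming a door script whose only effect is a single rotation change. The class names, `door_script`, and the property name `rotation_deg` are all hypothetical, and the direct function call stands in for the DSG sync protocol. The point it shows: the script lives and runs only on the script engine actor, and the client manager only ever receives the resulting property updates, which it applies and queues for viewers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PropertyUpdate:
    """One changed property of one object (hypothetical message shape)."""
    obj_id: str
    prop: str
    value: object


class ScriptEngineActor:
    """Hypothetical: holds and runs scripts; only their effects leave this actor."""

    def __init__(self):
        self.scripts = {}  # obj_id -> callable(state) -> dict of changed properties

    def run(self, obj_id: str, state: dict) -> list:
        changes = self.scripts[obj_id](state)
        return [PropertyUpdate(obj_id, p, v) for p, v in changes.items()]


class ClientManagerActor:
    """Hypothetical: has no scripts; applies updates and relays them to viewers."""

    def __init__(self):
        self.scene = {}
        self.viewer_outbox = []  # updates waiting to be sent to connected viewers

    def on_sync_update(self, upd: PropertyUpdate) -> None:
        self.scene.setdefault(upd.obj_id, {})[upd.prop] = upd.value
        self.viewer_outbox.append(upd)


def door_script(state: dict) -> dict:
    # Toggles the door between closed (0 degrees) and open (90 degrees).
    return {"rotation_deg": 0 if state.get("rotation_deg") == 90 else 90}


engine = ScriptEngineActor()
engine.scripts["door-1"] = door_script
client = ClientManagerActor()

for upd in engine.run("door-1", {"rotation_deg": 0}):
    client.on_sync_update(upd)  # in DSG this hop goes through the sync protocol

print(client.scene)  # {'door-1': {'rotation_deg': 90}}
```

Nothing in `ClientManagerActor` knows about scripts, so checking a client manager for script contents would find none even when the scripts are working.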

kittyfly commented 11 years ago

Please take a look at the wiki page. We updated it with some instructions on how to configure and run a DSG system.