pongasoft / glu

Deployment Automation Platform
Apache License 2.0

A bad agent should not bring the console down #262

Closed ypujante closed 10 years ago

ypujante commented 10 years ago

Follow up from this thread: http://glu.977617.n3.nabble.com/Urgent-Need-help-td4026195.html

JackKora commented 10 years ago

What would be helpful is the following behavior on the console:

  1. Handle the exception from the bad agent(s) and still continue to work otherwise.
  2. Log the exception (this already happens) but also log the agent that sent it. Then it’s trivial to go and dump the agent’s cache, look at its logs, etc.
  3. In the UI highlight the bad agent somehow. Maybe in the agents tab, maybe in the main view. This will help with inspecting the overall fabric for bad agents.
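
A minimal sketch of the three points above, written in Python only for brevity (it is not glu code). The names `parse_agent_state` and `load_fabric_view`, and the JSON payload with a `mountPoints` key, are hypothetical and stand in for whatever the console actually reads from each agent.

```python
import json
import logging

logger = logging.getLogger("console")


def parse_agent_state(raw):
    """Hypothetical parser: expects a JSON object with a 'mountPoints' key."""
    state = json.loads(raw)
    if not isinstance(state, dict) or "mountPoints" not in state:
        raise ValueError("unexpected agent state format")
    return state


def load_fabric_view(raw_states):
    """Parse every agent's payload; one bad agent must not abort the loop.

    raw_states maps agent name -> raw payload string.
    Returns (good, bad): parsed states, and error text keyed by agent name.
    """
    good, bad = {}, {}
    for agent, raw in raw_states.items():
        try:
            good[agent] = parse_agent_state(raw)
        except Exception as exc:  # deliberately broad: isolate any agent failure
            # Point 2: log the exception together with the agent that sent it.
            logger.exception("invalid state from agent %s", agent)
            # Point 3: remember the agent so a UI could highlight it.
            bad[agent] = f"{type(exc).__name__}: {exc}"
    return good, bad
```

Calling `load_fabric_view` with one valid and one malformed payload returns the valid agent in `good` and the malformed one in `bad` instead of raising, so the rest of the fabric stays usable.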

Agents can go bad every once in a while for various reasons – in software development bugs happen :) But if the console handles it gracefully then very little harm is done. Thanks again for your help!

ypujante commented 10 years ago

I agree. Note that the fix will go in the next version of glu (5.5.x).

ypujante commented 10 years ago

I was able to reproduce the issue on my machine. I do not yet know what the problem is, but I am investigating...

ypujante commented 10 years ago

Fixed in 4.7.3 and 5.5.1

JackKora commented 10 years ago

Thank you! And thanks for patching the 4.x line.

ypujante commented 10 years ago

@ykorabelnikov you are welcome.

My understanding is that the bug happened because, during the upgrade:

  1. The agent stops and restarts.
  2. When it restarts, it needs to re-instantiate the glu script.
  3. Prior to 4.6.2, the glu script was not stored locally and was fetched from its original location (as defined in the glu model).
  4. If the original location is not accessible, glu cannot re-instantiate the glu script and simply ignores this entry (the agent itself is fine).
  5. After booting, the agent synchronizes the filesystem with ZooKeeper (the "Syncing filesystem <=> ZooKeeper" message in the agent log).
  6. This step was blindly loading all the states from the filesystem (which are Java serialized objects) and storing them in ZooKeeper as JSON objects.
  7. The format of the file changed in 4.7.1, so because of steps 4 and 6 you end up with the wrong format in ZooKeeper for the states that were ignored in step 4.
  8. The console then receives this invalid data and fails.

What I did to fix the issue:

  1. During the boot process, the agent will upgrade the old format to the new format.
  2. If a state cannot be restored, it is moved to a separate location and a "dummy" InvalidStateScript is instantiated, so that it appears in the console with the proper stack trace and you can identify what the problem is.
  3. On the console side, it no longer fails if one entry cannot be read, but instead also instantiates a dummy one so that the failure is not silent.
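
The agent-side part of the fix (upgrade on boot, quarantine what cannot be restored) can be sketched roughly as follows. This is an illustration in Python, not the actual glu implementation: it assumes state files are JSON with a `version` field, whereas the real agent stores Java serialized objects, and the names `upgrade_state`, `restore_states`, `InvalidStatePlaceholder`, and the quarantine directory are all hypothetical.

```python
import json
import shutil
import traceback
from dataclasses import dataclass
from pathlib import Path

CURRENT_VERSION = 2  # assumed version number for the "new" format


@dataclass
class InvalidStatePlaceholder:
    """Stand-in for an entry that could not be restored (cf. InvalidStateScript)."""
    mount_point: str
    stack_trace: str


def upgrade_state(state):
    """Convert the assumed old layout (no 'version' key) to the new one."""
    if not isinstance(state, dict):
        raise ValueError("state is not an object")
    if state.get("version") == CURRENT_VERSION:
        return state
    return {"version": CURRENT_VERSION, "script": state["script"]}


def restore_states(state_dir, invalid_dir):
    """Restore every state file; quarantine the ones that fail."""
    state_dir, invalid_dir = Path(state_dir), Path(invalid_dir)
    invalid_dir.mkdir(parents=True, exist_ok=True)
    restored = {}
    for path in sorted(state_dir.glob("*.json")):
        try:
            # Upgrade old format to new format during boot.
            restored[path.stem] = upgrade_state(json.loads(path.read_text()))
        except Exception:
            # Move the unreadable state aside and keep a visible placeholder
            # that carries the stack trace, rather than dropping it silently.
            shutil.move(str(path), str(invalid_dir / path.name))
            restored[path.stem] = InvalidStatePlaceholder(
                path.stem, traceback.format_exc()
            )
    return restored
```

The design point is that a failed entry is neither dropped nor published in the wrong format: it leaves the normal state directory, and a placeholder carrying the stack trace takes its place so the problem stays visible.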

Technically the console should not see this problem because of the fix in the agent, but if you use the new console with an old agent, then at least the console will "survive" and continue to be operable.

Both 4.7.3 and 5.5.1 have those fixes.