mesos / storm

Storm on Mesos!
Apache License 2.0

#202 - clojure->java Supervisor, fix SuicideDetector to use _taskAssignments #204

Closed michaelmoss closed 7 years ago

michaelmoss commented 7 years ago

Fixes: #202

erikdw commented 7 years ago

@michaelmoss : thanks for the contribution! Have you verified that this works OK when you rebalance and the supervisor loses all of its workers and then needs to kill itself?

michaelmoss commented 7 years ago

Thanks, @erikdw.

Test scenario: 3 workers, 2 on host A and 1 on host B.

- kill -9 of the two workers on host A: verified that the supervisor restarted those 2 workers on the same host.
- kill -9 of the supervisor on host A: verified that its workers immediately died with it, and that the supervisor and workers were respawned on the same node.
- Rebalance from the Storm UI: verified that all 3 workers and their supervisors were moved to different hosts, and confirmed that orphaned supervisors committed suicide (see the sketch below).
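
For context, here is a minimal illustrative sketch of the suicide-on-no-assignments idea being verified in that last step. This is not the actual mesos/storm SuicideDetector code; the names timeoutSecs and checkIntervalSecs, and the local map standing in for _taskAssignments, are made up here.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class SuicideDetectorSketch {
    // Stand-in for the plugin's _taskAssignments bookkeeping (port -> task id).
    private final Map<Integer, String> _taskAssignments = new ConcurrentHashMap<>();
    private volatile long lastNonEmptyMs = System.currentTimeMillis();

    public void start(long timeoutSecs, long checkIntervalSecs) {
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        timer.scheduleAtFixedRate(() -> {
            long now = System.currentTimeMillis();
            if (!_taskAssignments.isEmpty()) {
                // Still have assignments: reset the orphan clock.
                lastNonEmptyMs = now;
            } else if (now - lastNonEmptyMs > TimeUnit.SECONDS.toMillis(timeoutSecs)) {
                // No assignments for longer than the timeout: this supervisor is
                // orphaned (e.g. a rebalance moved its workers away), so exit.
                Runtime.getRuntime().halt(1);
            }
        }, checkIntervalSecs, checkIntervalSecs, TimeUnit.SECONDS);
    }
}
```

The "fix SuicideDetector to use _taskAssignments" part of the PR title suggests the orphan check should key off the task-assignment state, as in the sketch above.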

It seems that if we merge this to master, we are abandoning support for 1.0.2 and moving everyone to 1.0.3, since these changes are not backwards compatible (supervisor API changes). Is that desired?

michaelmoss commented 7 years ago

@erikdw, do you think the testing was comprehensive enough? Are we ready to merge this change?

erikdw commented 7 years ago

Hey @michaelmoss, sorry I've been busy, and am on PTO today. I'll look next week, but your description of the testing sounds great to me!

erikdw commented 7 years ago

@michaelmoss : another question: we (@JessicaLHartog & @srishtyagrawal & me) are ready to merge #200, and I wonder if you'd like to have that in before your change or after? We should be able to easily pop out 2 releases (though I haven't run the release process in a while, and it's historically encountered random foibles as external dependencies change), so I'm open to either order.

michaelmoss commented 7 years ago

Having #200 go in after my change would definitely be easier for me to reason about. I'm going to make your requested changes this morning and retest. I'm happy to then put #200 on top of those and retest as well.

michaelmoss commented 7 years ago

@erikdw , I added the requested changes and confirmed the previously described test scenarios still work.

Let me know if you are cool with using reflection to call the private method 'launchDaemon'.
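
(For reference, a minimal sketch of that kind of reflection call, assuming launchDaemon is a private, no-argument instance method declared on the supervisor class itself; the real parameter list may differ.)

```java
import java.lang.reflect.Method;

public class LaunchDaemonViaReflection {
    // Invoke a private no-arg method named "launchDaemon" on the given object.
    public static void callLaunchDaemon(Object supervisor) throws Exception {
        // getDeclaredMethod only searches the exact class, not its superclasses.
        Method launchDaemon = supervisor.getClass().getDeclaredMethod("launchDaemon");
        launchDaemon.setAccessible(true);  // bypass the private access modifier
        launchDaemon.invoke(supervisor);
    }
}
```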

Side note: I noticed that on nearly every deployment, extra supervisors start up, and start up workers which die within 1-3 minutes with code 20 (see below). I must not have noticed this before. I haven't back-tested this against 1.0.2, but it's happening across my commits in this PR. Have you seen this?

2017-08-08 18:09:18.504 o.a.s.d.s.Slot [INFO] STATE WAITING_FOR_WORKER_START msInState: 10032 topo:foo-1502215662 worker:e0abb56d-e8ff-47f2-92bc-32f646a7d801 -> RUNNING msInState: 1 topo:topo:foo-2-1502215662 worker:e0abb56d-e8ff-47f2-92bc-32f646a7d801
2017-08-08 18:12:39.803 o.a.s.d.s.Slot [WARN] SLOT 31001: Assignment Changed from LocalAssignment(topology_id:topo:foo-2-1502215662, executors:[ExecutorInfo(task_start:6, task_end:6), ExecutorInfo(task_start:3, task_end:3)], resources:WorkerResources(mem_on_heap:0.0, mem_off_heap:0.0, cpu:0.0)) to null
2017-08-08 18:12:39.804 s.m.MesosSupervisor [INFO] killedWorker: executor topo:foo-2-1502215662 removing port 31001 assignment and sending TASK_FINISHED update to Mesos
2017-08-08 18:12:39.805 o.a.s.d.s.Container [INFO] Killing 10.122.191.17:e0abb56d-e8ff-47f2-92bc-32f646a7d801
2017-08-08 18:12:41.253 o.a.s.d.s.BasicContainer [INFO] Worker Process e0abb56d-e8ff-47f2-92bc-32f646a7d801 exited with code: 20

erikdw commented 7 years ago

extra supervisors start up, and start up workers which die within 1-3 minutes with code 20 (see below).

@michaelmoss : hmm... that's not expected, I haven't seen that before. You're sure this isn't somehow topology-specific?

erikdw commented 7 years ago

(storm) [v1.0.3] % git grep -w 20 | grep -i -e exit -e halt
storm-core/src/clj/org/apache/storm/daemon/logviewer.clj:                                               (exit-process! 20 "Error when doing log cleanup")))
storm-core/src/clj/org/apache/storm/daemon/nimbus.clj:                                 (exit-process! 20 "Error when processing an event")
storm-core/src/clj/org/apache/storm/daemon/worker.clj:                       (exit-process! 20 "Error when processing an event")
storm-core/src/clj/org/apache/storm/util.clj:                                                    (.halt (Runtime/getRuntime) 20)))))
storm-core/src/jvm/org/apache/storm/daemon/supervisor/DefaultUncaughtExceptionHandler.java:        Utils.exitProcess(20, "Error when processing an event");
storm-core/src/jvm/org/apache/storm/daemon/supervisor/Slot.java:                Utils.exitProcess(20, "Error when processing an event");
storm-core/src/jvm/org/apache/storm/event/EventManagerImp.java:                            Utils.exitProcess(20, "Error when processing an event");
storm-core/src/jvm/org/apache/storm/utils/Utils.java:                    Runtime.getRuntime().halt(20);

Seems there are various circumstances where Storm will exit with 20. Have you looked in the worker logs & worker stderr/stdout files?
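
For what it's worth, the pattern those grep hits point at looks roughly like the sketch below: Storm daemons install an uncaught-exception handler (or event-manager error path) that exits the JVM with code 20, so a worker dying with code 20 generally means "unhandled error while processing an event" rather than anything Mesos-specific. This is illustrative only, mirroring the Utils.exitProcess(20, ...) and halt(20) calls shown above, not Storm's actual classes.

```java
public class ExitCode20Sketch {
    public static void install() {
        Thread.setDefaultUncaughtExceptionHandler((thread, error) -> {
            // Roughly what Utils.exitProcess(20, "Error when processing an event")
            // amounts to: report the failure, then halt the JVM with exit code 20.
            System.err.println("Error when processing an event: " + error);
            Runtime.getRuntime().halt(20);
        });
    }
}
```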

erikdw commented 7 years ago

Actually, I shouldn't speak so fast. It's likely that exit code 20 is in fact normal and expected -- I haven't spent much time at all working with Storm 1.0+, so my knowledge of the Storm Supervisor logs is 0.9.x specific.

The logs actually are saying that the nimbus created a new slot assignment. Have you looked in the nimbus logs to see why it thought the topology wasn't running?

Finally, "extra supervisors" sounds slightly terrifying to me, and makes me want to spend time testing this myself, which will require me setting aside time I don't have at this exact moment. Would have to wait for next week most likely.

michaelmoss commented 7 years ago

Very helpful, thank you. Didn't see anything, but am looking into it now and will post back today hopefully.

Yes, confirmed launchDaemon worked.

michaelmoss commented 7 years ago

The best I can tell, if there should be x supervisors, there are still only x active at a time; but, for example, on a deployment requiring 3 supervisors, a total of 5 might have been created, with 2 of them exiting with code 20 - eventually only 3 supervisors correctly appear in the UI and Exhibitor. It could just be something non-deterministic in the topology itself that isn't logging cleanly.

erikdw commented 7 years ago

How many worker hosts do you have? There should only ever be 1 supervisor at a time per topology on a given worker host.

michaelmoss commented 7 years ago

We have 50-100 very large hosts, each of which can host 10s-100s of topologies. I confirmed that even when a single host gets more than one worker of a given topology, it only has a single supervisor process managing them.

I downgraded back to 1.0.2 and confirmed I didn't see this behavior. Then, when I upgraded back to 1.0.3 (to compare against the nimbus/worker logs I had gathered), I didn't see this behavior either.

I did notice that the workers which were dying before were having issues connecting to the other components in the topologies and I think this is why they were dying. I'll post back if I see this again.

If this PR looks good, would love to get it merged in. Thanks again for everything.

erikdw commented 7 years ago

I did notice that the workers which were dying before were having issues connecting to the other components in the topologies and I think this is why they were dying. I'll post back if I see this again.

Ahh... there can be numerous reasons for this.

e.g., one corner case we've hit a few times has been a worker getting "assigned" to a host but the port not actually being available. There are bugs in the kernel version we are running, w.r.t. cgroups, which sometimes leave unkillable zombie workers with the worker port still bound to them. That is a problem because Mesos thinks the port is free, and thus Nimbus can assign workers to unavailable ports.
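
(A minimal sketch, not mesos/storm code, of one way to guard against that case: probe whether an offered worker port is actually bindable before treating it as free, which would catch a zombie worker still holding the port. The class and method names here are hypothetical.)

```java
import java.io.IOException;
import java.net.ServerSocket;

public class PortProbe {
    // Returns true only if nothing on this host currently holds the port.
    public static boolean isBindable(int port) {
        try (ServerSocket probe = new ServerSocket(port)) {
            return true;   // bind succeeded, so the port is genuinely free
        } catch (IOException e) {
            return false;  // something (e.g. an unkillable zombie worker) still owns it
        }
    }
}
```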

If you're confident this isn't a fundamental problem with our change in approach here, then I'm ok merging it.

michaelmoss commented 7 years ago

Cool. Based on this not being 100% reproducible, and the fact that your description jibes with what I've seen before as well, I'm 95% sure this issue is completely unrelated, and that we should merge this.

erikdw commented 7 years ago

@michaelmoss : I wanna make the travis-ci builds work before merging your change. Here's a PR where I pushed an attempt to make it work:

erikdw commented 7 years ago

#207 fixed the build (though we'll see if the release process works too, might have more gotchas there). Once @JessicaLHartog gives me the OK, I'll merge #207 and then I should be able to merge this one.

JessicaLHartog commented 7 years ago

#207 has been merged. I tested the CI process with #205, which worked flawlessly and was then merged. (#205 was a pretty innocuous change, which shouldn't impact this on rebase.)

After that I rebased to update #200 and the checks passed there too.

@michaelmoss if you could pull down the changes from master and rebase here, then we can merge this, which will unblock me on #200. Thanks!

michaelmoss commented 7 years ago

Thanks, @JessicaLHartog. I committed the rebased branch to my fork (which is the source of this PR). I see some CI changes - is that expected? Everything look good?

erikdw commented 7 years ago

Thanks @michaelmoss, all the PRs are looking good in CI recently:

I wanna do a light sanity check with your code, but haven't had any time today (dealing with some HR & IT work issues). Hopefully later tonight.

erikdw commented 7 years ago

I also need to make time to fix and test the calling of ISupervisor.assigned(). It's possible that could get into a Storm release very soon because of this bug fix needing to get pushed out:
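
For anyone following along: ISupervisor.assigned() is the Storm supervisor-plugin callback that reports the ports currently assigned to this supervisor. Below is an illustrative sketch of the bookkeeping such a callback would typically drive; the field name _assignedPorts is hypothetical, and the exact callback signature should be checked against the Storm 1.x ISupervisor interface.

```java
import java.util.Collection;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class AssignedPortsSketch {
    private final Set<Integer> _assignedPorts = ConcurrentHashMap.newKeySet();

    // Mirrors the shape of an assigned(Collection<Integer> ports) callback:
    // reconcile our local view of worker ports with what Storm says we own.
    public void assigned(Collection<Integer> ports) {
        if (ports == null) {
            return;  // defensive: treat a null update as "no change"
        }
        _assignedPorts.retainAll(ports);  // drop ports no longer assigned to us
        _assignedPorts.addAll(ports);     // record newly assigned ports
    }
}
```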

erikdw commented 7 years ago

Fixed in #208, so closing this.