mesos / storm

Storm on Mesos!
Apache License 2.0

Storm v1.0.3 Support - MesosSupervisor Committing Suicide #202

Closed michaelmoss closed 7 years ago

michaelmoss commented 7 years ago

Hi, All. Tremendous work in getting Storm 1.x support with this framework. So far so good with 1.0.2; thank you everyone for your contributions.

I've attempted to drop in Storm 1.0.3 and ran into an issue which I think is ultimately related to a refactor in the Storm project that converted some Clojure code to Java, specifically the Supervisor (https://issues.apache.org/jira/browse/STORM-2018).

What's happening is that after upgrading to Storm 1.0.3 the "assigned" method is no longer being called. As a result, the MesosSupervisor is not updating its internal view of '_supervisorViewOfAssignedPorts' and is committing suicide.

I patched this locally by calling the 'assigned' method from the 'confirmAssigned' method, which gets called at the same frequency as 1.0.2 was calling assigned. This fixes the issue and I can put in a PR for this, but I'm wondering if folks have some context as to why assigned is no longer being called and whether this change could have unintended consequences.
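
For readers who want to picture the workaround, here is a minimal sketch of forwarding from 'confirmAssigned' to 'assigned'. It is not the actual patch: the class `ForwardingSupervisorSketch`, the `launchedPorts` set, and the `launchTask` and `view` helpers are hypothetical names invented for illustration, and which ports the real patch passed to 'assigned' is not stated in this thread.

```java
import java.util.Collection;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in; NOT the real storm.mesos.MesosSupervisor.
class ForwardingSupervisorSketch {
    // Assumption: ports for which this supervisor currently holds a task.
    private final Set<Integer> launchedPorts = ConcurrentHashMap.newKeySet();

    // Plays the role that _supervisorViewOfAssignedPorts plays in the thread.
    private volatile Set<Integer> supervisorViewOfAssignedPorts = Collections.emptySet();

    // Hypothetical helper: record that a task was launched on a port.
    void launchTask(int port) {
        launchedPorts.add(port);
    }

    // Still invoked by the supervisor daemon in 1.0.3, per the report.
    boolean confirmAssigned(int port) {
        boolean known = launchedPorts.contains(port);
        // Workaround: refresh the view here because assigned() is no longer
        // called by the daemon. Passing launchedPorts is an assumption.
        assigned(launchedPorts);
        return known;
    }

    // Stores an immutable snapshot of the ports it was given.
    void assigned(Collection<Integer> ports) {
        supervisorViewOfAssignedPorts =
                Collections.unmodifiableSet(new HashSet<>(ports));
    }

    // Hypothetical accessor so the view can be inspected.
    Set<Integer> view() {
        return supervisorViewOfAssignedPorts;
    }
}
```

With this shape, the view is refreshed on every 'confirmAssigned' call, so it tracks whatever the supervisor itself believes is assigned rather than what the daemon reports.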

Here are some stack traces of the calls in 1.0.2 vs 1.0.3 (I just threw exceptions in the code and printed the trace):

1.0.2:

2017-07-21 21:51:15.616 s.m.MesosSupervisor [INFO] confirmAssigned, stack = java.lang.Throwable
    at storm.mesos.MesosSupervisor.confirmAssigned(MesosSupervisor.java:114)
    at org.apache.storm.daemon.supervisor$mk_synchronize_supervisor$this__9163$fn__9167.invoke(supervisor.clj:548)
    at org.apache.storm.util$filter_key$fn__469.invoke(util.clj:318)
    at clojure.core$filter$fn__4580.invoke(core.clj:2690)
    at clojure.lang.LazySeq.sval(LazySeq.java:40)
    at clojure.lang.LazySeq.seq(LazySeq.java:49)
    at clojure.lang.RT.seq(RT.java:507)
    at clojure.core$seq__4128.invoke(core.clj:137)
    at clojure.core.protocols$seq_reduce.invoke(protocols.clj:30)
    at clojure.core.protocols$fn__6506.invoke(protocols.clj:101)
    at clojure.core.protocols$fn__6452$G__6447__6465.invoke(protocols.clj:13)
    at clojure.core$reduce.invoke(core.clj:6519)
    at clojure.core$into.invoke(core.clj:6600)
    at org.apache.storm.util$filter_key.invoke(util.clj:318)
    at org.apache.storm.daemon.supervisor$mk_synchronize_supervisor$this__9163.invoke(supervisor.clj:548)
    at org.apache.storm.event$event_manager$fn__8735.invoke(event.clj:40)
    at clojure.lang.AFn.run(AFn.java:22)
    at java.lang.Thread.run(Thread.java:745)

2017-07-21 21:51:15.617 s.m.MesosSupervisor [INFO] assigned = [31001]
2017-07-21 21:51:15.618 s.m.MesosSupervisor [INFO] assigned, stack = java.lang.Throwable
    at storm.mesos.MesosSupervisor.assigned(MesosSupervisor.java:72)
    at org.apache.storm.daemon.supervisor$mk_synchronize_supervisor$this__9163.invoke(supervisor.clj:592)
    at org.apache.storm.event$event_manager$fn__8735.invoke(event.clj:40)
    at clojure.lang.AFn.run(AFn.java:22)
    at java.lang.Thread.run(Thread.java:745)

1.0.3 (with my patch where confirmAssigned calls assigned):

2017-07-21 22:06:12.003 o.a.s.d.s.Supervisor [INFO] Starting supervisor with id foo at host alphab-bvlt-r1n89.
2017-07-21 22:06:12.941 s.m.MesosSupervisor [INFO] getMetadata: ports = [31000]
2017-07-21 22:06:13.015 s.m.MesosSupervisor [INFO] confirmAssigned: port = 31000
2017-07-21 22:06:13.017 s.m.MesosSupervisor [INFO] confirmAssigned, stack = java.lang.Throwable
    at storm.mesos.MesosSupervisor.confirmAssigned(MesosSupervisor.java:123)
    at org.apache.storm.daemon.supervisor.ReadClusterState.run(ReadClusterState.java:143)
    at org.apache.storm.event.EventManagerImp$1.run(EventManagerImp.java:54)

2017-07-21 22:06:13.017 s.m.MesosSupervisor [INFO] assigned: ports = 
2017-07-21 22:06:13.017 s.m.MesosSupervisor [INFO] assigned, stack = java.lang.Throwable
    at storm.mesos.MesosSupervisor.assigned(MesosSupervisor.java:80)
    at storm.mesos.MesosSupervisor.confirmAssigned(MesosSupervisor.java:127)
    at org.apache.storm.daemon.supervisor.ReadClusterState.run(ReadClusterState.java:143)
    at org.apache.storm.event.EventManagerImp$1.run(EventManagerImp.java:54)

2017-07-21 22:06:13.020 o.a.s.d.s.Slot [WARN] SLOT alphab-bvlt-r1n89:31000 Starting in state EMPTY - assignment null
2017-07-21 22:06:13.030 o.a.s.d.s.Slot [INFO] STATE EMPTY msInState: 10 -> WAITING_FOR_BASIC_LOCALIZATION msInState: 0
2017-07-21 22:06:13.128 o.a.s.u.NimbusClient [INFO] Found leader nimbus : alphab-bvlt-r1n85:31075
2017-07-21 22:06:13.592 s.m.MesosSupervisor [INFO] SuicideDetector: _supervisorViewOfAssignedPorts = [31000], now = 1500674773592, _lastTime = 1500674768591
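
Traces shaped like the ones above can be captured without changing control flow by constructing a `Throwable` and printing it. The sketch below uses only the JDK; the class `StackDump` and the method `currentStack` are hypothetical names, and whether the traces in this report were produced exactly this way is an assumption.

```java
import java.io.PrintWriter;
import java.io.StringWriter;

// Hypothetical helper for dumping the current call stack as a string.
class StackDump {
    // Returns "<label>, stack = java.lang.Throwable" followed by the frames.
    static String currentStack(String label) {
        StringWriter buffer = new StringWriter();
        // The Throwable is created but never thrown; it only records
        // the call stack at the point of construction.
        new Throwable().printStackTrace(new PrintWriter(buffer, true));
        return label + ", stack = " + buffer;
    }
}
```

Calling `StackDump.currentStack("confirmAssigned")` from inside a method and passing the result to a logger shows which caller reached that method.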

erikdw commented 7 years ago

@michaelmoss : thanks for both the report and doing investigations into the root cause! I haven't looked at the changes they made to storm-core's supervisor yet. I understand the logic pretty well in the old clojure code (for the supervisor at least!) and I understand this storm-mesos code well. So I'll look into it and see if it's acceptable to just accept your proposed PR for using confirmAssigned() instead of assigned(). I don't remember exactly what assigned() was used for at the moment.

michaelmoss commented 7 years ago

Thanks, @erikdw. I'm also attempting to upgrade to 1.1.0 which I will create a separate report for (topologies will not accept any mesos resources).

This has me thinking: is there any value in this project creating separate branches for different versions of Storm? How can I help?

erikdw commented 7 years ago

The report you gave here about the issue in 1.0.3 is a tremendous help already! Only way you could do more for the issue would be if you were totally confident in the behaviors such that you knew for sure that switching from assigned to just confirmAssigned would be ok (I haven't had a chance yet to look into that).

Regarding the 1.1.0 issue, if you could do a similar level of analysis on why the topologies aren't getting any slots that would be awesome.

michaelmoss commented 7 years ago

@erikdw, cool. Will post analysis on the 1.1.0 issue.

What do you think about creating separate branches in this repo to support different versions of storm?

erikdw commented 7 years ago

I really don't wanna create any more branches than we need. We needed to create a separate branch for 1.x versus 0.x because of the package path rename (backtype.storm -> org.apache.storm). Every extra branch means more work for backporting changes, and we are already way too thin on my team to shepherd that.

My perspective on the specific issue here (#202) is that we should be able to make a change to the framework that doesn't break across the different versions of Storm.

I fear that the 1.1.0 issue will end up being another interface breakage, but in a deeper part of the Storm code (we had to reimplement Storm's internal scheduler to avoid issues with unschedulable "large" topologies blocking others from running and also to deal with fragmentation of Offers), and that it will require changes to Storm itself to fix.

erikdw commented 7 years ago

> I patched this locally, by calling the 'assigned' method from the 'confirmAssigned' method which gets called at the same frequency as 1.0.2 was calling assigned. This fixes the issue and I can put in a PR for this, but I'm wondering if folks have some context as to why assigned is no longer being called and that if this change could have unintended consequences.

so... I think this might be ok. The general behavior before:

Behavior proposed for 1.0.3+:

erikdw commented 7 years ago

Actually... I think this won't work: we need to be informed or figure out when there are no workers assigned to this supervisor. That was previously happening when assigned was called with an empty set. We might have to rely on only tracking the content of the _taskAssignments variable and checking that from the suicide-checker. That should work I think.
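
A rough sketch of that idea, assuming the suicide-checker consults a map of task assignments instead of waiting for 'assigned' to be called with an empty set. Everything here is hypothetical and for illustration only: the class `SuicideCheckSketch`, its methods, the port-to-task-id map layout, and the grace period are not taken from the storm-mesos code.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch; NOT the real storm-mesos suicide detector.
class SuicideCheckSketch {
    // Assumption: port -> task id, standing in for _taskAssignments.
    private final Map<Integer, String> taskAssignments = new ConcurrentHashMap<>();
    private final long graceMs;
    // Last time the checker observed at least one assignment.
    private volatile long lastNonEmptyMs;

    SuicideCheckSketch(long graceMs, long nowMs) {
        this.graceMs = graceMs;
        this.lastNonEmptyMs = nowMs;
    }

    void taskLaunched(int port, String taskId) {
        taskAssignments.put(port, taskId);
    }

    void taskKilled(int port) {
        taskAssignments.remove(port);
    }

    // True once no tasks have been observed for longer than the grace period.
    boolean shouldExit(long nowMs) {
        if (!taskAssignments.isEmpty()) {
            lastNonEmptyMs = nowMs;
            return false;
        }
        return nowMs - lastNonEmptyMs > graceMs;
    }
}
```

Because the check depends only on state the framework maintains itself, it does not rely on which callbacks a given Storm version happens to invoke.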

michaelmoss commented 7 years ago

Thanks, @erikdw. I can take a stab at this today or tomorrow. I'm assuming these changes would be backwards compatible with prior releases in the 1.0.x line, particularly 1.0.2. Will test.

erikdw commented 7 years ago

Released v0.2.3 of this project with the fix from #208 included.

So this issue is done.

erikdw commented 7 years ago

@michaelmoss : FYI, I spent some time today to dig into the "Storm v1.1.0+ not working" issue: it's really bad. I filed a new issue in our project here: #214