Closed michaelmoss closed 7 years ago
@michaelmoss : thanks for both the report and doing investigations into the root cause! I haven't looked at the changes they made to storm-core's supervisor yet. I understand the logic pretty well in the old clojure code (for the supervisor at least!) and I understand this storm-mesos code well. So I'll look into it and see if it's acceptable to just accept your proposed PR for using confirmAssigned() instead of assigned(). I don't remember exactly what assigned() was used for at the moment.
Thanks, @erikdw. I'm also attempting to upgrade to 1.1.0 which I will create a separate report for (topologies will not accept any mesos resources).
This has me thinking, is there any value in this project creating separate branches for different versions of storm? How can I help?
The report you gave here about the issue in 1.0.3 is a tremendous help already! Only way you could do more for the issue would be if you were totally confident in the behaviors such that you knew for sure that switching from assigned to just confirmAssigned would be ok (I haven't had a chance yet to look into that).
Regarding the 1.1.0 issue, if you could do a similar level of analysis on why the topologies aren't getting any slots that would be awesome.
@erikdw, cool. Will post analysis on the 1.1.0 issue.
What do you think about creating separate branches in this repo to support different versions of storm?
I really don't wanna create any more branches than we need. We needed to create a separate branch for 1.x versus 0.x because of the package path rename (backtype.storm
-> org.apache.storm
). Every extra branch means more work for backporting changes, and we are already way too thin on my team to shepherd that.
My perspective on the specific issue here (#202) is that we should be able to make a change to the framework that doesn't break across the different versions of Storm.
I fear that the 1.1.0 issue will end up being another interface breakage, but in a deeper part of the Storm code (we had to reimplement Storm's internal scheduler to avoid issues with unschedulable "large" topologies blocking others from running and also to deal with fragmentation of Offers), and that it will require changes to Storm itself to fix.
I patched this locally, by calling the 'assigned' method from the 'confirmAssigned' method which gets called at the same frequency as 1.0.2 was calling assigned. This fixes the issue and I can put in a PR for this, but I'm wondering if folks have some context as to why assigned is no longer being called and that if this change could have unintended consequences.
so... I think this might be ok. The general behavior before:
confirmAssigned
: called by the Supervisor core to determine whether a given port it sees as assigned to this host via ZooKeeper is a port that this Supervisor should be launching a worker on. This call is absolutely critical for the correct behavior of storm-mesos:
confirmAssigned
is using the conveyed-via-mesos-task-launch info to say whether a port belongs to this supervisor or not.assigned
: called with all of the confirmed-as-being-assigned ports
_supervisorViewOfAssignedPorts
field all of the currently assigned ports, and that was all that this method did._supervisorViewOfAssignedPorts
field was only used to determine if there are any ports currently assigned, and this drove the suicide logic.Behavior proposed for 1.0.3+:
confirmAssigned
: I'm assuming there have been no behavior changes.
assigned
: unsure exactly what you passed in your call to assigned
from confirmAssigned
, but if we can join the 2 pieces of knowledge then I think this will suffice for replacing the purpose of the handling of assigned
before:
confirmAssigned
) and it's also known to have been assigned to the executor via mesos (i.e., _taskAssignments.confirmAssigned(port)
returns true
), then we have a port assigned and we can toggle a boolean to that effect. Thus allowing us to remove _supervisorViewOfAssignedPorts
in favor of that boolean. Something like _supervisorHasAPortAssigned
.Actually... I think this won't work: we need to be informed or figure out when there are no workers assigned to this supervisor. That was previously happening when assigned
was called with an empty set. We might have to rely on only tracking the content of the _taskAssignments
variable and checking that from the suicide-checker. That should work I think.
Thanks, @erikdw. I can take a stab at this today or tomorrow. I'm assuming these changes would be backwards compatible with prior releases in the 1.0.x line, particularly 1.0.2. Will test.
Released v0.2.3 of this project with the fix from #208 included.
So this issue is done.
@michaelmoss : FYI, I spent some time today to dig into the "Storm v1.1.0+ not working" issue: it's really bad. I filed a new issue in our project here: #214
Hi, All. Tremendous work in getting Storm 1.x support with this framework. So far so good with 1.0.2; thank you everyone for your contributions.
I've attempted to drop in Storm 1.0.3 and ran into an issue which I think is ultimately related to a refactor of converting some clojure code to java in the storm project, specifically the Supervisor (https://issues.apache.org/jira/browse/STORM-2018)
What's happening is that by upgrading to Storm 1.0.3 the "assigned" method is no longer being called, as a result, the MesosSupervisor is not updating its internal view of '_supervisorViewOfAssignedPorts' and is committing suicide:
I patched this locally, by calling the 'assigned' method from the 'confirmAssigned' method which gets called at the same frequency as 1.0.2 was calling assigned. This fixes the issue and I can put in a PR for this, but I'm wondering if folks have some context as to why assigned is no longer being called and that if this change could have unintended consequences.
Here's some stacks of the calls in 1.0.2 vs 1.0.3 (I just threw exceptions in the code, and printed the trace):
1.0.2:
1.0.3 (with my patch where confirmAssigned calls assigned):