Closed — armon closed this 8 years ago
Thanks for writing that up, @armon. I'll take a shot at "Change inflight tracking".
I'll start on 'precommit may not be necessary with new inflight' (though I only have about 20min right now, ugh).
Ok I'm going to go ahead and merge #117. Here's the list of remaining tasks I identified from that. Most of these are already marked inline with TODO comments:
1. `NewRaft()` to understand legacy `LogAddPeer` and `LogRemovePeer` entries #139
2. Will we need to support the legacy `LogAddPeer` and `LogRemovePeer` to make the transition work? #139
3. `TestRaft_SnapshotRestore_PeerChange` #139
4. `GetConfiguration()` #134
5. `ServerAddress` return type for `Leader()` #127
6. `quorumSize()` #136
7. `replicate()` #136
8. `updateLastAppended()` #136
9. Rename `Configuration` to something less conflate-y with `Config` #165
10. `installSnapshot` test that exercises the path that supports the old log messages for peer configuration #164
11. `Stats()`: the `"%+v"` formatting is weird / ugly. Also add a test to make sure it shows up (fixed a bug where the logic was wrong). #167
12. `makePeerInternal()`: default the protocol version to max if 0 is passed in; we should always configure a protocol version at this level (and 0 is a valid version) #170

@ongardie got one trivial PR up (#127) and will be sweeping through the unit tests tomorrow, fixing the integration tests as well as the commented-out ones.
Cool, thanks @slackpad. I've got a busy day tomorrow but can help out later in the week.
Cool I will keep adding the unit test work to the same PR in that case.
Ok I think #127 is in good shape for merge. Tomorrow I'm going to spend some time on these two items:
- Work through live upgrade scenario now that we have a new config format and ignore old config messages (will we need to support the legacy `LogAddPeer` and `LogRemovePeer` to make the transition work?)
- We need a mechanism for manual peer updates after an outage - there's currently no equivalent of "editing peers.json", so it doesn't seem possible to recover from all outage scenarios; there's one broken test related to this as well: `TestRaft_SnapshotRestore_PeerChange`
Will experiment a little and write up a spec so we can figure out the right way forward with these. We will need a way for Consul to take the existing peers.json and call some new Raft APIs in order to start up the cluster as well, since the peers.json had the only state related to the peers if a snapshot had been taken. I don't think I want to reintroduce the peer store into Raft, but Consul will need to parse this file outside of Raft and call some bootstrap-y thing to get going. I think this problem is closely related to the items above.
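The "parse peers.json outside of Raft" idea above could look something like the sketch below. This is only an illustration of the approach, not the library's eventual API: the `Server` type, its field names, and the ID-from-address fallback are all assumptions. The legacy `peers.json` format was a JSON array of `"host:port"` strings, which is what this decodes.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Server is a stand-in for a new-style configuration entry; the real
// type and field names in the library are assumptions here.
type Server struct {
	ID      string
	Address string
}

// parsePeersJSON decodes the legacy peers.json format (a JSON array of
// "host:port" strings) into a new-style server list, outside of Raft,
// so the caller could hand the result to some bootstrap/recovery API.
func parsePeersJSON(data []byte) ([]Server, error) {
	var addrs []string
	if err := json.Unmarshal(data, &addrs); err != nil {
		return nil, err
	}
	servers := make([]Server, 0, len(addrs))
	for _, a := range addrs {
		// The old format has no stable server IDs, so reuse the
		// address as the ID (an assumption about the migration).
		servers = append(servers, Server{ID: a, Address: a})
	}
	return servers, nil
}

func main() {
	servers, err := parsePeersJSON([]byte(`["10.0.0.1:8300","10.0.0.2:8300"]`))
	if err != nil {
		panic(err)
	}
	fmt.Println(servers)
}
```

The point of doing this outside Raft is exactly what the comment above argues: the peer store stays out of the library, and the application owns the file format.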
@slackpad, small request: could you number the post-117 checklist above? It'd help me track things.
@ongardie numbered the todo list
BTW I'm trying to refactor a ton of stuff that'll help or fix the following:
so save those for me for now :)
@ongardie thanks for the heads up. I'm working on 1, 2, 3, which should not conflict with your stuff too much. Should have a PR up soon so you can see what I'm thinking with respect to these. Have a good weekend!
@slackpad, now that I know you don't have too much outstanding, can we move `runFSM` out to `fsm.go`, and `runSnapshots`, `shouldSnapshot`, and `takeSnapshot` out to `snapshot.go`? See where I'm going with this? Every goroutine that has its own independent state goes in a different file. Later on, they get kicked off of the Raft struct so that they're not reaching into things they shouldn't.
Yep - I was starting to think about that, too. I'll go ahead and make that change and commit it now so we have a good basis for further work.
@ongardie the split is done under https://github.com/hashicorp/raft/pull/135 - went ahead and merged it.
Thanks @slackpad!
Some bookkeeping:

- Update `membership.md` now that the work happened.
- Note on `type configurations` that `latest` and `committed` may be the same, and often are.

I think that's all I'm sitting on.
@ongardie-sfdc thanks for the update and the detailed reviews! I reconciled all your changes with the checklists (added new numbered items for the two new ones).
After merging the version and recovery stuff into issue-84-integration, I cut https://github.com/hashicorp/raft/tree/library-v2-stage-one. I'll integrate and ship the next version of Consul against this branch and keep it open for small fixes I can pick from here if needed. This'll free up the issue-84-integration branch for the goroutine changes and a possible period of instability as that settles in.
After giving the community a heads up with respect to the interface changes, we can take issue-84-integration into master, and possibly make library-v2-stage-two as an interim bake and integrate target for future Consul versions.
Am seeing some issues with AddPeer after last night's merge. Is there a test somewhere that adds a new peer to a cluster? I did a scan through but didn't spot one for that. @ongardie @slackpad
Hi @superfell - there's a test here that hits it - https://github.com/hashicorp/raft/blob/issue-84-integration/raft_test.go#L2217-L2258. It's likely that you are calling `AddPeer()` but haven't set the `ProtocolVersion` back via configuration to a version that supports it - if you set it to 1, things should work as before.

Also, if you don't need to support any of the old stuff, I'd keep the `ProtocolVersion` at 3 and swap out your `AddPeer()` call with a call to `AddVoter()`.
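The version gating described above might be sketched like this. This is a hypothetical mirror of the behavior, not the library's code: the `checkLegacyAPI` helper and the exact cutoff version are assumptions. The idea is simply that legacy calls like `AddPeer()` are honored only under older protocol versions, while newer versions expect `AddVoter()`.

```go
package main

import (
	"errors"
	"fmt"
)

// ProtocolVersion mimics a configurable wire/API compatibility level.
type ProtocolVersion int

var errUnsupportedProtocol = errors.New("operation not supported with current protocol version")

// checkLegacyAPI rejects legacy membership calls (AddPeer/RemovePeer)
// when the configured protocol version is too new for them. The cutoff
// of 2 here is an assumption for illustration; the library's actual
// threshold may differ.
func checkLegacyAPI(v ProtocolVersion) error {
	if v > 2 {
		return errUnsupportedProtocol
	}
	return nil
}

func main() {
	fmt.Println(checkLegacyAPI(1)) // old protocol: legacy AddPeer allowed
	fmt.Println(checkLegacyAPI(3)) // new protocol: caller should use AddVoter instead
}
```

This is why setting the version back to 1 makes `AddPeer()` work as before, while staying at 3 requires switching to the new call.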
Ahh, the setting of `ProtocolVersion` in `DefaultConfig()` was messing with me.
Everything is done or tracked elsewhere - closing!
Here is the list of issues, grouped together if they might make sense as a single PR.

State Races

Multi-Row Fetching

- Replace single row lookups with multi row lookups (LogStore / LogCache) (look at cases around log truncation)

Follower Replication

- `replicateTo` should verify leadership is current during looping
- `stopCh`

Change Inflight Tracking

- `precommit` may not be necessary with new inflight (likely will be cleaned up via #117)

Improve Membership Tracking

Crashes / Restart Issues

- Panic with old snapshots #85
- TrailingLogs set to 0 with restart bug #86

New Tests
/cc: @superfell @ongardie @sean- @ryanuber @slackpad
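The "Multi-Row Fetching" item above could look something like this sketch: a store exposing a ranged `GetLogs` call instead of one `GetLog` call per index, so a disk-backed `LogStore` could serve the range with a single read. The types and method names here are illustrative assumptions, not the library's interface.

```go
package main

import "fmt"

// Log is a minimal stand-in for a replicated log entry.
type Log struct {
	Index uint64
	Data  []byte
}

// memStore is a toy in-memory log store used to illustrate the
// single-lookup vs ranged-lookup difference.
type memStore struct {
	logs map[uint64]Log
}

// GetLog is the one-at-a-time lookup the checklist item wants to avoid
// calling in a loop during replication.
func (s *memStore) GetLog(index uint64) (Log, bool) {
	l, ok := s.logs[index]
	return l, ok
}

// GetLogs fetches the inclusive range [min, max] in one call. A missing
// entry (e.g. after truncation, which the item flags) fails the range.
func (s *memStore) GetLogs(min, max uint64) ([]Log, error) {
	out := make([]Log, 0, max-min+1)
	for i := min; i <= max; i++ {
		l, ok := s.logs[i]
		if !ok {
			return nil, fmt.Errorf("log %d not found", i)
		}
		out = append(out, l)
	}
	return out, nil
}

func main() {
	s := &memStore{logs: map[uint64]Log{
		1: {Index: 1}, 2: {Index: 2}, 3: {Index: 3},
	}}
	logs, err := s.GetLogs(1, 3)
	fmt.Println(len(logs), err)
}
```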