hashicorp / raft

Golang implementation of the Raft consensus protocol
Mozilla Public License 2.0
8.25k stars 995 forks source link

Cleanup Meta Ticket #84

Closed armon closed 8 years ago

armon commented 8 years ago

Here are the list of issues, grouped together if they might make sense as a single PR.

State Races

Multi-Row Fetching

Follower Replication:

Change Inflight Tracking

Improve Membership Tracking

Crashes / Restart Issues

New Tests

/cc: @superfell @ongardie @sean- @ryanuber @slackpad

ongardie commented 8 years ago

Thanks for writing that up, @armon. I'll take a shot at "Change inflight tracking".

ongardie commented 8 years ago

I'll start on 'precommit may not be necessary with new inflight' (though I only have about 20min right now, ugh).

slackpad commented 8 years ago

Ok I'm going to go ahead and merge #117. Here's the list of remaining tasks I identified from that. Most of these are already marked inline with TODO comments:

slackpad commented 8 years ago

@ongardie got one trivial PR up #127 and will be sweeping through the unit tests tomorrow fixing the integration tests as well as the commented out ones.

ongardie commented 8 years ago

Cool, thanks @slackpad. I've got a busy day tomorrow but can help out later in the week.

slackpad commented 8 years ago

Cool I will keep adding the unit test work to the same PR in that case.

On Jun 27, 2016, at 11:50 PM, Diego Ongaro notifications@github.com wrote:

Cool, thanks @slackpad. I've got a busy day tomorrow but can help out later in the week.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.

slackpad commented 8 years ago

Ok I think #127 is in good shape for merge. Tomorrow I'm going to spend some time on these two items:

Work through live upgrade scenario now that we have a new config format and ignore old config messages (will we need to support the legacy LogAddPeer and LogRemovePeer to make the transition work?) We need a mechanism for manual peer updates after an outage - there's currently no equivalent of "editing peers.json" so it doesn't seem possible to recover from all outage scenarios; there's one broken test related to this as well: TestRaft_SnapshotRestore_PeerChange

Will experiment a little and write up a spec so we can figure out the right way forward with these. We will need a way for Consul to take the existing peers.json and call some new Raft APIs in order to start up the cluster as well, since the peers.json had the only state related to the peers if a snapshot had been taken. I don't think I want to reintroduce the peer store into Raft, but Consul will need to parse this file outside of Raft and call some bootstrap-y thing to get going. I think this problem is closely related to the items above.

ongardie-sfdc commented 8 years ago

@slackpad, small request: could you number the post-117 checklist above? It'd help me track things.

slackpad commented 8 years ago

@ongardie numbered the todo list

ongardie-sfdc commented 8 years ago

BTW I'm trying to refactor a ton of stuff that'll help or fix the following:

so save those for me for now :)

slackpad commented 8 years ago

@ongardie thanks for the heads up. I'm working on 1, 2, 3 which should not conflict with your stuff too much. Should have a PR up soon so you can see what I'm thinking wrt. to these. Have a good weekend!

ongardie-sfdc commented 8 years ago

@slackpad, now that I know you don't have too much outstanding, can we move runFSM out to fsm.go and runSnapshots, shouldSnapshot, takeSnapshot out to snapshot.go? See where I'm going with this? Every goroutine that has its own independent state goes in a different file. Later on, they get kicked off of the Raft struct so that they're not reaching into things they shouldn't.

slackpad commented 8 years ago

Yep - I was starting to think about that, too. I'll go ahead and make that change and commit it now so we have a good basis for further work.

slackpad commented 8 years ago

@ongardie the split is done under https://github.com/hashicorp/raft/pull/135 - went ahead and merged it.

ongardie-sfdc commented 8 years ago

Thanks @slackpad!

ongardie-sfdc commented 8 years ago

Some bookkeeping:

I think that's all I'm sitting on.

slackpad commented 8 years ago

@ongardie-sfdc thanks for the update and the detailed reviews! I reconciled all your changes with the checklists (added new numbered items for the two new ones).

slackpad commented 8 years ago

After merging the version and recovery stuff into issue-84-integration, I cut https://github.com/hashicorp/raft/tree/library-v2-stage-one. I'll integrate and ship the next version of Consul against this branch and keep it open for small fixes I can pick from here if needed. This'll free up the issue-84-integration branch for the goroutine changes and a possible period of instability as that settles in.

After giving the community a heads up wrt. the interface changes we can take issue-84-integration into master, and possibly make library-v2-stage-two as an interim bake and integrate target for future Consul versions.

superfell commented 8 years ago

Am seeing some issues with AddPeer after last nights merge, Is there a test somewhere that adds a new peer to a cluster, I did a scan though but didn't spot one for that. @ongardie @slackpad

slackpad commented 8 years ago

Hi @superfell - there's a test here that hits it - https://github.com/hashicorp/raft/blob/issue-84-integration/raft_test.go#L2217-L2258. It's likely that you are calling AddPeer() but haven't set the ProtcolVersion back via configuration to a version that supports it - if you set it to 1 things should work as before.

slackpad commented 8 years ago

Also, if you don't need to support any of the old stuff, I'd keep the ProtocolVersion at 3 and swap out your AddPeer() call with a call to AddVoter().

superfell commented 8 years ago

Ahh, the setting of protocolVer in DefaultConfig() was messing with me.

slackpad commented 8 years ago

Everything is done or tracked elsewhere - closing!