Starting slave from a correct position with --base-seqno

GoogleCodeExporter commented 9 years ago

1. To which tool/application/daemon will this feature apply?

'./trepctl online' options.

2. Describe the feature in general

We would really like a simple way to:

a. Start the slave from a backup that does not include the trep_commit_seqno 
table.
b. Recover from various errors easily by starting the slave from a specific 
position.

Currently, however, you either need the trep_commit_seqno table when reloading 
the backup, or you must list many numbers in the --skip-seqno option, which 
becomes daunting as soon as you have a handful of events.

We do have a --base-seqno option, but it only works on a master, where it sets 
the starting log number. On a slave it does not act as we would like: the slave 
reads the full log starting at seqno=0, so we have to add 
-skip-seqno 1,2,...,N, where N is the last seqno before the position you would 
like to start from.
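
To show why this workaround gets unwieldy, here is a minimal sketch of building the 1,2,...,N list by hand. The function name build_skip_list is hypothetical and is not part of Tungsten; it only prints the list described above and does not call the replicator.

```shell
# Hypothetical helper (not part of Tungsten): print the list 1,2,...,N
# for -skip-seqno, where N is the seqno just before the desired start.
build_skip_list() {
    start_seqno="$1"    # seqno the slave should start from
    list=""
    i=1
    while [ "$i" -lt "$start_seqno" ]; do
        # Append the next seqno, adding a comma only between entries.
        list="${list:+$list,}$i"
        i=$((i + 1))
    done
    printf '%s\n' "$list"
}
```

The output would then be passed along the lines of ./trepctl online -skip-seqno "$(build_skip_list 5)", and the list grows by one entry for every event to be skipped.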

3. Describe the feature interface

The following needs to work on the slave:

./trepctl online --base-seqno X

It should skip all the events until X and start from there.
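
The requested semantics can be modelled with a small filter. This is only an illustration, not replicator code: apply_from_base is a hypothetical name, and the sketch assumes that the event with seqno X itself is applied and only earlier events are skipped.

```shell
# Hypothetical model of the requested behaviour (not replicator code).
# Reads one seqno per line on stdin and prints those that would be applied.
apply_from_base() {
    base_seqno="$1"
    while read -r seqno; do
        # Events before the base are skipped; the base itself is
        # assumed to be applied.
        if [ "$seqno" -ge "$base_seqno" ]; then
            printf '%s\n' "$seqno"
        fi
    done
}
```

Feeding seqnos 0 through 4 into apply_from_base 3 leaves only 3 and 4, whereas the current slave behaviour corresponds to applying all five.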

Original issue reported on code.google.com by linas.vi...@continuent.com on 3 Aug 2012 at 6:33

Blocking: #503
Blocked on: #777, #846

GoogleCodeExporter commented 9 years ago

Original comment by linas.vi...@continuent.com on 20 Mar 2013 at 1:39

Now blocking: #503

GoogleCodeExporter commented 9 years ago

This popped up in a customer's deployment. The way -base-seqno behaves on the 
slave currently is counter-intuitive and dangerous:

1. Put slave offline.
2. Insert a few transactions on the master.
3. Try to put slave back online at the very last event (ignoring all the others 
that were inserted): trepctl online -base-seqno <last_committed_seqno>.
4. Slave will go online successfully, but it will have applied *all* events, 
effectively making -base-seqno useless.

Original comment by linas.vi...@continuent.com on 13 Dec 2013 at 8:39

Added labels: FixedIn-2.2.1, FoundIn-2.0.5, Priority-High
Removed labels: Foundin, Priority-Low

GoogleCodeExporter commented 9 years ago

The option behavior is definitely confusing.   Here's the current state of 
things in the replicator. 

-base-seqno:  This option is designed to allow a master to regenerate the log 
from a seqno value other than 0.  It has no other use and should probably be 
regarded as an error in any other case.  It would also be helpful to rename it, 
as it is different from choosing a restart position.  

-from-event:  This option allows a master to read from the DBMS log (whether 
binlog or Oracle CDC) from a particular position.  When you enter this option 
on a slave, it will cause the replicator to search forward in the log until it 
can find the event ID.  It's potentially a very slow operation.  Looking at the 
code, moreover, I don't see a simple way to force the replicator to start at a 
particular seqno when searching forward.  

There is no way to force a slave to start applying at a particular seqno using 
online options.  Moreover, slaves actually have two restart positions.  There's 
the position of the log and the position of the slave.  

1. Log position.  The THL is designed to avoid gaps in the log, as these create 
ambiguity about whether we are skipping events or the log is corrupt.  We don't 
want to skip events here or, if we do, we need to create a filtered event to 
fill in the gap.  Computing the gap is a little tricky since the extractor that 
pulls from the master does not know the current state of the log.  Instead it 
would have to say something like "I'm resetting the log position" and let the 
downstream applier that writes to the log figure out what to do, which includes 
handling corner cases where the new seqno position leads to regenerating 
different records with earlier seqno values.  The log should catch attempts to 
add earlier seqnos, but this would still produce a nasty error. 

2. Slave position.  When you ask a slave to start at a particular position, 
what you really want to do is set the position in the trep_commit_seqno table 
so that the slave reports a higher seqno than it would otherwise.  We should 
probably add an option to set the slave position explicitly.  This would update 
the trep_commit_seqno table just as Jeff's script does now.  One nice feature 
would be to allow this to happen while the slave is offline, which would 
potentially be less confusing than adding more options to the online 
command.  

For this reason, I think the existing tungsten_set_position script is a good 
interim solution.  

It's possible I'm missing something from the code, but this is how things work 
now. The fact is that the replicator is incomplete here, so it would be a good 
idea to fill things in a bit, especially as regards the slave position.  Making 
things better should not be too hard.

Original comment by robert.h...@continuent.com on 14 Dec 2013 at 6:44

GoogleCodeExporter commented 9 years ago

The tungsten_set_position script was added in 2.2.0 for just this reason. It 
currently only supports MySQL but can inspect the THL event on a remote server 
and set the trep_commit_seqno table. It will also accept all necessary values 
at the command line for use when setting the initial extraction position.

https://docs.continuent.com/continuent-tungsten-2.0/deployment-replicatorin.html

Original comment by jeff.m...@continuent.com on 29 Jan 2014 at 3:17

GoogleCodeExporter commented 9 years ago

Positioning is causing a lot of problems.  We should fix this after data 
sources are fully implemented in Tungsten Replicator 3.0.  This work should 
make it easier to address positioning issues.

Original comment by robert.h...@continuent.com on 5 May 2014 at 11:16

Added labels: FixedIn-3.0.1
Removed labels: FixedIn-2.2.1

GoogleCodeExporter commented 9 years ago

We will no longer use the third version digit for normal releases. It will only 
be incremented for maintenance ones.

Original comment by linas.vi...@continuent.com on 26 May 2014 at 5:01

Added labels: FixedIn-3.1.0
Removed labels: FixedIn-3.0.1

GoogleCodeExporter commented 9 years ago

Original comment by linas.vi...@continuent.com on 2 Jun 2014 at 5:53

Now blocked on: #777, #846

GoogleCodeExporter commented 9 years ago

Original comment by linas.vi...@continuent.com on 19 Jan 2015 at 2:18

Added labels: FixedIn-4.0.1
Removed labels: FixedIn-3.1.0

Starting slave from a correct position with --base-seqno #356