junneyang / zumastor

Automatically exported from code.google.com/p/zumastor

Slow mounting renders replicated volumes unusable #85

Closed. GoogleCodeExporter closed this issue 9 years ago.

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Set up origin and target as explained in the howto. Set the replication
period short relative to the volume size.

What is the expected output? What do you see instead?

The volume should be available. Instead, each replication cycle on the target goes through:
- Delta is transferred from origin to target (the volume remains available during the transfer)
- Volume is unmounted
- Delta is applied
- Volume is mounted

If the delta is small enough (for example, when no changes have been made to the
volume, so the delta is transmitted almost instantly) and mounting takes long
enough, then right after Zumastor finishes mounting a volume it unmounts it again
to apply the next delta.

With a 5 GB volume and a 5-second replication period, this means the volume is
available for only milliseconds at a time.
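
To make the timing concrete, here is a rough sketch of the downstream cycle as I understand it (pseudocode with hypothetical helper names, not Zumastor's actual code):

    # Rough illustration of the downstream cycle (hypothetical helpers):
    while true; do
        receive_delta          # volume stays mounted while the delta streams in
        umount "$MOUNTPOINT"   # volume becomes unavailable here
        apply_delta            # nearly instant when the delta is empty
        mount_snapshot         # slow; dominates the cycle for small deltas
        # With a 5-second replication period and an almost-empty delta, the
        # next cycle starts right after mount_snapshot returns, so the volume
        # is readable only for the brief gap before the next umount.
    done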

What version of the product are you using? On what operating system?

0.7r1419 on Ubuntu using zumastor-team packages

Please provide any additional information below.

I'd say this is a release blocker.

Original issue reported on code.google.com by pgqui...@gmail.com on 28 Feb 2008 at 2:39

GoogleCodeExporter commented 9 years ago
Here come the logs. This issue is probably closely related to issues 80 (still
open) and 71 (already closed).

Original comment by pgqui...@gmail.com on 28 Feb 2008 at 2:50

Attachments:

GoogleCodeExporter commented 9 years ago
There are a couple of optimizations which are relatively easy to do to fix this.
For the "no changes" case, we can simply check for this downstream and completely
skip the replication cycle when there is no change. The original implementation
did this, but it was removed at some point during debugging because it was
believed to be a premature optimization. Perhaps we could also add a way to
change the snapshot metadata (id and creation time) on the fly, so it won't look
like replication just stopped when there was really no churn.

We also want to speed up the snapshot rotation downstream. We tried using
"mount --move" about a year ago to make this more bumpless. The "mount" command
in util-linux was broken, preventing this from working. That bug should be fixed
now in util-linux-ng, so we can take advantage of it.

Original comment by sha...@gmail.com on 29 Feb 2008 at 4:41
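
For reference, here is a minimal sketch of the two ideas above, using assumed paths, device names and file names (this is not Zumastor's actual replication script):

    #!/bin/sh
    # Hypothetical downstream switchover illustrating both optimizations.
    # All paths and device names below are assumptions.

    VOL=testvol
    EXPORT=/var/run/zumastor/mount/$VOL   # mount point clients use (assumed)
    STAGE=/var/run/zumastor/stage/$VOL    # staging mount point (assumed)
    NEWDEV=/dev/mapper/$VOL-incoming      # snapshot device after the delta is applied (assumed)
    DELTA=/tmp/$VOL.delta                 # received delta file (assumed)

    # Optimization 1: skip the cycle entirely when nothing changed upstream.
    # (The real check would come from the change list, e.g. via ddsnap; a
    # zero-size delta file is just a stand-in here.)
    if [ ! -s "$DELTA" ]; then
        exit 0
    fi

    # Optimization 2: do the slow mount at a staging path first, then swap
    # the mounts, so the exported path is only unmounted for an instant.
    # Needs a mount(8) from util-linux-ng where "mount --move" works.
    mkdir -p "$STAGE"
    mount -o ro "$NEWDEV" "$STAGE"        # slow step; clients are unaffected
    umount "$EXPORT"                      # downtime starts
    mount --move "$STAGE" "$EXPORT"       # downtime ends

With this scheme, the only window where the volume is unavailable is the umount / mount --move pair, which should stay roughly constant regardless of volume or snapshot size.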

GoogleCodeExporter commented 9 years ago
pgquiles,
does this problem prevent you from testing zumastor in production?
I'd like to understand what makes this a release blocker for you.

Also, if we could arrange for all references to the volume to
block during the switch, would that suffice?

Original comment by daniel.r...@gmail.com on 4 Mar 2008 at 1:54

GoogleCodeExporter commented 9 years ago
> does this problem prevent you from testing zumastor in production?

Yes, it's definitely a blocker: because of all the mounting and unmounting, the
data on the replica is not really available. Volumes stay mounted for such a
short time that you start copying a file over Samba and the volume vanishes
before the copy finishes.

> Also, if we could arrange for all references to the volume to
> block during the switch, would that suffice?

Sorry, I don't understand what you mean. What are the references? (the open files?)
What's the "switch"? (unmounting + applying the delta + mounting?)

Original comment by pgqui...@gmail.com on 4 Mar 2008 at 8:12

GoogleCodeExporter commented 9 years ago
I clarified this with pgquiles on IRC.
He doesn't plan to use 5-second snapshot intervals in production.
He's testing 5 GB / 5-second replication on the assumption that it's a good
model for 1 TB volumes with 1-hour replication cycles, which is much closer
to what he will be running in production.
He's OK with a 1 TB volume being offline for 5 seconds during switchover to
a new incoming snapshot.

Original comment by daniel.r...@gmail.com on 4 Mar 2008 at 1:01

GoogleCodeExporter commented 9 years ago
So I think the action here is simply to verify that the downtime during
downstream snapshot application, for volumes between 5 GB and 1 TB, is shorter
than 5 seconds and does not scale with the size of the volume or snapshot.
This should be easy; I think we already meet that.

Original comment by daniel.r...@gmail.com on 4 Mar 2008 at 8:41
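
One way to measure this would be to poll the exported mount point on the downstream host and time how long it is absent across a few replication cycles. A rough sketch, assuming GNU date and mountpoint(1), with a made-up mount point:

    # Measurement sketch (assumed mount point; run on the downstream host).
    MNT=/var/run/zumastor/mount/testvol
    while true; do
        if ! mountpoint -q "$MNT"; then
            start=$(date +%s.%N)
            while ! mountpoint -q "$MNT"; do
                sleep 0.05
            done
            end=$(date +%s.%N)
            echo "volume was offline for $(echo "$end - $start" | bc) seconds"
        fi
        sleep 0.05
    done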

GoogleCodeExporter commented 9 years ago
I think this is ok, so I'm closing as invalid.  If I'm wrong, please reopen.

Original comment by daniel.r...@gmail.com on 6 Mar 2008 at 6:21