mpalmer / lvmsync

Synchronise LVM LVs across a network by sending only snapshotted changes
http://theshed.hezmatt.org/lvmsync
GNU General Public License v3.0

Fix parser crash when LV stripes > 1 #43

Closed · sjbronner closed this 9 years ago

sjbronner commented 9 years ago

I really banged my head against this one.

When running the command lvmsync /dev/lvm/leipzig.vmmigrate.1424707061 tianjin:/dev/lvm/test, I kept getting the following message:

FATAL ERROR: /dev/lvm/leipzig.vmmigrate.1424707061: could not find logical volume (Cannot parse vgcfgbackup output: Expected one of [0-9], , , ", [1-9], 0, [\s], ] at line 100, column 18 (byte 2026) after lvm {
        id = "p2Kp2s-eCDp-MIJU-dars-6gwU-uYjr-wVwPhO"
<lines snipped>
        logical_volumes {
<lines snipped>
                vserver05 {
                        id = "YEQGFk-zWzO-dAwr-89ab-o5Pw-aZH5-UqPhMC"
                        status = ["READ", "WRITE", "VISIBLE"]
                        flags = []
                        segment_count = 2

                        segment1 {
                                start_extent = 0
                                extent_count = 20480    # 80 Gigabytes

                                type = "striped"
                                stripe_count = 2
                                stripe_size = 128       # 64 Kilobytes

                                stripes = [
                                        "pv0", 43494)

The full message along with the dump file created by vgcfgbackup is at https://gist.github.com/sjbronner/bc6478b9756a94106ca1.

I went on several wild goose chases trying to find an error in the code of lvmsync and/or treetop. In the end, I finally understood enough of how treetop works to guess at what the entries in vgcfgbackup.treetop should do. And then it was staring me in the face: the list of physical volumes behind a logical volume is separated by ",\n", not by ", " as the grammar specified.
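
To make the shape of the fix concrete, here is a minimal Treetop sketch (the rule names are invented for illustration; the real rule lives in vgcfgbackup.treetop). The essence is that the separator between list entries has to allow a newline plus indentation, not just a literal ", ":

    grammar VgCfgBackupSketch
      # Illustrative only -- these are not the actual rule names from
      # lvmsync's vgcfgbackup.treetop.
      rule value_list
        "[" ws? value (ws? "," ws? value)* ws? "]"
      end

      rule value
        '"' [^"]* '"' / [0-9]+
      end

      rule ws
        # Must accept newlines and tabs as well as spaces, because
        # vgcfgbackup wraps a multi-stripe list across several lines.
        [ \t\n]+
      end
    end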

mpalmer commented 9 years ago

Thanks for the patch!

Since you provided the full vgcfgbackup output, I've been able to put together a test case which demonstrates the problem, so I can be confident it won't recur. I've never used striped LVs, so this had never come up for me before.

One thing I'm rather concerned about is that if you try to do an lvmsync off a striped LV, it probably won't work. I've certainly never tested it, and since lvmsync operates at a very low level on the LVM metadata, I have no real idea whether it'll explode, horribly maiming innocent bystanders, or quietly do something terrible that won't be noticed for some time to come. Testing is strongly advised before relying on its ability to sync off a striped volume.

Of course, if you're doing lvmsync off linear volumes and lvmsync's parser was just choking on other striped volumes you happen to have on the machine, then this patch will do the job just fine.

I've merged your patch, and released it in version 3.2.1. Hope all goes well for you from here.

sjbronner commented 9 years ago

Thanks for the heads-up about potential issues with striped volumes.

So, I tested it. I created a snapshot, and then wrote 400MB of random data into a single file on the original volume. My assumption is that the data was (for the most part) contiguous and therefore touched both stripes. I then ran lvmsync, and compared md5sums of the source and target volumes.
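
In outline, the test went something like this (the snapshot name, mount point, and sizes here are illustrative, not the exact ones I used):

    # on the source host
    lvcreate --snapshot --size 1G --name leipzig.vmmigrate /dev/lvm/leipzig
    dd if=/dev/urandom of=/mnt/leipzig/random.dat bs=1M count=400
    lvmsync /dev/lvm/leipzig.vmmigrate tianjin:/dev/lvm/test
    md5sum /dev/lvm/leipzig

    # on the target host (tianjin)
    md5sum /dev/lvm/test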

The result:

29d1fe26fd2891bd20244a522fc473e5  /dev/lvm/leipzig
29d1fe26fd2891bd20244a522fc473e5  /dev/lvm/test

They are identical!

Thank you for integrating my change so quickly! And I'm really happy that your low-level code handles stripes correctly even though you didn't have that use case in mind while writing it.

sjbronner commented 9 years ago

One more thing that just crossed my mind: you released the change in version 3.2.1, which won't be picked up as an upgrade by installations already running version 3.3.0. Was that an oversight, or is 3.3.0 special in some way?

mpalmer commented 9 years ago

I'm pleasantly surprised it works with striping as-is; thanks for testing and letting me know. As to the version numbering, that was a screw-up on my part; I've re-released the same code as 3.3.1 so upgrades work properly.