schoebel / mars

Asynchronous Block-Level Storage Replication
GNU General Public License v2.0

Using snapshots #39

Closed marksaitis closed 2 years ago

marksaitis commented 2 years ago

Is it possible to use snapshots on MARS volumes on a destination system, using LVM? For instance, a daily snapshot schedule kept for 7 days would allow having 7 different versions of the replica on the destination, and merging could be used to ensure there are always 7 versions.

schoebel commented 2 years ago

Hi Mark,

In theory, it should be possible to stack another LVM instance on top of /dev/mars/$name or similar. However, I have never tried this.

I have just tried it on a KVM test machine, but it did not work, as explained in the following.

Here is what I have done for testing:

Added two new lines to /etc/lvm/lvm.conf:

  preferred_names = [ "^/dev/mars/" ]
  types = [ "mars", 16 ]

Otherwise I got errors like "Device /dev/mars/$name excluded by a filter". But this part is easy to work around for testing.

However, just changing the LVM filter rules is not enough. The following appears to work now, but the effect is NOT what I wanted to achieve:

marsadm create-resource upper-0 /dev/vg-test/lower-0

then I can successfully run

pvcreate /dev/mars/upper-0

However, pvs then spits out a lot of warnings, and the resulting list of PVs looks different than expected: instead of /dev/mars/upper-0, the entry /dev/vg-test/lower-0 is shown :(

I guess the reason could be udev or a similar daemon (depending on your distro, whether you have systemd, and so on). Here are my thoughts:

Currently, MARS stores exactly the same content in /dev/mars/upper-0 and in /dev/vg-test/lower-0.

Traditionally, this was viewed as a feature: not only can you migrate between DRBD and MARS (back and forth), but you can also migrate a purely local LV into some geo-redundant setup, or even downgrade again to a purely local setup in the reverse direction.

However, what we want to do HERE is VERY different: we want two different device contents, one for the lower-$name and one for the upper-$name device. Otherwise, we could not stack them on top of each other.

AFAICS udev (or whatever automation is involved) tries to make the setup of the kernel-level dm components reboot-safe. Therefore it always scans the PV superblock, which contains an LVM-level UUID. Of course, when 2 devices have different names in the filesystem tree but contain exactly the same internal LVM UUID, they are counted as the "same" device by udev (or even by blkid or whatever tool you are using for analysis or for automated setup).
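
For illustration, the duplication can be inspected with standard tools (device names as in the test above; output details will vary):

  # both devices should report the identical LVM2 UUID
  blkid /dev/vg-test/lower-0 /dev/mars/upper-0
  # show which device name LVM has picked for the duplicated PV
  pvs -o pv_name,pv_uuid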

Well, I think a solution should be possible.

However, I am overloaded with a lot of other work, and currently I am working on OpenSource only during my precious holidays.

So I see the following solutions for you:

a) WORKAROUND: try not to stack LVM instances onto each other. For example, use zfs for either the upper or the lower instance, and use LVM for the other layer. Or use something else which can be stacked more easily.

I have no experience with such setups. So I would be glad if you could inform me (via email) about your war story or your success story ;)

b) try to help me find a solution.

c) wait until I have resolved it in my spare time.

I am just thinking about several alternatives for a MARS-level solution:

A) the stacking order could be fully encapsulated at configuration level, so /dev/mars/upper-$name would be slightly bigger than /dev/vg-test/lower-$name because it needs to contain 1 more nested superblock (having different internal UUIDs). This solution would be more generic, but sysadmins would be responsible for creating virtual devices with the right sizes, similar to "dd seek=$right_number" or "skip=$right_number". Any human error could destroy your data :(

B) a future MARS feature could try to separate the additional upper-layer superblock from the full lower-layer device content. So MARS would take over the responsibility for LVM stacking. Of course, this would be more work, also for testing. It would lead to some dependencies on LVM internals and their released versions.

C) not yet tested due to lack of time: could other tools and their superblocks, like gdisk / sgdisk, potentially solve the stacking problem without needing any changes in the current version of MARS, or maybe with less work in MARS? Theoretically, you might be able to nest your new PV / VG / LV instances recursively into a GPT, or similar.

Of course, any future solution must cope with MARS-level handover and failover. This also needs some appropriate testing.

Any comments?

If you find a viable solution, I would like to publish it, so that others can benefit from it.

Cheers, Thomas

subbergunz commented 2 years ago

I hope I am understanding correctly what is being discussed here:

LVM stacking: I have been using this successfully for many years to deliver LVM PVs to a VM from an LV on the host (I do this to achieve consistency of data across two filesystems of the VM). I am not really sure whether this is what is being discussed.

I do this in lvm.conf on the host:

filter = [ "r|.*undermars.*|", "r|.*zcdppv.*|" ]

because:

Snapshot at secondary: if the device under MARS at the secondary is an LV, I see no problem in stopping the mars resource, waiting for it to be consistent, taking a snapshot, restarting it, and repeating at regular intervals in order to keep a version history. If what is being discussed is making a snapshot inside the protected PV (e.g. we are interested in only one LV inside it and do not want to waste resources snapshotting other, uninteresting LVs), then no, I cannot imagine a way to use this to keep such a version history.

Hope this helps. Best regards, Bergonz

schoebel commented 2 years ago

Hi Bergonz,

I am also unsure whether we might be misunderstanding Mark's question.

In my first answer, I had assumed that Mark wanted to use the hypervisor for stacking of PVs, without involving a VM.

E.g. a rough stack like hardware => lower-PV => lower-VG => lower-LV => MARS resource => /dev/mars/$name => higher-PV => higher-VG => higher-LV => application

... and I assumed that all of these layers would be needed for some LVM-level snapshots which could then be preserved after a geo-redundant handover / failover.

Now, hopefully I am understanding your answer correctly. If you use a VM = Virtual Machine on top of MARS (e.g. KVM/qemu), the game will be different, because the VM has a completely new OS instance, including a (potentially different) kernel, or maybe even a Windows instance, or whatever. Of course, the guest system in the VM can then implement its internal snapshots, independently of the hypervisor.

Your answer tells me that you are likely operating a VM on top of MARS (not just testing). Operational stories are interesting to me.

Let us wait for a clarification from Mark.

This topic is probably of interest to more people?

Cheers,

Thomas

marksaitis commented 2 years ago

Hello gentlemen. Thank you for taking the time to explain, though I am not sure I understood everything in the replies. I will try to explain what I want to do.

Site A (the main site where services live; I need MARS under the OS): Hardware => lower-PV (soft RAID; can't do HW RAID in this scenario, I know the performance implications) => lower-VG => lower-LV md0 => MARS resource /dev/mars/$name => higher-PV => higher-VG => higher-LV => application (encrypted Linux OS installed here, ext4 or btrfs)

<<>>

Site B (hot backup site): Hardware => lower-PV => lower-VG => lower-LV => MARS resource /dev/mars/$name => higher-PV => higher-VG => higher-LV => (here I want to do snapshots, either LVM ext4 or btrfs) application (encrypted Linux OS installed here). So site B sits as a backup + contains daily snapshots for the past 7 days.

Objective:

  1. We ensure we have an offsite copy of everything, on site B.
  2. We only sync blocks from site A to B. Site B is where the snapshots happen, to enable some retention.
  3. We ensure daily retention for the last 7 days on site B. So if something goes wrong, we can revert a few days back.
  4. This would also be more efficient than any file-level incremental/retention-based backup, because no scanning is needed!
  5. This is also good because site B does not need to decrypt anything. If a normal file-based backup were used on top to get some retention (daily copies), it would need to see the files, and also to scan them...

I am not that familiar with storage, so feel free to adjust the description of site A and site B if you see something inaccurate.

Marksaitis

schoebel commented 2 years ago

Hi Marksaitis,

hopefully the following ideas can match your objectives more closely.

So you likely have an active-passive = primary-secondary setup, and your daily snapshots are planned at the secondary side, and are read-only, right?

OK, then I believe that the snapshots could be taken at the lower LVM level of the secondary, provided you have enough disk space there. For example, you know your daily update rate (in units like GB/day of delta size) and you have enough spare space on both sides, to avoid overflow caused by snapshots.

Here is a rough idea of how it could work using MARS:

Setup: your PV size should be larger than the planned LV size plus 7+1 = 8 incremental snapshot sizes (delta sizes) plus some spare space for risk reduction.
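
For a concrete but entirely made-up example: with a 500 GiB LV and roughly 10 GiB of changed blocks per day, the rule above gives

  PV size >= 500 GiB (LV) + 8 x 10 GiB (snapshot deltas) + spare (say 10-20%), i.e. roughly 650 GiB or more.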

Ordinary operations: as explained in mars-user-guide.pdf. Normally, you are just creating a continuous "backup" at the secondary side.

How to obtain a daily snapshot?

  1. at secondary: marsadm pause-replay $name
  2. wait until the replay has actually stopped (status PausedReplay)
  3. now create an LVM snapshot at the secondary, and give it a name containing the date, e.g. $name-$(date +%F) or similar.
  4. wait and check that the snapshot operation has finished.
  5. marsadm resume-replay $name

Further steps:

  6. Check that your new snapshot is usable, e.g. make a read-only filesystem mount of /dev/$vg/$name-$date onto /mountpoints/$name-$date with the mount option "-o ro" or similar.
  7. Finally, delete the 8th (oldest) snapshot, in order to free its space.

Step 7 must not be omitted; otherwise the disk space will fill up over time until something breaks. (A combined shell sketch of steps 1-7 follows below.)
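
To make the steps more concrete, here is a minimal shell sketch of the whole daily procedure. It is untested and only illustrative: the resource name, VG name, snapshot size, and mountpoint are assumptions to adapt, the sketch assumes the lower LV carries the same name as the resource, and the exact way to check the replay state via "marsadm view" may differ between marsadm versions.

  #!/bin/bash
  res=mydata                                # MARS resource name (assumption)
  vg=vg-test                                # VG of the lower LV (assumption)
  snap="${res}-$(date +%F)"                 # snapshot name containing the date

  # step 1: stop the replay at the secondary
  marsadm pause-replay "$res"

  # step 2: wait until the replay has actually stopped (status PausedReplay)
  until marsadm view "$res" | grep -q PausedReplay; do
      sleep 5
  done

  # steps 3+4: create a classic LVM snapshot (the 20G delta size is just a guess)
  lvcreate -s -L 20G -n "$snap" "/dev/$vg/$res"

  # step 5: resume the replay
  marsadm resume-replay "$res"

  # step 6: check that the snapshot is usable via a read-only mount
  mkdir -p "/mountpoints/$snap"
  mount -o ro "/dev/$vg/$snap" "/mountpoints/$snap"
  umount "/mountpoints/$snap"

  # step 7: delete the oldest (8th) snapshot to free its space
  old="/dev/$vg/${res}-$(date +%F --date='8 days ago')"
  lvs "$old" >/dev/null 2>&1 && lvremove -y "$old"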

Caveats:

A) the current version of MARS is not (yet) tested for such a use case. I currently don't know what might happen when your VG fills up to 100.0% due to too many snapshots, or due to huge snapshots. For example, there might be a kernel crash I would have to fix, or a volunteer could test it and send me a pull request.

B) Beware that some (modern) filesystem types may need a recovery phase during the read-only mount, and this could take quite some time. For example, xfs is known to write some recovered blocks even in "-o ro" mode, although the snapshot device shows read-only flags.

AFAIK ext4 does a different type of internal recovery, so this should not happen with ext4 (but may consume some kernel memory instead). I have never tried btrfs in this respect, so I don't know.

Hint: the recovery size (fs-level journalling log size) can be controlled in various ways, but this depends on the filesystem implementation and on your application behaviour, and is outside of my scope :(

C) Beware that step 7 can lead to masses of random IOPS.

AFAIK, LVM does not use the COW = Copy on Write strategy by default, but historically has used a BOW = Backup On Write strategy.

There are several potential workarounds for C) which I have never really tested (lack of time). For example, newer LVM versions can be configured to use COW or similar update strategies. However, AFAICS there are the following fundamental properties, as observed a decade ago:

BOW => relatively expensive update-in-place, but cheap deletion of snapshots.
COW => cheaper update-in-place, but more expensive deletion of snapshots.

I cannot predict which strategy is better for your concrete use case.

If you want to test this in a lab, take a look at the sister project blkreplay, read its docs, and feed it with large amounts of live measurements obtained from blktrace.
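
As a hedged sketch of capturing such measurements (the device name and duration are placeholders; see the blkreplay docs for the exact input format and conversion steps):

  # record 1 hour of block-level IO from the lower device
  blktrace -d /dev/vg-test/lower-0 -o trace -w 3600
  # convert the binary per-CPU traces into text for further processing
  blkparse -i trace > trace.txt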

Possibly, your workload is relatively well-suited for any of the LVM snapshot strategies. In that case you don't need to spend such an effort: just test whether it works for you, or use dd for some basic functional tests, without diving into details.

This is only my first rough guess. I would be interested in hearing about operational experiences.

Cheers,

Thomas

P.S. I have not yet discussed your encryption / decryption topic, for now. Let's first clarify the replication + snapshot topics.

marksaitis commented 2 years ago

Thank you for the explanation. I think I understood it very well. And yeah, I actually looked more into btrfs - but it seems that it has file-level snapshots only, meaning I want LVM2 and ext4.

Operational experience - whenever I get to this stage, I can share some results.

Questions:

  1. Apart from a theoretical out-of-space crash, could anything else go wrong here? Those UUIDs?
  2. Is it pretty straightforward to deploy the OS on the same LV? I guess I set up MARS during the OS install? Or after?
  3. I read something about bad LVM performance when snapshots are active. But then I read some stuff saying that LVM thin provisioning solves the performance issues with snapshots. Maybe you know more? Otherwise no worries.
  4. Snapshotting lower vs. higher LVs? What's the difference here?

schoebel commented 2 years ago

Ok, I try to answer.

  1. Professionally, I am not a sysadmin. So I am not directly responsible for operations, e.g. installations / rollouts / etc., but for a customized kernel and modules like mars.ko (and some more things). However, I work very closely with a sysadmin team (and sometimes even do some operational tasks for them). Thus I don't expect problems if you are a careful senior sysadmin, you read the MARS docs carefully, and you accept some "learning curve" ;)

  2. No, OS deployment including a pre-configured MARS is currently not just "insert CD and click some buttons". As documented, you currently need to first install your distro, including the necessary build tools for kernel development etc. (also possible on a separate workstation, as I do regularly => separation between devel machines and production machines, also for security reasons), then build a slightly patched kernel (so-called pre-patch) with the additional MARS module, and install according to the instructions. And a reboot with the new kernel, of course.

  2a. I would appreciate it if a major distro would pick up MARS and include it in their professional-grade releases. My long-term goal: inclusion into the kernel upstream (currently lack of time).

  2b. Hmm, trying to deploy the same OS instance (as opposed to a separate VM instance) on top of the LV which carries your application data? No. This type of recursion has nothing to do with MARS. Your partitioning scheme needs to discriminate between "/" (where the OS is living), /mars (which currently needs to be a separate mountpoint fed by a separate LV), and your user data partition /data/$name, or similar (see the example layout below). Depending on the OS and the hardware, some more partitions like EFI & co may be needed. Details may depend on the distro and its releases.

  2c. AFAIC there is a dkms script downloadable from OpenSUSE. I haven't tested it, since I do not use dkms (anymore), which was originally constructed for Debian (which I also use personally in addition to OpenSUSE, but I don't use dkms as a development or installation tool).
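
  To illustrate 2b, here is a purely hypothetical LV layout (all names are examples, not a tested recommendation):

    /dev/vg0/root => mounted on / (OS, not a MARS resource)
    /dev/vg0/mars => mounted on /mars (MARS metadata + transaction logs)
    /dev/vg0/data => disk of a MARS resource => exported as /dev/mars/data => mounted on /data/$name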

  3. Yes, thin provisioning is typically needed by the kernel for the snapshots. I am not using thin provisioning for the base LV itself. Why? More than 10 years ago, when I started MARS in my spare time, thin provisioning didn't work as expected (at that time).

  3a. Nowadays, your experiences may be much better. AFAIK the Redhat guys have invested a lot into LVM2 improvements, and this is highly appreciated by me :)

  3b. AFAIC from theory, thin provisioning always involves some overhead in terms of IO latency (measured via blkreplay long ago). Essentially, there is a natural law: a tradeoff between IO latency and the space consumption from sparse data allocation and its metadata overhead. In general, if you were to fill your basic LV with 100% data, the so-called metadata overhead produced by thin provisioning would lead to >100% total usage. More problems, like an increase in randomness of access, are also possible. So thin provisioning of big LVs could even be counter-productive :(

  3c. AFAIC close to 100% fill level could result from plain 24/7/365 operations over the years. Some counter-measures like regular blkdiscard (and fs-level siblings) might help, under certain circumstances which would be OT here.

  3d. Similarly: your encryption / decryption requirements will likely lead to a similar effect. There may be exceptions, like not encrypting at the fs level, or doing the cryptography at HDD level, etc. Beware: if you ever need fsck & co for repair after hard (!) crashes on top of some encrypted device, static allocations are less vulnerable to data loss. For safe operations, I would simply recommend static LV allocation of the base LV, which is Simple and Stupid(tm), while using thin provisioning at the same time for your backup-like snapshots.

  3e. Disclaimer: these are my old experiences. Newer developments might change the picture, but for really SAFE OPERATIONS I would not rely on assumptions or claims, but on testing of failure scenarios.

  4. Good question. See 3-3e. Another argument for testing certain scenarios, if you want a serious answer.

marksaitis commented 2 years ago

Thank you for the answers. Most is pretty clear. However:

  1. So with normal LVM (without thin volumes) plus snapshots - might performance be better than the "6 times slower" I read about, in your opinion? I could not find any good benchmarks on this anywhere from the last 5 years.
  2. Patching and compiling the kernel - that's OK, I get this part. BUT I did not understand the answer very well. I have a simple need. The primary site, for example, has 1 disk, sda. The Linux OS will live on it together with some data (for example). Can I not use MARS to sync this sda disk (which has the main and only OS) to a secondary site? Do you mean I can only sync some other LV, but not the OS itself? Because imagine the primary fails: I would like to just activate the secondary - so everything would look exactly the same there. Same OS configs, same OS files, same data...

schoebel commented 2 years ago

1. [...] I could not find any good benchmarks on this anywhere from the last 5 years.

I currently don't have the time to repeat my old blkreplay measurements on modern hardware :(

If you are interested, you can play around with the blkreplay tool.

2. [...] The primary site, for example, has 1 disk, sda. The Linux OS will live on it together with some data (for example). Can I not use MARS to sync this sda disk

It depends. Here is an explanation, and some ideas for a potential resolution:

A. Explanation: the /mars partition contains the local state of the replication, e.g. the MARS-level transaction logfiles. Whenever the MARS transaction logger writes a transaction record to /mars, this write IO would be recorded once again (because it goes to /dev/sda), leading to another write of the same block (also on /dev/sda), and so on, in an endless loop.

B. Workaround: spend another disk /dev/sdb which is NOT mirrored via MARS. Then place everything onto it that might otherwise lead to a similar endless recursion.

This sounds easy when just looking at the /mars partition and the MARS transaction logs.
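
A purely illustrative split along these lines (all names are examples, not a recommendation):

  /dev/sda  => MARS resource, replicated to the peer site
  /dev/sdb1 => /mars (transaction logs and local replication state, NOT replicated)
  /dev/sdb2 => /var/log, /run, swap, ... (other local state, NOT replicated)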

However, this task is more complicated than it looks. There are some more cases (here are my incomplete thoughts):

Whenever the kernel writes something to dmesg, it will finally land in /var/$something or /run/$something (typically via journald etc.). In case of any unexpected problems (thus not easily testable), e.g. leading to kernel stack traces, there may be a similar endless recursion, just occurring on the kernel syslog. How do you replicate a kernel syslog entry from primary host A to a secondary host B, which might believe that the defect had been produced by itself, not by the peer? How do you discriminate this? And how do you write log entries from B into the same syslog file which already contains the records from A, without overwriting them? And so on.

MARS needs to discriminate the two hosts A and B from each other. It simply uses $(hostname) which delivers different results on each of the machines.

C. Are you sure that you want to also mirror your IP address configs, your MAC address configs, and so on?

Example, for analysis of some unexpected hardware problem: how do you want to log in to both A and B in parallel, without involving different IP addresses and different DNS hostnames?

Well, there might be a non-standard solution for the discrimination task, e.g. via some non-standard symlinks pointing from something residing on /dev/sda to some non-replicated local file residing on /dev/sdb, or some non-standard bind mounts, or similar. I have never tried anything like this on modern machines :(

The only thing I have tried was around 1993, when playing with a crude feature of HP-UX: it had a special inode type called "context-specific file" or similar, if I remember correctly. However, the userspace tools were all patched with special command-line options, deviating from standard UNIX commands. Would this be a good idea for today's Linux?

Another solution, which is already in production: mirror the / filesystem uni-directionally from a so-called "golden master image" via rsync and some --exclude= regexes, or similar. This is known to work when configured and maintained properly. However, it is certainly not a standard installation of a standard distro ;)
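
A rough sketch of that approach (the exclude list is purely illustrative and certainly incomplete; host and target names are placeholders):

  rsync -aHAX --delete \
      --exclude=/mars --exclude=/proc --exclude=/sys --exclude=/dev --exclude=/run \
      --exclude=/etc/hostname --exclude=/etc/network/ \
      goldenmaster:/ /target-root/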

Further ideas?

What about Chef / Puppet / Ansible / and so on? What about the effort to do this correctly?

Well, wouldn't this be a nice OpenSource project for somebody who wants to support MARS from a community approach?

schoebel commented 2 years ago

Sorry. I forgot some further possibilities / solutions, related to your question #2.

Much more explanations are in architecture-guide-geo-redundancy.pdf ;)

If this book is too big for you, concentrate on the section on Dijkstra's rules, for a high-level analysis.

If you need more low-level info, read the relevant examples in this book. Some of them should hopefully be usable for you.

If you have more questions which are NOT treated in this book, please ask me, by posting your questions here.

I am interested in completing this book with some more info I already have.

For example, Bergonz mentioned KVM/qemu. I know that such setups work on top of MARS. There you have 3 IPs: 1 static IP per hypervisor (each of them residing in a different datacenter), and the 3rd IP is for the location-transparent VM, typically routed dynamically via BGP. This is what end customers typically see, while the hypervisors are more or less "hidden" (and the hypervisors may use private IPs, not routed to the public internet, and not competing for the precious IPv4 namespace).

I also mentioned such setups in public presentations, e.g. at LCA or FrOSCon. All are available as PDFs here, some of them also as video recordings (homepages / youtube).

There is another variant of such a 3-IP setup: use LXC containers in place of KVM/qemu. As I know from experience, it is possible to create location-transparent LXC containers on top of MARS.

The main difference is that you have only 1 kernel instance, where both the hypervisor and the "VM" are running, just with different IP setups, and potentially even different Linux distros (running under the same kernel, which is possible).

A. Advantage: much better performance, as measured by me some years ago (for a certain use case).

Why? Running 2 or more kernels on the same hardware involves more overhead than expected, depending on workload properties / use cases.

B. LXC needs only 1 kernel per physical machine, and each LXC container can be used as a "light VM" when configured properly.

This needs some more sysadmin effort for setup and operations. But it is also manageable, as I know from experience.

C. Not tested: some more options, see for example the criu project from some eager Russian developers. I would like to hear stories about combining criu with MARS, e.g. over long distances.

D. I am also working on WIP-prosumer. Check out this branch from the MARS repo, and look into the docu/ directory there. There are some new chapters in *.pdf you should read. Currently I have too many other jobs, so there is only very slow progress in this branch. But work is likely to resume by the end of this year, because some people want to have this future MARS feature ;)

E. Further ideas, after reading docu/architecture-guide-geo-redundancy.pdf : feel free to discuss it here.

marksaitis commented 2 years ago

Thanks for explaining everything. I understand the IP implications etc.; there are solutions for that. But let's simply forget site B (secondary) for the moment and concentrate just on site A (primary) for the OS install. So OK, we want to have this layout, for simplicity:

/dev/sda1 for the mount of / (OS goes here + OS data)
/dev/sda2 for the mount of /boot
/dev/sda3 for the mount of /mars

How can we use MARS here so that we replicate all of this OS to site B, onto whatever drives are there? (Let's not worry about how B is set up; as long as we have an exact copy, it's OK for me. There are flexible options there.)

In simple terms, imagine you boot your normal basic Linux computer and use it, but you always know that a complete copy of it is saved at site B as you use it.

So to install MARS here, I see 2 options: A. Create the MARS resources + partitioning during the OS install via a pre-install shell. B. Install MARS after the OS is installed.

Or perhaps there is another way somebody has done or envisioned? Would one of my guessed options work? Or is there already a specific way to do it? Is this achievable?

schoebel commented 2 years ago

I am struggling with your requirements. Either I am understanding you incorrectly, or we are probably on different pages.

First, an attempt to clarify:

Essentially, your /boot wouldn't be replicated, only /dev/sda1 needs to be replicated?

Well, then I would suggest that you patch the boot process in some reasonable way. For example, it might be possible to redirect the final mountpoint "/" to an interim mount of /dev/mars/sda1 in place of the usual /dev/sda1, so the end user would see almost the same as before, but would just see an interim /mars/ substring when typing "df". Possible for you?

Well, this sounds reasonable. I am not an expert in such things. AFAICS such a modified setup might be possible during the initrd phase of the boot process. Even the /mars part could be imported into the modified root-mount setup in some way, e.g. via bind mounts, or some siblings like fs namespaces, etc. If this is your goal, then please open a new discussion thread here. It would be OT in the context of this discussion.
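
Just to make the idea slightly more concrete, a purely speculative illustration (assuming an initramfs that already loads mars.ko and mounts /mars early enough, which nobody has verified here): the kernel command line would then point at the MARS device instead of the raw partition, e.g. root=/dev/mars/rootfs rootfstype=ext4, where "rootfs" is a hypothetical resource name.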

If and only if your goal is different, then here is my next try, to overcome what looks like "conflicting requirements" to me:

Your term "complete copy" looks like you simply want to replicate the complete physical disk /dev/sda regardless what is on it, and regardless for whatever it is used for at any moment in time. Right?

Well, this is non-trivial under the current hardware market conditions. If I could, I would suggest that MARS be used inside the firmware of the disk. However, you are probably not a hardware manufacturer ;)

Well, if you could use (or would create) an OpenHardware project (like an OpenHardware competitor to commercial hardware RAID controllers), there would be many more possibilities. This would certainly be solvable under certain conditions.

Alternatively, the future MARS-based WIP-prosumer feature is essentially a replacement for the locally attached hardware behind /dev/sda, but using TCP/IP protocols (over whatever fabrics) in place of SATA or SAS, or similar.

If you want to do everything not only in software and/or on given hardware (or even any given hardware) including the entire (!) storage, but also on exactly the OS instance which is to be replicated by itself, this would be a contradictory requirement AFAIC. Maybe I have overlooked something, so please point me to it.

If not, then I have already tried to explain most of the options I know.

Well, in place of /dev/sdb, some NVRAM would also be possible for /mars. But I guess this was not your question.

Sorry, I am now stuck in this discussion. What did I overlook?

If somebody has a better idea, I would like to know about it, or whether there is already a solution, whether it works in practice, at which scale / for which use cases / workload patterns, and in which reliability class, etc.

marksaitis commented 2 years ago

Snapshots seem pretty clear at this moment. Let me create a new discussion for this.