schoebel / mars

Asynchronous Block-Level Storage Replication
GNU General Public License v2.0

Discard support #22

Open schefi opened 4 years ago

schefi commented 4 years ago

Hello,

First of all let me thank you for your amazing work! I would like to inquire whether it would be difficult to add Trim/Discard support to the Mars block device?

I will describe my use case; maybe you can offer some suggestions. We are in the middle of some infrastructural changes at the technical faculty of the local university. Currently only one major is affected, but we would like to involve the whole faculty someday. I found your work on Football and sharding just perfect for our needs.

Let me describe: at the moment we use Mars on a pair of servers to provide hot failover for VMs in case one of the hosts fails. There are a lot of VMs, both belonging to the infrastructure (learning management system, teleconferencing server, etc.) and to remotely accessible virtual labs for the students to practice on. This is basically a tiny (two-node) cluster managed with OpenNebula; it stores all LXC containers and full VMs in qcow2 images. I have the images on top of a filesystem on top of a single Mars device. In case the primary server fails, I promote the Mars device on the secondary and continue to run (almost) from where the primary failed.

What we would like to do is scale out to more than two servers, and in case of a failure simply increase the density of VMs on the remaining hosts. Your Football solution with sharding would be great for this. I plan to write an OpenNebula storage driver which would combine Mars Football and LVM thin volumes. The current OpenNebula storage driver only supports thick LVM (which is not good for snapshotting), so that needs some adjustment anyway. While at it, I would also like to use this new storage driver as a shared block storage, which makes it necessary to implement a version of the Football concept with sharding in OpenNebula.

The idea is the following: if my proposed driver created a Mars resource for every VM instance, and then one thin LVM pool on top of Mars, I would have the option of creating efficient snapshots of the VM. Once the cluster manager schedules the VM to another host, the Mars device is promoted, and all volumes relevant to that specific VM become available on the new host, along with the snapshots, giving the chance to revert to previous snapshots if needed.

The problem is that I would have to allocate a large space for the backing device of the Mars device in advance, to accommodate the potential growth of the image and the space requirement of the snapshots. In order to avoid committing much more space than needed in advance, and in order to reclaim the space that is freed up when deleting snapshots, it would be great for the Mars device to pass down discard requests to the underlying storage. Currently my only idea to save storage space on raw disk without discard support is to create the Mars device on top of kVDO (on top of HW-RAID), but that still requires the empty space to be overwritten with zeroes at regular intervals, which would also produce a large amount of unnecessary replication traffic.

I would like to ask how difficult it would be to implement discard and whether you plan to implement it in the near future. So far (without any hope of understanding the whole Mars code) I found in mars_if.c that the BIO_RW_DISCARD property does get copied from orig_biow, which I hoped to mean that if the underlying device supports it, then Mars would also support it. Obviously I'm not getting it right, because no matter what I do, the Mars device itself always has a discard_granularity of 0, regardless of the underlying device's discard capability.
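(For context, and purely as an illustration rather than a description of the existing MARS code: a block driver only advertises discard support if it sets the discard limits on its own request queue; merely copying the discard flag from incoming bios is not enough, which would match the discard_granularity of 0 observed above. A minimal sketch of what such setup typically looks like, with a hypothetical helper name and kernel API names that vary between versions:)

```c
/* Hypothetical sketch only -- not part of mars_if.c. Exact API names
 * differ between kernel versions (e.g. blk_queue_flag_set() appeared
 * around 4.17; older kernels used queue_flag_set_unlocked()). */
static void if_copy_discard_limits(struct request_queue *q,
				   struct block_device *lower_bdev)
{
	struct request_queue *lq = bdev_get_queue(lower_bdev);

	if (blk_queue_discard(lq)) {
		/* Mirror the lower device's discard geometry ... */
		q->limits.discard_granularity = lq->limits.discard_granularity;
		q->limits.discard_alignment   = lq->limits.discard_alignment;
		blk_queue_max_discard_sectors(q, lq->limits.max_discard_sectors);
		/* ... and announce the capability to upper layers. */
		blk_queue_flag_set(QUEUE_FLAG_DISCARD, q);
	}
}
```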

Thanks a lot, schefi

schoebel commented 4 years ago

Hi Schefi,

in theory, discard support should not be extremely hard. I have had it planned for a long time, but up to now nobody ever asked for it.

Your use case is clear to me. You don't need to convince me: it is a required feature, at some point in the future.

Technically, the transaction logfile needs to be augmented with an empty record (no data) but describing the discard command. Reason: the secondaries need to replay it in the correct order. The discard command must not commute with any other write.
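To illustrate the idea (a hypothetical layout, not the actual MARS logfile format): such a record would carry only the offset and length of the discarded region plus a flag, so that replay on the secondary can issue the same discard at the same position in the write stream.

```c
#include <linux/types.h>

/* Hypothetical illustration of a data-less "discard" log record.
 * The real MARS transaction logfile format is different; this only
 * shows the information such a record would have to carry. */
struct hypothetical_discard_record {
	u64 log_seq_nr;    /* position in the log, fixes replay order  */
	u64 dev_offset;    /* start of the discarded region (bytes)    */
	u64 dev_len;       /* length of the discarded region (bytes)   */
	u32 flags;         /* e.g. a RECORD_IS_DISCARD bit, no payload */
};
```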

A tricky part could be the current writeback strategy at the primary side, which is out-of-order.

A relatively simple but less performant solution would be to enforce a complete drain of the writeback queue while pausing all incoming write requests, then issue the discard, then continue normally. This type of per-source lock would be a bottleneck for a limited time.
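A rough sketch of that serialization, just to make the ordering explicit; every name below is invented for illustration and is not part of the MARS code base.

```c
#include <linux/types.h>

/* All names below are hypothetical and exist only for this sketch. */
struct hypothetical_resource;

void pause_incoming_writes(struct hypothetical_resource *res);
void drain_writeback_queue(struct hypothetical_resource *res);
void append_discard_log_record(struct hypothetical_resource *res, loff_t pos, loff_t len);
void issue_discard_to_lower_device(struct hypothetical_resource *res, loff_t pos, loff_t len);
void resume_incoming_writes(struct hypothetical_resource *res);

/* The "simple" variant: serialize the discard against all in-flight
 * writeback, at the cost of a short per-resource stall. */
static int handle_discard_simple(struct hypothetical_resource *res,
				 loff_t pos, loff_t len)
{
	pause_incoming_writes(res);               /* per-resource, not global         */
	drain_writeback_queue(res);               /* flush all out-of-order writes    */
	append_discard_log_record(res, pos, len); /* keep replay order on secondaries */
	issue_discard_to_lower_device(res, pos, len);
	resume_incoming_writes(res);
	return 0;
}
```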

Theoretically, it could be possible to implement some more parallelism. If we had two (or more) writeback queues per resource instead of one, then the draining of the old queue could run in parallel to the queueing-up of the new writeback queue.
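A minimal double-buffering sketch of that idea (hypothetical names, not MARS code):

```c
#include <linux/list.h>
#include <linux/types.h>

/* Hypothetical sketch: writes arriving after a discard go into a fresh
 * queue while the old queue drains in parallel; the discard is issued
 * once the old queue is empty, preserving its ordering against all
 * earlier writes. */
struct hypothetical_wb_state {
	struct list_head old_queue;   /* draining: writes issued before the discard */
	struct list_head new_queue;   /* filling: writes issued after the discard   */
	bool discard_pending;         /* issue the discard when old_queue is empty  */
};
```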

My main problem is: I have to maintain an SLA of 99.98% at 1&1 Ionos. Any change to the transaction logger (the very heart of MARS) needs to be tested very thoroughly. If you require a full-fledged parallelized solution where discards need to be buffered between two writeback-queue instances, it will result in much more work, and in much more testing.

The question is: would the lesser performance of the simple solution be acceptable, at least for now?

I think the answer also depends on the frequency of discard commands. Also, it would be helpful to have multiple resources, so their respective discards could run in parallel to each other (or to ordinary IO).

Essentially, the two solutions are not mutually exclusive.

My personal work queue is very full at the moment: in addition to WIP-scalablitiy-resource, I will push WIP-CRC32C shortly (since I have already invested a few man months into it). Afterwards, the remote device is on the current roadmap, needed for 1&1.

So I am not sure how to fit your request into my current roadmap.

If somebody could help me, it would be great.

I would be very interested in your contributions to OpenNebula. Please consider submitting patches to the OpenNebula project, and/or to the MARS repo in a separate contrib/opennebula/ directory for which you could take over the maintainer role (if you like).

I am not sure whether I have analyzed all the pitfalls correctly at the moment. Please give me a few nights to sleep over it.

Cheers,

Thomas

schefi commented 4 years ago

Dear Thomas,

Thank you for the quick reply. I can understand that your work queue is full. I just add new features/services and try to widen the faculty's teaching methods in my free time. Experimenting with ways to implement the virtual student labs and giving some classes at the university are both only part-time jobs for me. I don't engineer replication at one of the largest hosting companies, I don't even have an SLA at all, and even so my own work queue is almost full with university tasks. So I perfectly understand that you are busy. I actually would have been a bit surprised if you had said otherwise.

I intended to contribute the OpenNebula code anyway, but as of yet it is only a plan; I have not written a single line. You saying some promising things about getting discard to work might actually trigger me to start this project, as the lack of discard has so far made me reluctant to start reworking the thick LVM driver and to touch the Mars Football question.

I would offer my help in the kernel module development, but I think I may be short of competence for this. So far I have only indirectly contributed to the Linux kernel with bug reports and trace logs. It just occurred to me that there was a bug in bcache I reported, and it was also related to discard. The way blk_queues were split in writeback caching mode caused some problems, until they fixed it in 5.1 a bit more than a year ago. It seems this problem is circling me. Sorry, I don't want to wander too far off topic.

To answer your question: YES, it would be great to have discard even in the simpler, less performant way, with pausing and flushing the queue before executing the discard. Discards don't have to happen too often; the main source of them (in my case) is going to be the deletion of snapshots, which can be cron-scheduled to once a day or maybe once a week. A little stall in I/O once a day is not even going to be noticed, I think. I don't need immediate discards on filesystem-level deletions either, so a scheduled fstrim is completely OK. I think that even the simpler solution would be perfectly fine for most users. Another piece of good news is that in my intended use case there will be multiple independent resources, one for each VM/container instance. As you mentioned, this can be useful, since not the whole host stalls on a discard, just the queue of a single resource.

As I mentioned before, I don't feel capable of writing this myself, as it would require me to immerse myself in the depths of the Mars source code, but maybe I can help you with some testing, some trial and error, recompile-test-report scenarios for a discard-enabled branch, or maybe even write some code myself if you point me in the right direction.

Thanks, schefi

schoebel commented 4 years ago

Hi Schefi,

there is small progress: today I talked with my boss. Discard support is now on my internal 1&1 Ionos roadmap, but with "background" priority.

Before starting a new branch WIP-discard-support, I want to fix all the bugs in WIP-CRC32C, so that I can push that branch first.

Reason: for backwards compatibility and for mixed operation of different mars.ko versions, the minimum feature set of the whole cluster must be computed. This computation is already in WIP-CRC32C. The new "discard" feature will only be activated when all cluster members can support it.
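As an illustration of that negotiation (hypothetical names and bit assignments, not the WIP-CRC32C code): each node advertises a feature bitmask, and a feature like discard is only enabled once it survives the AND over all cluster members.

```c
#include <stdint.h>

#define FEATURE_CRC32C   (1u << 0)  /* hypothetical bit assignments */
#define FEATURE_DISCARD  (1u << 1)

/* Hypothetical sketch: the cluster-wide feature set is the intersection
 * of what every member's mars.ko version can support. */
static uint32_t cluster_features(const uint32_t *node_features, int nr_nodes)
{
	uint32_t common = ~0u;
	int i;

	for (i = 0; i < nr_nodes; i++)
		common &= node_features[i];
	return common;   /* discard is active only if its bit survives */
}
```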

I hope to be able to push WIP-CRC32C in a few weeks. There is 1 important bug left to fix, and some heavy testing is then also required.

Cheers,

Thomas

schefi commented 4 years ago

Thank you, that's great news!

alexpacio commented 6 months ago

I'm also interested in this feature, is there any progress?