hse-project / hse

HSE: Heterogeneous-memory storage engine
https://hse-project.github.io

DB Backup / Restore Options #4

Open gitspeaks opened 4 years ago

gitspeaks commented 4 years ago

Please provide instructions on how to do point in time online backup of a KVDB and how to do a complete KVDB restore.

smoyergh commented 4 years ago

HSE works with any SSD-backed volume (e.g., LVM, SAN array, cloud volume), so you can take a snapshot of the volume you configured for your capacity media class, following the method appropriate for your volume manager. Best practice is to halt the KVDB application to flush all data prior to taking the snapshot; otherwise you will have a crash-consistent snapshot.

If you configure both a capacity and staging media class, you need to snapshot the volume associated with each. In this case, you need to halt the KVDB application to ensure the two volumes are in sync when you take the snapshots.

gitspeaks commented 4 years ago

Thanks. This method is effectively an “offline backup”. Please consider enhancing the engine to support backing up the database, while the engine is running, to a file that can simply be moved off the machine after the backup completes.

smoyergh commented 4 years ago

We anticipate that HSE will most often be used as part of an application that has its own backup method. For example, we integrated HSE with MongoDB as a proof point, and users of MongoDB with HSE would then likely use one of the several backup methods that MongoDB provides (e.g., mongodump).

That said, there is certainly utility in a native dump/restore method, which we'll consider for future inclusion.

gitspeaks commented 4 years ago

We anticipate that HSE will most often be used as part of an application that has its own backup method.

As far as HSE is concerned, what would be the correct way (API) to enumerate all KVs in all DBs in order to read a consistent snapshot of the entire dataset?

gitspeaks commented 4 years ago

Do you include KVDB/KVS snapshot version in each individual KVDB/KVS/KV object ?

smoyergh commented 4 years ago

There is an experimental API (include/hse/hse_experimental.h) to dump and restore a KVDB that likely does what you want. Keep in mind that we use it primarily as an internal utility, so it hasn't been through rigorous testing.

gitspeaks commented 4 years ago

Thanks. I’ll run some tests. Is this export API designed to run concurrently with other reader/writer threads, or does it lock out writer threads while it writes out the KVDB to the file?

davidboles commented 4 years ago

As alluded to above, the import/export APIs are experimental. Since HSE is designed to be embedded within something else, it isn't entirely clear where the boundaries of a backup interface should be. Input from the community would be greatly appreciated.

As to what is currently implemented, there are two experimental API entry points: hse_kvdb_export_exp() and hse_kvdb_import_exp() (these are declared in hse_experimental.h). These are effectively only usable for off-line backup. The first takes an open KVDB handle, a params object, and a target path for the output. The second takes an mpool name and a source path.

For hse_kvdb_export_exp(), all of the KVSs in the KVDB are exported, and none of them may already be open as of the current code. In turn, each KVS is opened, a cursor over it is created, and a work job is started to export that KVS. The export call then waits until all the exports complete, closes the KVSs, and returns.
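A minimal sketch of how these two entry points might be called, assuming the parameter order matches the description above (an open KVDB handle, params, and a target path for export; an mpool name and a source path for import); the wrapper names and error handling are illustrative, not verified against the header:

#include <hse/hse.h>
#include <hse/hse_experimental.h>

/* Sketch only: parameter order follows the description above and may
 * differ from the actual declarations in hse_experimental.h. */

/* Export: the KVDB must be open, but none of its KVSs may be open. */
int export_kvdb(struct hse_kvdb *kvdb, const char *backup_path)
{
    hse_err_t err;

    err = hse_kvdb_export_exp(kvdb, NULL, backup_path);
    return hse_err_to_errno(err);
}

/* Import: recreates the KVDB in the named mpool from the export files. */
int import_kvdb(const char *mpool_name, const char *backup_path)
{
    hse_err_t err;

    err = hse_kvdb_import_exp(mpool_name, backup_path);
    return hse_err_to_errno(err);
}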

To your earlier question: "Do you include KVDB/KVS snapshot version in each individual KVDB/KVS/KV object ?", the answer is "yes". The hse_kvdb_export_exp() function can be enhanced, either a little or a lot ... community feedback is key in charting that path.

One possible enhancement would be to allow it to be called on a KVDB with open KVSs and take an ephemeral snapshot at the beginning of the call so that each KVS would be dumped at the same view. That call would have to return after the export work has started and there would have to be a status interface for the client to check on export progress (including cancelling it, etc.). The enclosing application would then only have to quiesce itself across the call to hse_kvdb_export_exp().

gitspeaks commented 4 years ago

@davidboles Thanks for clarifying!

One possible enhancement would be to allow it to be called on a KVDB with open KVSs and take an ephemeral snapshot at the beginning of the call so that each KVS would be dumped at the same view. That call would have to return after the export work has started and there would have to be a status interface for the client to check on export progress (including cancelling it, etc.).

Yes, these would be key requirements defining the backup part of the solution, but I have yet to understand how you manage a snapshot version per object in a way that supports creating such an ephemeral snapshot.

Alternatively, going back to the initial suggestion of taking a snapshot of the disk volume: assuming I can tolerate write downtime for the duration of a disk snapshot (which should be several seconds at worst) and can ensure in my application code that only reader threads, and no write threads, interact with the engine API, what should I do to ensure ALL memory buffers affecting the integrity of the stored data are flushed to disk before I invoke the snapshot operation? Also, can you relax hse_kvdb_export_exp() to work with a pre-opened KVS?

smoyergh commented 4 years ago

You can call hse_kvdb_flush() to ensure the data from all committed transactions, and completed standalone operations, is on stable media.
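Put together with the quiesce step discussed above, the application-side sequence might look like the following sketch, where pause_writers(), resume_writers(), and take_volume_snapshot() are hypothetical application hooks, not HSE APIs:

#include <hse/hse.h>

/* Hypothetical application hooks, not HSE APIs. */
extern void pause_writers(void);
extern void resume_writers(void);
extern void take_volume_snapshot(void);

/* Sketch: quiesce writers, flush committed data to stable media, take
 * the volume snapshot outside of HSE, then resume writes. */
int snapshot_kvdb(struct hse_kvdb *kvdb)
{
    hse_err_t err;

    pause_writers();

    err = hse_kvdb_flush(kvdb);   /* persist committed transactions and
                                     completed standalone operations */
    if (!err)
        take_volume_snapshot();   /* LVM/SAN/cloud snapshot, outside HSE */

    resume_writers();
    return hse_err_to_errno(err);
}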

gitspeaks commented 4 years ago

@smoyergh Thanks. Regarding my last query about relaxing hse_kvdb_export_exp() to work with a pre-opened KVS, I should have added that this may allow an additional backup option if the dump process turns out to be quick, resulting in low write downtime. The plus here is of course avoiding the additional complexity of dealing with volume snapshot programs and volume copying for off-machine storage.

smoyergh commented 4 years ago

A method to dump a consistent snapshot of a live (online) KVDB is to create a transaction, which creates an ephemeral snapshot of all KVS in the KVDB, and then use transaction snapshot cursors to iterate over all KV pairs in each KVS.
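A rough sketch of that approach for a single KVS, assuming the HSE 1.x transaction and cursor interfaces (an opspec carrying the transaction); the exact names and flags are assumptions to verify against hse.h:

#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <hse/hse.h>

/* Sketch: iterate every KV pair in one KVS at a single transactional
 * view; repeating this over each KVS with the same transaction yields a
 * consistent snapshot of the whole KVDB.  Names and flags follow the
 * HSE 1.x API and should be verified against hse.h. */
int dump_kvs_snapshot(struct hse_kvdb *kvdb, struct hse_kvs *kvs)
{
    struct hse_kvdb_opspec  os;
    struct hse_kvdb_txn    *txn;
    struct hse_kvs_cursor  *cur = NULL;
    const void *key, *val;
    size_t      klen, vlen;
    bool        eof = false;
    hse_err_t   err;

    txn = hse_kvdb_txn_alloc(kvdb);
    if (!txn)
        return ENOMEM;

    err = hse_kvdb_txn_begin(kvdb, txn);
    if (err)
        goto out;

    HSE_KVDB_OPSPEC_INIT(&os);
    os.kop_txn = txn;    /* cursor reads from the transaction's view */

    err = hse_kvs_cursor_create(kvs, &os, NULL, 0, &cur);
    while (!err && !eof) {
        err = hse_kvs_cursor_read(cur, NULL, &key, &klen, &val, &vlen, &eof);
        if (!err && !eof) {
            /* write key/val to the backup stream here */
        }
    }

    if (cur)
        hse_kvs_cursor_destroy(cur);
    hse_kvdb_txn_abort(kvdb, txn);    /* read-only: nothing to commit */
out:
    hse_kvdb_txn_free(kvdb, txn);
    return err ? hse_err_to_errno(err) : 0;
}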

Long-term we can consider stable APIs for both live dump and pause (allowing time to take volume snapshots).

gitspeaks commented 4 years ago

A method to dump a consistent snapshot of a live (online) KVDB is to create a transaction, which creates an ephemeral snapshot of all KVS in the KVDB, and then use transaction snapshot cursors to iterate over all KV pairs in each KVS.

Having zero knowledge about how things work internally I can only speculate that executing such “Backup transactions” on a live Db may delay reclamation of the storage associated with objects that are modified while traversing and writing out the dump. If that’s true, how can I monitor that impact in terms of increased KVDB/KVS size / IO ?

davidboles commented 4 years ago

Having zero knowledge about how things work internally I can only speculate that executing such “Backup transactions” on a live Db may delay reclamation of the storage associated with objects that are modified while traversing and writing out the dump. If that’s true, how can I monitor that impact in terms of increased KVDB/KVS size / IO ?

The impact would be in terms of space amplification, not I/O. Newer data is found first so the older, as yet un-reclaimed data generally wouldn't be accessed by query activity. We do not currently expose space amplification data. The engine does keep estimates of garbage levels using HyperLogLog-based mechanisms, but we don't publish that info.

Taking a few steps back - if you have a database that is substantially overwritten in the time it takes to perform an export then there will be a space-amp penalty that you will have to account for. Such a database is unlikely to be very large - if it was, you wouldn't be able to overwrite most of it in the export interval.

gitspeaks commented 4 years ago

About reclaiming the garbage, I assume this would be done during compaction (e.g., the hse_kvdb_compact API). Is this the only way available to force GC “now”? Does it block access to KVDBs? Otherwise, how can I know when garbage is actually reclaimed?

smoyergh commented 4 years ago

Compaction is done in the background, as needed, both for GC and to optimize storage layout. And there are multiple flavors of compaction tied to how we organize storage. The hse_kvdb_compact() API exists to proactively put the system into a known state of "compactedness". We use it internally primarily for benchmarking to get consistent results.

gitspeaks commented 4 years ago

Aside from having things neat for benchmarking, I'm not clear on when one would use this API.

hse.h, hse_kvdb_compact description:

In managing the data within an HSE KVDB, there are maintenance activities that occur as background processing. The application may be aware that it is advantageous to do enough maintenance now for the database to be as compact as it ever would be in normal operation.

What info does the engine publish that can be used by the application to determine if it's advantageous to do enough maintenance "now" ?

smoyergh commented 4 years ago

Some applications may have a natural period of time when the load is low, and could choose to call hse_kvdb_compact() as a proactive action to take care of maintenance at that time. The API only compacts down to certain thresholds related to garbage and elements of the storage organization that influence performance, so it won't do any more work than necessary to achieve these thresholds. Hence a metric of when it is advantageous wouldn't be all that beneficial.

But again, we implemented it to support consistency in benchmarks, not because we really need the application to call it for regular operation.

gitspeaks commented 4 years ago

I think it would be beneficial if you expose some performance counters for this process including what you consider as “performance thresholds” to allow users to observe the engine dynamics during a particular workload.

davidboles commented 4 years ago

I think it would be beneficial if you expose some performance counters for this process including what you consider as “performance thresholds” to allow users to observe the engine dynamics during a particular workload.

If you call hse_kvdb_compact() on a KVDB, access to that KVDB is in no way restricted. Further, if you initiate such a compaction operation on an approximately idle database, wait until that completes, and then invoke it again, the call is effectively a no-op. Looking in hse/include/hse/hse.h we find this structure:

struct hse_kvdb_compact_status {
    unsigned int kvcs_samp_lwm;  /**< space amp low water mark (%) */
    unsigned int kvcs_samp_hwm;  /**< space amp high water mark (%) */
    unsigned int kvcs_samp_curr; /**< current space amp (%) */
    unsigned int kvcs_active;    /**< is an externally requested compaction underway */
};

That struct can be retrieved via hse_kvdb_compact_status().
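A small sketch of how an application might drive that pair of calls during a low-load period, polling the status struct shown above; the flag name and exact signatures are assumptions to check against hse.h:

#include <stdio.h>
#include <unistd.h>
#include <hse/hse.h>

/* Sketch: request compaction down to the low water mark, then poll the
 * status struct until the externally requested compaction completes.
 * HSE_KVDB_COMP_FLAG_SAMP_LWM and the exact signatures are assumptions
 * to be verified against hse.h. */
int compact_and_wait(struct hse_kvdb *kvdb)
{
    struct hse_kvdb_compact_status st = { 0 };
    hse_err_t err;

    err = hse_kvdb_compact(kvdb, HSE_KVDB_COMP_FLAG_SAMP_LWM);
    if (err)
        return hse_err_to_errno(err);

    do {
        sleep(1);
        err = hse_kvdb_compact_status(kvdb, &st);
        if (err)
            return hse_err_to_errno(err);
        printf("space amp: curr %u%% (lwm %u%%, hwm %u%%)\n",
               st.kvcs_samp_curr, st.kvcs_samp_lwm, st.kvcs_samp_hwm);
    } while (st.kvcs_active);

    return 0;
}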

There are in fact many performance counters in the system that are exposed via a REST API against a UNIX socket. We will be documenting that interface in the future. See hse/include/hse/kvdb_perfc.h for those. Not all are enabled by default. You can see usage of the REST interface in hse/cli/cli_util.c.

gitspeaks commented 4 years ago

@davidboles Thanks! This is all reassuring. Hopefully more documentation will arrive soon. BTW, are you aware of any gotchas building the project on Ubuntu?

smoyergh commented 4 years ago

HSE currently supports RHEL 7.7 and 8.1. We have not tested other platforms.

Porting HSE will likely require changes to the mpool kernel module. We are working with the Linux community to get mpool upstream, but that will take some time.

In advance of that, we are looking at adding support for Ubuntu 18.04.4. However, we cannot commit to a specific time frame.

gitspeaks commented 4 years ago

we are looking at adding support for Ubuntu 18.04.4

That would be great!! My environment is based on Ubuntu, so I'll continue to experiment with the project once you provide the build instructions.

victorstewart commented 4 years ago

@smoyergh I use Clear Linux, so I'll report any issues if I face them. I assume running inside a CentOS/RHEL docker container would not eliminate any potential issues since such issues would arise from inside the kernel?

victorstewart commented 4 years ago

@davidboles

To add my perspective here... I'm currently building a Redis Enterprise-like clone based around HSE. Since distributed databases rely on active replication, and you thus get "backups for free", my only whole-database backup needs revolve around seeding new instances/clusters (most likely in new datacenters over the network).

So I've been researching how to best accomplish this today after being pointed in the right direction by @smoyergh.

My intuition is that no serialization scheme such as hse_kvdb_export_exp could compare to the performance (and 0 downtime for free) of 1) creating an LVM snapshot, 2) dd-ing the volume to a file, and then 3) concurrently compressing it to make it sparse?

smoyergh commented 4 years ago

A container uses the host kernel, and we have never tested mpool with the Clear Linux kernel. We are very close to posting our next release, which simplifies mpool considerably and makes it run with a wider range of kernels. So if the current version of mpool doesn't work with Clear Linux, the soon-to-be-released version might.

That said, the upcoming release has only been tested with RHEL 7, RHEL 8, and Ubuntu 18.04.

smoyergh commented 4 years ago

Regarding backup performance, I agree that backing up an mpool volume snapshot via whatever mechanism is native to the environment (whether LVM, AWS EBS, a SAN array volume, etc.) is going to be faster than the HSE serialization APIs.

victorstewart commented 4 years ago

https://github.com/datto/dattobd

Dattobd seems like a better option than creating a snapshot volume: you just create a snapshot file.

victorstewart commented 3 years ago

another variable here is that if you're trying to send your backup over the network (or just replicate your entire dataset to another machine), serialization might actually be faster when the occupancy of your volume is low.

maybe you only have 1GB of data in a 500GB volume. so if you snapshot that, then dd + compress + ssh it over the network, you still have to run through those 500GB.

also if you want to orchestrate this in applications (with requester + sender applications on opposing machines), it's way simpler to just hse_kvdb_export_exp(), pass it over the network and hse_kvdb_import_exp() it.... than to on-demand snapshot with dattobd, transfer the block data, then create the volumes and mount etc.

victorstewart commented 1 year ago

some new thoughts on backups. i'm going to be running my database shards over replicated Portworx volumes, so implicit duplication at the storage layer. but one could also run over BTRFS, and take snapshots of the data directory while the application is running and unaffected (read: no downtime).

alexttx commented 1 year ago

Filesystem snapshots taken while a KVDB is open may result in a corrupt snapshot. HSE would need a "pause" method as described in an earlier comment by @smoyergh.