influxdata / influxdb

Scalable datastore for metrics, events, and real-time analytics
https://influxdata.com
Apache License 2.0

Current backup & restore process is slow #9984

Open ono760 opened 6 years ago

ono760 commented 6 years ago

I want to be able to quickly back up my prod data and restore it to other environments (dev/staging). The current backup and restore process is slow, sometimes taking hours. Currently, for ~500 GB of data, a full backup takes ~7.5 hrs to complete; a restore takes more than 10 hours to complete.

liangxinhui commented 5 years ago

+1

From top, it looks like only one thread is doing any work.

bapBardas commented 5 years ago

Backing up all Influx databases (around 950 GB) with the following command: influxd backup -portable myBackupDir. On a VM with 4 CPUs and 16 GB RAM, only 25% to 50% of the CPU is being used, and 12 GB out of 16 GB of memory. Started 5 hours ago: only 44 GB backed up so far...

alter commented 5 years ago

I have 16 CPUs and 64 GB RAM and it's really slow. Almost all memory is free, the CPU is about 95% idle, and the SSD is doing only about 100-200 IOPS, so the bottleneck is clearly in the backup tool's code.

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] commented 4 years ago

This issue has been automatically closed because it has not had recent activity. Please reopen if this issue is still important to you. Thank you for your contributions.

hbokh commented 4 years ago

Please reopen. Same issue here. TWELVE HOURS for 293M in /var/lib/influxdb/data/ using influxd backup -portable /var/backups/influxdb/raw/... Is this normal behaviour for InfluxDB backups & restores?

retoo commented 4 years ago

@hbokh Yep. We no longer back up Influx using the built-in tools; we use rsync and disk snapshots.

hbokh commented 4 years ago

We currently do backups like this:

influxd backup -portable /var/backups/influxdb/raw/

They take a long time to finish (> 12h).

But from the docs, I found the -host option, so I gave it a try:

influxd backup -portable -host 127.0.0.1:8088 /var/backups/influxdb/raw2/

ALMOST instant!

retoo commented 4 years ago

the problem is the restore, not the backup :)

hbokh commented 4 years ago

the problem is the restore, not the backup :)

Well, the restore was a breeze too here, using a backup from 1.6.4 restored into 1.7.9:

influxd restore -portable -host 127.0.0.1:8088 /var/backups/influxdb/raw2/

Still testing stuff TBH.

retoo commented 4 years ago

for larger databases? ours takes days

du -chs /var/lib/influxdb
471G    /var/lib/influxdb

hbokh commented 4 years ago

du -chs /var/lib/influxdb

We're in a different league here 😁

# du -chs /var/lib/influxdb
301M    /var/lib/influxdb

e-dard commented 4 years ago

@retoo could you provide logs of the server during this time? I'm particularly interested in whether your caches cannot be snapshotted by the backup tool because they're busy handling write workloads.

e-dard commented 4 years ago

I tested this on some data in a local instance:

edd@tr:~|⇒  du -chs ~/.influxdb
11G     /home/edd/.influxdb
11G     total

No write workload on influxd

edd@tr:~|⇒  time ossinfluxd-1.8 backup -portable ~/backup-test
2019/11/15 13:27:27 backing up metastore to /home/edd/backup-test/meta.00
2019/11/15 13:27:27 No database, retention policy or shard ID given. Full meta store backed up.
2019/11/15 13:27:27 Backing up all databases in portable format
2019/11/15 13:27:27 backing up db=
2019/11/15 13:27:27 backing up db=_internal rp=monitor shard=1 to /home/edd/backup-test/_internal.monitor.00001.00 since 0001-01-01T00:00:00Z
2019/11/15 13:27:27 backing up db=_internal rp=monitor shard=3 to /home/edd/backup-test/_internal.monitor.00003.00 since 0001-01-01T00:00:00Z
2019/11/15 13:27:27 backing up db=_internal rp=monitor shard=5 to /home/edd/backup-test/_internal.monitor.00005.00 since 0001-01-01T00:00:00Z
2019/11/15 13:27:27 backing up db=_internal rp=monitor shard=7 to /home/edd/backup-test/_internal.monitor.00007.00 since 0001-01-01T00:00:00Z
2019/11/15 13:27:27 backing up db=_internal rp=monitor shard=9 to /home/edd/backup-test/_internal.monitor.00009.00 since 0001-01-01T00:00:00Z
2019/11/15 13:27:27 backing up db=_internal rp=monitor shard=11 to /home/edd/backup-test/_internal.monitor.00011.00 since 0001-01-01T00:00:00Z
2019/11/15 13:27:27 backing up db=_internal rp=monitor shard=13 to /home/edd/backup-test/_internal.monitor.00013.00 since 0001-01-01T00:00:00Z
2019/11/15 13:27:27 backing up db=db rp=autogen shard=2 to /home/edd/backup-test/db.autogen.00002.00 since 0001-01-01T00:00:00Z
2019/11/15 13:27:27 backing up db=db2 rp=autogen shard=14 to /home/edd/backup-test/db2.autogen.00014.00 since 0001-01-01T00:00:00Z
2019/11/15 13:28:04 backup complete:
2019/11/15 13:28:04     /home/edd/backup-test/20191115T132727Z.meta
2019/11/15 13:28:04     /home/edd/backup-test/20191115T132727Z.s1.tar.gz
2019/11/15 13:28:04     /home/edd/backup-test/20191115T132727Z.s3.tar.gz
2019/11/15 13:28:04     /home/edd/backup-test/20191115T132727Z.s5.tar.gz
2019/11/15 13:28:04     /home/edd/backup-test/20191115T132727Z.s7.tar.gz
2019/11/15 13:28:04     /home/edd/backup-test/20191115T132727Z.s9.tar.gz
2019/11/15 13:28:04     /home/edd/backup-test/20191115T132727Z.s11.tar.gz
2019/11/15 13:28:04     /home/edd/backup-test/20191115T132727Z.s13.tar.gz
2019/11/15 13:28:04     /home/edd/backup-test/20191115T132727Z.s2.tar.gz
2019/11/15 13:28:04     /home/edd/backup-test/20191115T132727Z.s14.tar.gz
2019/11/15 13:28:04     /home/edd/backup-test/20191115T132727Z.manifest
ossinfluxd-1.8 backup -portable ~/backup-test  216.09s user 18.42s system 615% cpu 38.127 total

Then, with a constant write workload on influxd:

edd@tr:~|⇒  time ossinfluxd-1.8 backup -portable ~/backup-test
2019/11/15 13:30:37 backing up metastore to /home/edd/backup-test/meta.00
2019/11/15 13:30:37 No database, retention policy or shard ID given. Full meta store backed up.
2019/11/15 13:30:37 Backing up all databases in portable format
2019/11/15 13:30:37 backing up db=
2019/11/15 13:30:37 backing up db=_internal rp=monitor shard=1 to /home/edd/backup-test/_internal.monitor.00001.00 since 0001-01-01T00:00:00Z
2019/11/15 13:30:37 backing up db=_internal rp=monitor shard=3 to /home/edd/backup-test/_internal.monitor.00003.00 since 0001-01-01T00:00:00Z
2019/11/15 13:30:37 backing up db=_internal rp=monitor shard=5 to /home/edd/backup-test/_internal.monitor.00005.00 since 0001-01-01T00:00:00Z
2019/11/15 13:30:37 backing up db=_internal rp=monitor shard=7 to /home/edd/backup-test/_internal.monitor.00007.00 since 0001-01-01T00:00:00Z
2019/11/15 13:30:37 backing up db=_internal rp=monitor shard=9 to /home/edd/backup-test/_internal.monitor.00009.00 since 0001-01-01T00:00:00Z
2019/11/15 13:30:37 backing up db=_internal rp=monitor shard=11 to /home/edd/backup-test/_internal.monitor.00011.00 since 0001-01-01T00:00:00Z
2019/11/15 13:30:37 backing up db=_internal rp=monitor shard=13 to /home/edd/backup-test/_internal.monitor.00013.00 since 0001-01-01T00:00:00Z
2019/11/15 13:30:37 backing up db=db rp=autogen shard=2 to /home/edd/backup-test/db.autogen.00002.00 since 0001-01-01T00:00:00Z
2019/11/15 13:30:38 backing up db=db rp=autogen shard=15 to /home/edd/backup-test/db.autogen.00015.00 since 0001-01-01T00:00:00Z
2019/11/15 13:30:38 backing up db=db2 rp=autogen shard=14 to /home/edd/backup-test/db2.autogen.00014.00 since 0001-01-01T00:00:00Z
2019/11/15 13:32:18 backup complete:
2019/11/15 13:32:18     /home/edd/backup-test/20191115T133037Z.meta
2019/11/15 13:32:18     /home/edd/backup-test/20191115T133037Z.s1.tar.gz
2019/11/15 13:32:18     /home/edd/backup-test/20191115T133037Z.s3.tar.gz
2019/11/15 13:32:18     /home/edd/backup-test/20191115T133037Z.s5.tar.gz
2019/11/15 13:32:18     /home/edd/backup-test/20191115T133037Z.s7.tar.gz
2019/11/15 13:32:18     /home/edd/backup-test/20191115T133037Z.s9.tar.gz
2019/11/15 13:32:18     /home/edd/backup-test/20191115T133037Z.s11.tar.gz
2019/11/15 13:32:18     /home/edd/backup-test/20191115T133037Z.s13.tar.gz
2019/11/15 13:32:18     /home/edd/backup-test/20191115T133037Z.s2.tar.gz
2019/11/15 13:32:18     /home/edd/backup-test/20191115T133037Z.s15.tar.gz
2019/11/15 13:32:18     /home/edd/backup-test/20191115T133037Z.s14.tar.gz
2019/11/15 13:32:18     /home/edd/backup-test/20191115T133037Z.manifest
ossinfluxd-1.8 backup -portable ~/backup-test  243.15s user 27.64s system 264% cpu 1:42.27 total
edd@tr:~|⇒

So when the server is handling writes and the cache is busy, the backup took 64 seconds longer (102 seconds versus 38 seconds).

e-dard commented 4 years ago

@retoo @hbokh can you verify your InfluxDB versions please?

hbokh commented 4 years ago

@retoo @hbokh can you verify your InfluxDB versions please?

Thank you for re-opening, @e-dard ! I am currently working on a migration to a new InfluxDB-server, from v1.6.4 to v1.7.9. As said, adding the -host 127.0.0.1:8088 option solved my specific issue.

In a discussion at work it was mentioned that the long backup might have to do with the process switching between the IPv6 and IPv4 addresses, timing out, retrying, and so on.
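
One quick way to sanity-check that hypothesis (a small diagnostic sketch, not part of InfluxDB) is to see what localhost actually resolves to on the box, since the backup tool defaults to localhost:8088:

package main

import (
    "fmt"
    "net"
)

func main() {
    // Print every address "localhost" resolves to, in order. If ::1 is
    // returned first, a client dialing localhost:8088 may try IPv6 before
    // falling back to IPv4, which would fit the timeout/retry hypothesis.
    addrs, err := net.LookupHost("localhost")
    if err != nil {
        fmt.Println("lookup failed:", err)
        return
    }
    for _, addr := range addrs {
        fmt.Println(addr)
    }
}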

smazac commented 4 years ago

By the way, there is a huge difference in behaviour between Windows and Linux on version 1.7.8. For the same database, on Windows backup & restore take ~5 min and the dump files are ~12 MB on disk, while on Linux backup & restore complete almost instantly and the files are ~3 MB.

e-dard commented 4 years ago

I think one of the issues is that even when doing a local backup, everything is still streamed out over a TCP connection. It seems more sensible to handle this usage differently: rather than streaming hard-linked files over TCP, we should just create archives directly from the source files.
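
A minimal sketch of that idea (a hypothetical archiveDir helper with example paths, not the actual backup code; a real backup would still need to snapshot the in-memory cache first) that tars and gzips files straight from disk instead of round-tripping them through the local TCP snapshot connection:

package main

import (
    "archive/tar"
    "compress/gzip"
    "fmt"
    "io"
    "os"
    "path/filepath"
)

// archiveDir writes the contents of srcDir into a .tar.gz at outPath,
// reading the files directly from disk.
func archiveDir(srcDir, outPath string) error {
    out, err := os.Create(outPath)
    if err != nil {
        return err
    }
    defer out.Close()

    gw := gzip.NewWriter(out)
    defer gw.Close()
    tw := tar.NewWriter(gw)
    defer tw.Close()

    return filepath.Walk(srcDir, func(path string, info os.FileInfo, err error) error {
        if err != nil || info.IsDir() {
            return err
        }
        rel, err := filepath.Rel(srcDir, path)
        if err != nil {
            return err
        }
        hdr, err := tar.FileInfoHeader(info, "")
        if err != nil {
            return err
        }
        hdr.Name = filepath.ToSlash(rel)
        if err := tw.WriteHeader(hdr); err != nil {
            return err
        }
        f, err := os.Open(path)
        if err != nil {
            return err
        }
        defer f.Close()
        _, err = io.Copy(tw, f)
        return err
    })
}

func main() {
    // Example shard directory and output path; adjust for your install.
    if err := archiveDir("/var/lib/influxdb/data/db/autogen/2", "/var/backups/shard2.tar.gz"); err != nil {
        fmt.Fprintln(os.Stderr, err)
    }
}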

jacobmarble commented 4 years ago

@e-dard please clarify. To me, "create archives directly off of the source files" implies locking up the live data for a much longer window of time compared to creating hard links. Perhaps I don't understand the problem fully.

jacobmarble commented 4 years ago

Ah, local backups are different, yes. I'm still unclear re why we wouldn't create hard links, but I agree that avoiding the TCP layer locally would help.

lesam commented 2 years ago

Next steps: set up some realistic workload tests and figure out if we're saturating bandwidth or disk read speed (and especially if neither, why not).
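
To separate raw disk read speed from everything else, here is a rough diagnostic sketch (not part of the backup tool; the data path is just an example): it sequentially reads every file under the data directory and reports MB/s, which can then be compared with the throughput the backup itself achieves.

package main

import (
    "fmt"
    "io"
    "os"
    "path/filepath"
    "time"
)

func main() {
    dir := "/var/lib/influxdb/data" // example path; adjust for your install
    start := time.Now()
    var total int64
    err := filepath.Walk(dir, func(path string, info os.FileInfo, err error) error {
        if err != nil || info.IsDir() {
            return err
        }
        f, err := os.Open(path)
        if err != nil {
            return err
        }
        defer f.Close()
        n, err := io.Copy(io.Discard, f) // sequential read, bytes discarded
        total += n
        return err
    })
    if err != nil {
        fmt.Fprintln(os.Stderr, "walk error:", err)
    }
    secs := time.Since(start).Seconds()
    fmt.Printf("read %d bytes in %.1fs (%.1f MB/s)\n", total, secs, float64(total)/secs/1e6)
}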

The -host flag is interesting. The default is localhost:8088. It is surprising that changing to 127.0.0.1:8088 would change things so much; we need to try to reproduce and understand that.

davidby-influx commented 2 years ago

One thing to note is that this issue is really two issues: OSS backup and Enterprise backup (there is an engineering assistance request that has been closed in favor of tracking here). While some fixes to OSS backup will benefit Enterprise, there are other steps which will be Enterprise-only. We may need to split this ticket in two, in the OSS and Enterprise repos, after we have an initial design for the improvements and can see whether there is work in both.

davidby-influx commented 2 years ago

Backup timings in seconds comparing the default localhost:8088 and the -host 127.0.0.1:8088 parameter. Backups are about 6 GB. There may be differences between localhost and 127.0.0.1 in going through the network stack, and, of course, localhost can be mapped to a different address (like 127.0.1.1). In the timings below, 127.0.0.1 uses less computational time (user+sys) by a small amount, but more real time. I suspect local configuration and the installed network stack would affect whether one or the other is preferable, but the default should remain localhost.

@lesam @hbokh - TL;DR - don't see much difference between 127.0.0.1 and localhost

real 127.0.0.1    real localhost    user 127.0.0.1    user localhost    sys 127.0.0.1    sys localhost
18.604            21.675            13.246            24.061            12.666           21.082
22.33             20.617            17.168            16.705            16.288           15.539
25.282            23.504            17.718            16.4              16.288           15.504
24.618            19.22             17.337            14.829            15.925           14.013
--------------    --------------    --------------    --------------    -------------    -------------
22.7085 (avg)     21.254 (avg)      16.36725 (avg)    17.99875 (avg)    15.29175 (avg)   16.5345 (avg)

lesam commented 2 years ago

@davidby-influx I think we discussed this since this comment, but for posterity: let's focus on why backing up a few hundred GB is taking hours, e.g. the private issue linked above where a backup of 125 GB took more than 7 hours; that seems too long. And it maxed out disk IO. It seems like we have some inefficient pattern while a backup is in progress. It would be interesting to know whether the backup takes the same time when queries/writes are turned off.

The backup in that issue is running at ~5 MB/s per shard on extremely fast disks; something seems wrong.

lesam commented 2 years ago

And it looks like your tests aren't catching whatever the issue is, because they're running at ~6 GB / 25 s = 240 MB/s.

danatinflux commented 2 years ago

@lesam Just for the record, the backup that's in the EAR (2885) was of a clone that should have had no incoming or outgoing traffic except for the backup.

davidby-influx commented 2 years ago

@lesam - we have the private issue mentioned above, which is a clearly pathological case and a bug of some sort, and then we have the general inefficiency of the backup process in both OSS and Enterprise. We need to solve both, IMHO: find the bug, and do things like parallelizing backup in general.
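
As a sketch of the "parallelize backup" direction (purely illustrative: backupShard, the shard IDs, and the worker count are made up, not the actual backup code), a bounded worker pattern that streams several shards concurrently instead of one at a time:

package main

import (
    "fmt"
    "sync"
)

// backupShard stands in for whatever actually streams one shard to the
// backup destination.
func backupShard(id int) error {
    fmt.Println("backing up shard", id)
    return nil
}

func main() {
    shards := []int{1, 3, 5, 7, 9, 11, 13}
    sem := make(chan struct{}, 4) // at most 4 shards in flight
    errs := make(chan error, len(shards))
    var wg sync.WaitGroup

    for _, id := range shards {
        wg.Add(1)
        go func(id int) {
            defer wg.Done()
            sem <- struct{}{}        // acquire a worker slot
            defer func() { <-sem }() // release it
            if err := backupShard(id); err != nil {
                errs <- err
            }
        }(id)
    }
    wg.Wait()
    close(errs)
    for err := range errs {
        fmt.Println("shard backup failed:", err)
    }
}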

davidby-influx commented 2 years ago

My tests were just to check the comment above that the -host parameter made a huge difference, not an attempt to replicate the private issue linked above.

lesam commented 2 years ago

@davidby-influx the top comment on this issue:

I want to be able to quickly back up my prod data and restore it to other environments (dev/staging). The current backup and restore process is slow, sometimes taking hours. Currently, for ~500 GB of data, a full backup takes ~7.5 hrs to complete; a restore takes more than 10 hours to complete.

That's about 20MB/s, which still seems quite slow for a single node of OSS - sounds closer to our pathological case than just wanting a bit of speed improvement.

davidby-influx commented 2 years ago

Internal developer notes:

The use of io.CopyN here wraps the source io.Reader in a LimitReader, which can prevent io.Copy from using alternate, perhaps faster, interfaces such as WriterTo and ReaderFrom. The io.CopyN seems to be a safety measure for snapshotting a file that may still be growing, but it could also have a detrimental effect on performance.

Update: There seems to be no way to get ReaderFrom and WriterTo interfaces when using archive/tar and os.File, so CopyN is not an issue.
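
For context on that note, a small sketch of the pattern in question (copyIntoTar is an illustrative stand-in, not the actual influxdb function):

package backupsketch

import (
    "archive/tar"
    "io"
    "os"
)

// copyIntoTar mirrors the backup code's copy step: write exactly size
// bytes of f into the tar stream. io.CopyN(tw, f, size) is equivalent to
// io.Copy(tw, io.LimitReader(f, size)); the LimitReader hides any
// io.WriterTo fast path on the source, and *tar.Writer does not export a
// ReadFrom method (as of the Go 1.17 discussed below), so io.Copy falls
// back to a plain buffered loop either way; hence the update above that
// CopyN itself is not the issue.
func copyIntoTar(tw *tar.Writer, f *os.File, size int64) error {
    _, err := io.CopyN(tw, f, size)
    return err
}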

davidby-influx commented 2 years ago

Internal developer notes: On a 6-data-node cluster running on my local machine under Docker, I'm seeing backup speeds like this:

Backed up to . in 13.321069249s, transferred 2141312247 bytes
Backed up to /root/.influxdb/backies in 12.497472908s, transferred 2144231159 bytes

So between 160 and 172 MB / second. Not terribly fast, but not pathologically slow.

The pathological case above (500GB / 7.5 hours) is about 1/9th as fast.
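
For reference, the arithmetic behind that comparison (decimal units):

2,141,312,247 bytes / 13.32 s ≈ 161 MB/s
2,144,231,159 bytes / 12.50 s ≈ 172 MB/s
500 GB / 7.5 h = 500,000 MB / 27,000 s ≈ 18.5 MB/s, i.e. roughly 1/9 of ~166 MB/s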

danatinflux commented 2 years ago

@davidby-influx Would you like me to spin up a cluster for you to test in AWS? I feel that maybe the drag comes from the inter-nodal communication (meta node doing the backup <-> data node(s)) and if you're running it on docker it doesn't have to leave your laptop, right? :)

davidby-influx commented 2 years ago

@danatinflux - yes, please. Let's spin up a copy/clone of something that we have seen backup-slowness on, if possible.

timhallinflux commented 2 years ago

I think on a laptop you are going to see lots and lots of disk contention.

davidby-influx commented 2 years ago

@gwossum - you had some interesting results yesterday....

davidby-influx commented 2 years ago

I think on a laptop you are going to see lots and lots of disk contention. (@timhallinflux)

But we have reports of OSS backups performing terribly, presumably with the client on the same machine as the server, which were resolved by explicitly specifying the IP address, and other weirdness. So I am hoping that we can get a local repro of some problem involving local TCP/IP that perhaps affects both OSS and Enterprise.

gwossum commented 2 years ago

Some assorted findings about backups and network throughput:

davidby-influx commented 2 years ago

Per the comment above, the Go tar package seems to prevent ReaderFrom use. It may be difficult with the current Go packages to use the ReaderFrom or WriterTo interfaces.

davidby-influx commented 2 years ago

I tested implementing ReadFrom() on the tar.Writer type (because a tar.Writer.readFrom() exists in the source file in Go 1.17). Because of the layering of tar.Writer, the call does not end up using sendfile or copy_file_range even when tar.Writer.readFrom is exported. I believe we will have more luck tuning socket parameters and buffer sizes than trying to force the use of optimized system calls.
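
As a sketch of the socket-tuning direction (hypothetical client-side code, not something the backup tool does today; the address is the default backup endpoint from above), the TCP connection could request larger kernel buffers before streaming the backup:

package main

import (
    "fmt"
    "net"
    "os"
)

func main() {
    // Dial the endpoint the backup client talks to (default port 8088).
    conn, err := net.Dial("tcp", "127.0.0.1:8088")
    if err != nil {
        fmt.Fprintln(os.Stderr, "dial:", err)
        os.Exit(1)
    }
    defer conn.Close()

    // Ask the kernel for larger socket buffers before the transfer;
    // whether this helps is bounded by OS limits such as
    // net.core.rmem_max / net.core.wmem_max on Linux.
    tcp := conn.(*net.TCPConn)
    if err := tcp.SetReadBuffer(4 << 20); err != nil { // 4 MiB
        fmt.Fprintln(os.Stderr, "SetReadBuffer:", err)
    }
    if err := tcp.SetWriteBuffer(4 << 20); err != nil {
        fmt.Fprintln(os.Stderr, "SetWriteBuffer:", err)
    }
    // ... issue the backup request and read the stream as usual ...
}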

davidby-influx commented 2 years ago

Dropping a bufio.Writer of 1,024,000 bytes in between the tar.Writer and its underlying io.Writer has no statistically discernible effect on performance.

In stream.go:

func Stream(w io.Writer, dir, relativePath string, writeFunc func(f os.FileInfo, shardRelativePath, fullPath string, tw *tar.Writer) error) (rErr error) {
    // Insert a 1,000 * 1,024 byte buffered writer between the tar stream
    // and the underlying io.Writer.
    bw := bufio.NewWriterSize(w, 1000*1024)
    tw := tar.NewWriter(bw)
    defer func() {
        // Close the tar stream first, then flush the buffer beneath it.
        errors2.Capture(&rErr, tw.Close)()
        errors2.Capture(&rErr, bw.Flush)()
    }()
    // ... rest of Stream unchanged ...

gwossum commented 2 years ago

I am unable to reproduce pathologically slow (10x slowdown) backups under "normal" conditions. However, a brand new cloud1 clone seems to consistently show the issue. After an initial very slow backup, subsequent backups will proceed at expected speed. Further investigation into how potato creates clones is required.

davidby-influx commented 2 years ago

Nice work!

samhld commented 2 years ago

@gwossum Cool -- so do I understand this correctly? It's possibly an issue with Potato and not something customers are experiencing (assuming they're not committing whatever sin Potato might be committing)?

timhallinflux commented 2 years ago

No. There are plenty of issues from community members about slow backups.

bapBardas commented 2 years ago

Yes, I confirm: I have only ever used the community version and have faced these issues.

danatinflux commented 2 years ago

I know you probably checked this already, but I have to ask: what's the size delta between the initial backup and the subsequent backup? I want to make sure that the "second backup" isn't an incremental disguising itself as a full backup.

gwossum commented 2 years ago

@danatinflux The subsequent backups are also full backups, using -strategy full and a different output directory.

gwossum commented 2 years ago

@samhld @timhallinflux There are probably multiple issues that can cause slow backups, but the one involving slow backups after cloning a cloud1 cluster is the one I currently have a repro for.

davidby-influx commented 2 years ago

We are seeing this in OSS, on machines other than our cloud1, so I am hoping that the problem is not specific to our hosting, and that our repro helps us pinpoint the issue.

gwossum commented 2 years ago

Discussion of slow backups on cloud1 clones has been moved to a new ticket in a different repo: https://github.com/influxdata/plutonium/issues/3816