Timeout Generating Snapshots of Large Volumes

binghamchris commented 8 years ago

When attempting to snapshot a large (multi-GB) volume, convoy reports a timeout executing the gzip command.

The convoy command run is:

convoy snapshot create $vol_name --name $snapshot_name

And the resulting error message is:

ERRO[0062] Error response from server, Timeout executing: gzip [/var/lib/convoy/vfs/snapshots/5204cf2a-5a95-4362-b853-7a11b7f07b33_724390fb-2ca6-4590-82b7-50d1a6fbde67.tar.gz.tmp], output , error <nil>

{
        "Error": "Error response from server, Timeout executing: gzip [/var/lib/convoy/vfs/snapshots/5204cf2a-5a95-4362-b853-7a11b7f07b33_724390fb-2ca6-4590-82b7-50d1a6fbde67.tar.gz.tmp], output , error \u003cnil\u003e\n"
}

Manually executing gzip on the file shows that it takes about 2.5 minutes to complete:

# ll -h 5204cf2a-5a95-4362-b853-7a11b7f07b33_724390fb-2ca6-4590-82b7-50d1a6fbde67*
-rw-r--r--. 1 root root 4.2G Apr 11 08:27 5204cf2a-5a95-4362-b853-7a11b7f07b33_724390fb-2ca6-4590-82b7-50d1a6fbde67.tar.gz.tmp
# date ; gzip 5204cf2a-5a95-4362-b853-7a11b7f07b33_724390fb-2ca6-4590-82b7-50d1a6fbde67.tar.gz.tmp ; date
Mon 11 Apr 08:31:26 CEST 2016
Mon 11 Apr 08:33:55 CEST 2016
# ll -h 5204cf2a-5a95-4362-b853-7a11b7f07b33_724390fb-2ca6-4590-82b7-50d1a6fbde67*
-rw-r--r--. 1 root root 4.0G Apr 11 08:27 5204cf2a-5a95-4362-b853-7a11b7f07b33_724390fb-2ca6-4590-82b7-50d1a6fbde67.tar.gz.tmp.gz

Looking at the contents of the Convoy VFS directory, it seems that Convoy is giving up and terminating the gzip process after about a minute:

# ll -hrt --full-time /var/lib/convoy/vfs/snapshots/5204cf2a-5a95-4362-b853-7a11b7f07b33_724390fb-2ca6-4590-82b7-50d1a6fbde67*
-rw-r--r--. 1 root root 4.2G 2016-04-11 06:37:46.500756530 +0200 /var/lib/convoy/vfs/snapshots/5204cf2a-5a95-4362-b853-7a11b7f07b33_724390fb-2ca6-4590-82b7-50d1a6fbde67.tar.gz.tmp
-rw-------. 1 root root 1.6G 2016-04-11 06:38:46.501478969 +0200 /var/lib/convoy/vfs/snapshots/5204cf2a-5a95-4362-b853-7a11b7f07b33_724390fb-2ca6-4590-82b7-50d1a6fbde67.tar.gz.tmp.gz

I've rummaged through the code here, but can't find either where this timeout is set or where it can be configured. I'm wondering whether the timeout is intentional, and I'd like to find out how to overcome it please.

Observed with Convoy 0.4.3 on CentOS 7.2 and Docker 1.10.3 on a NUC 6I5SYK (Core i5-6260U, 32GB RAM, PCIe SSD). The multi-GB volume came about as a result of using Convoy to manage volumes for a container running GitLab CE, which uses Git LFS to version control large binary objects (in this case, an ISO).

binghamchris commented 8 years ago

Just found a note in issue https://github.com/rancher/convoy/issues/73 indicating that there's a non-configurable 60 second timeout for all commands.
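
The effect of such a fixed limit can be reproduced outside Convoy. The sketch below uses GNU coreutils `timeout` as a stand-in (an assumption for illustration; it is not how Convoy implements its limit) to stop a 3-second job after 1 second:

```shell
# Stand-in for a hard command limit: GNU coreutils `timeout`, not Convoy code.
rc=0
timeout 1 sleep 3 || rc=$?   # the job needs 3 s but is only allowed 1 s
echo "exit status: $rc"      # GNU timeout reports 124 when the limit is hit
```

Any command that needs longer than the limit is cut off in the same way, regardless of how much progress it has made.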

So I had a bit of a rethink and tested pigz (parallel gzip), and found that it just barely manages to compress the 4.1GB volume snapshot in under a minute on my system:

# date; pigz 5204cf2a-5a95-4362-b853-7a11b7f07b33_724390fb-2ca6-4590-82b7-50d1a6fbde67.tar.gz.tmp ; date
Mon 11 Apr 11:25:07 CEST 2016
Mon 11 Apr 11:26:06 CEST 2016
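
For reference, the elapsed times implied by the two pairs of `date` stamps in this thread can be computed as below. This is a small sketch assuming GNU `date` (for the `-d` option); the stamps are treated as UTC only so that the subtraction does not depend on the local time zone.

```shell
# Elapsed seconds between two wall-clock stamps (GNU date assumed).
elapsed() {
    start=$(date -u -d "$1" +%s)
    end=$(date -u -d "$2" +%s)
    echo $((end - start))
}

gzip_secs=$(elapsed '2016-04-11 08:31:26' '2016-04-11 08:33:55')
pigz_secs=$(elapsed '2016-04-11 11:25:07' '2016-04-11 11:26:06')
echo "gzip: ${gzip_secs}s, pigz: ${pigz_secs}s"   # gzip: 149s, pigz: 59s
```

In other words, pigz finishes about one second inside the 60 second limit, which leaves no real margin.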

So I'd like to propose two things:

1. Reconsider the hard 60 second design limit to account for larger volumes; I'd propose making it a configurable timeout that defaults to 60 seconds instead.
2. Consider using pigz or some other parallel gzip to accelerate performance for larger volumes. Potentially this could be a switch in the convoy daemon command so that it can be enabled on systems with pigz installed.
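
As a rough illustration of both proposals (not Convoy code), the shell sketch below prefers pigz when it is installed, falls back to gzip otherwise, and takes its limit from a variable that defaults to 60 seconds. The variable name `SNAPSHOT_CMD_TIMEOUT` and the use of GNU `timeout` are assumptions made up for this example.

```shell
# Prefer a parallel compressor when available, otherwise use gzip.
if command -v pigz >/dev/null 2>&1; then
    compressor=pigz
else
    compressor=gzip
fi

# Hypothetical setting: configurable limit in seconds, defaulting to 60.
limit="${SNAPSHOT_CMD_TIMEOUT:-60}"

# Compress a small throwaway file under that limit.
tmp=$(mktemp)
printf 'example data\n' > "$tmp"
rc=0
timeout "$limit" "$compressor" "$tmp" || rc=$?
echo "compressor=$compressor limit=${limit}s exit=$rc"
rm -f "$tmp" "$tmp.gz"
```

Since pigz writes standard gzip output, the resulting `.gz` files can be read by the same tools either way.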

yasker commented 8 years ago

Hi @binghamchris

You should be able to use the --cmd-timeout parameter in the latest release: https://github.com/rancher/convoy/releases/tag/v0.5.0-rc1

binghamchris commented 8 years ago

Hi @yasker

Thanks to you and @jinuxstyle for the update :) However, I'm afraid there appears to be a bug in v0.5.0-rc1. While attempting to test the update I ran into this:

# convoy snapshot create gitlab-prod-data --name snap-gitlab-prod-data-1463237239 --cmd-timeout 10m
Incorrect Usage.

NAME:
   convoy snapshot create - create a snapshot for certain volume: snapshot create <volume>

USAGE:
   convoy snapshot create [command options] [arguments...]

OPTIONS:
   --name       name of snapshot

ERRO[0000] Error when executing command: flag provided but not defined: -cmd-timeout
{
        "Error": "Error when executing command: flag provided but not defined: -cmd-timeout"
}

The position of the --cmd-timeout argument doesn't affect the outcome; the same error message is displayed regardless.

I found that a "CmdTimeout":"" was present in /var/lib/rancher/convoy/convoy.cfg, so I set that as follows:

{"Root":"/var/lib/rancher/convoy","DriverList":["vfs"],"DefaultDriver":"vfs","MountNamespaceFD":"","IgnoreDockerDelete":false,"CreateOnDockerMount":false,"CmdTimeout":"10m"}

However, this doesn't appear to have had any impact. Running a snapshot command against a large volume (the same one that prompted me to file this issue originally) still timed out after 60 seconds:

# date; convoy snapshot create gitlab-prod-data --name snap-gitlab-prod-data-1463237239; date
Sat 14 May 17:04:21 CEST 2016
ERRO[0062] Error response from server, Timeout executing: gzip [/var/lib/rancher/convoy/vfs/snapshots/gitlab-prod-data_snap-gitlab-prod-data-1463237239.tar.gz.tmp], output , error <nil>

{
        "Error": "Error response from server, Timeout executing: gzip [/var/lib/rancher/convoy/vfs/snapshots/gitlab-prod-data_snap-gitlab-prod-data-1463237239.tar.gz.tmp], output , error \u003cnil\u003e\n"
}
Sat 14 May 17:05:23 CEST 2016

I've tried to figure out where the issue may be in the code, however I'm afraid it's too sophisticated for my limited coding abilities.

I'm happy to help with any other testing or information I can provide, though :)

jinuxstyle commented 8 years ago

> The position of the --cmd-timeout argument doesn't affect the outcome; the same error message is displayed regardless.

The --cmd-timeout option should be specified when starting the convoy daemon. It's not a valid option for the convoy client command line, which is how you used it.

> I found that a "CmdTimeout":"" was present in /var/lib/rancher/convoy/convoy.cfg, so I set that as follows:

Did you edit convoy.cfg manually? If so, you need to restart the convoy daemon for the change to take effect.
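
A quick way to confirm what is actually stored in the config before restarting is to extract the `CmdTimeout` field. The sketch below works on a copy of the JSON quoted earlier in this thread rather than on the real file, and uses plain `sed` on the assumption that the file stays on a single line as shown; a JSON-aware tool would be more robust.

```shell
# Copy of the config line quoted earlier; on a real host read the file instead.
cfg='{"Root":"/var/lib/rancher/convoy","DriverList":["vfs"],"DefaultDriver":"vfs","MountNamespaceFD":"","IgnoreDockerDelete":false,"CreateOnDockerMount":false,"CmdTimeout":"10m"}'

# Pull out the value of "CmdTimeout".
value=$(printf '%s\n' "$cfg" | sed -n 's/.*"CmdTimeout":"\([^"]*\)".*/\1/p')
echo "CmdTimeout=$value"   # CmdTimeout=10m
```

As noted in this comment, an edited value only applies once the daemon has been restarted; the other route is to pass `--cmd-timeout` when starting the daemon.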

binghamchris commented 8 years ago

Apologies @jinuxstyle, I misunderstood where this was supposed to be used. It's now working as intended when the --cmd-timeout option is used as you directed... I can back up my GitLab volumes again, thanks! 😄

Timeout Generating Snapshots of Large Volumes #99