stashed / stash

🛅 Backup your Kubernetes Stateful Applications
https://stash.run

Restic uses a lot of memory on directories with many files #1504

Open sgielen opened 1 year ago

sgielen commented 1 year ago

This is a known Restic issue (see e.g. restic/restic#2446) and was supposedly improved in restic release v0.14.0. I've created #1503 to bump to v0.15.1 and am checking how I can build this version and test it locally.

sgielen commented 1 year ago

For one, it looks like the Makefile has some assumptions about the git branch name:

container: stashed/stash:feature/bump-to-restic-0-15-1_linux_amd64
[+] Building 0.0s (0/0)                                                                                                                                                                                 
ERROR: invalid tag "stashed/stash:feature/bump-to-restic-0-15-1_linux_amd64": invalid reference format
make: *** [Makefile:228: bin/.container-stashed_stash-feature/bump-to-restic-0-15-1_linux_amd64-PROD] Error 1

Docker image tags can't contain a slash, so I'm retrying with the branch renamed to just bump-to-restic-0-15-1.
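
The build fails because the branch name feature/bump-to-restic-0-15-1 ends up inside the image tag, and Docker does not accept a slash there. Below is a minimal sketch of how a branch name could be normalized before it is used as a tag. `sanitize_tag` is a hypothetical helper for illustration, not something the Stash Makefile is known to provide.

```shell
# Hypothetical helper: turn a git branch name into a string usable as a Docker tag.
sanitize_tag() {
  # 1. Replace every character outside [A-Za-z0-9_.-] with a hyphen.
  # 2. Strip leading periods and hyphens, which a tag may not start with.
  # 3. Truncate to 128 characters, the maximum tag length.
  printf '%s\n' "$1" \
    | sed -e 's/[^A-Za-z0-9_.-]/-/g' -e 's/^[.-]*//' \
    | cut -c1-128
}

sanitize_tag "feature/bump-to-restic-0-15-1"
```

Renaming the branch, as done here, avoids the problem without touching the Makefile. A helper like this would only matter if branch names with slashes should keep working.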

sgielen commented 1 year ago

I don't believe I can run the e2e tests myself, since they need TEST_CREDENTIALS and a GOOGLE_SERVICE_ACCOUNT_JSON_KEY. So is there no way to run the e2e tests locally? The unit tests succeed, but since I've only updated the restic binary, that probably doesn't indicate anything.

sgielen commented 1 year ago

From .github/workflows/e2e.yaml I gather I can just git clone https://github.com/stashed/installer.git in the parent directory where the stash repo is cloned, then run make install to install to the current cluster. This installs Stash, but with the image stashed/stash:bump-to-restic-0-15-1_linux_amd64, which of course the Minikube kubelet can't pull.

Probably I could extract the image locally and then import it in Minikube, but it should also be possible to build it directly on Minikube's Docker daemon:

$ eval $(minikube docker-env)
$ make container
# this needs a patch to the Makefile. With bin/$(OS)_$(ARCH)/$(BIN) already built:
-bin/.container-$(DOTFILE_IMAGE)-%: bin/$(OS)_$(ARCH)/$(BIN) $(DOCKERFILE_%)
+bin/.container-$(DOTFILE_IMAGE)-%: $(DOCKERFILE_%)
# because the build for bin/.../stash fails, as the Docker daemon can't find the necessary local files.
$ make install

And indeed:

$ kubectl get pods -n kube-system
NAME                                     READY   STATUS    RESTARTS      AGE
coredns-787d4945fb-lhvb8                 1/1     Running   0             18m
etcd-minikube                            1/1     Running   0             18m
kube-apiserver-minikube                  1/1     Running   0             18m
kube-controller-manager-minikube         1/1     Running   0             18m
kube-proxy-gmfcg                         1/1     Running   0             18m
kube-scheduler-minikube                  1/1     Running   0             18m
stash-stash-community-6886f746fd-vdwnn   2/2     Running   0             16s
storage-provisioner                      1/1     Running   1 (17m ago)   18m

Indeed the new Restic version is used:

# kubectl exec -ti -n kube-system stash-stash-community-6886f746fd-vdwnn -c operator -- restic version
restic 0.15.1 compiled with go1.19.5 on linux/amd64
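
As an alternative to building on Minikube's Docker daemon, the image built locally could be copied into Minikube with `minikube image load`. This is a sketch only: the `run` wrapper and `load_into_minikube` are hypothetical names, and whether the pod then uses the loaded image depends on its image pull policy, which this thread does not state.

```shell
# Image name taken from the thread; adjust it to your own tag.
IMAGE="stashed/stash:bump-to-restic-0-15-1_linux_amd64"

# Hypothetical wrapper: print the command by default, execute it when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "0" ]; then
    "$@"
  else
    printf '%s\n' "$*"
  fi
}

# Copy a locally built image into the Minikube node so the kubelet need not pull it.
load_into_minikube() {
  run minikube image load "$1"
}

load_into_minikube "$IMAGE"
```

By default this only prints the command it would run. Setting DRY_RUN=0 in the environment executes it against the current Minikube profile.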

sgielen commented 1 year ago

Looks like I can run the e2e tests now. During the tests, the sidecar has the new version of Restic:

$ kubectl exec -ti -n test-stash-xifp4i source-ss-stash-e2e-rh7kyx-1 -c stash -- restic version
restic 0.15.1 compiled with go1.19.5 on linux/amd64
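
Checking the version by eye works for a single pod. If the check should be scripted, the version line could be parsed and compared against a minimum. This is a rough sketch: `restic_version_of` and `version_ge` are hypothetical helpers, and the comparison assumes a `sort` that supports `-V` (GNU coreutils does).

```shell
# Extract the version number (second field) from a `restic version` line.
restic_version_of() {
  printf '%s\n' "$1" | awk '{ print $2 }'
}

# Succeed when version $1 is greater than or equal to version $2.
# Relies on `sort -V` for version-aware ordering.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Sample line copied from the output above.
line="restic 0.15.1 compiled with go1.19.5 on linux/amd64"

if version_ge "$(restic_version_of "$line")" "0.14.0"; then
  echo "restic is new enough"
else
  echo "restic is too old"
fi
```

In practice the `line` variable would be filled from the `kubectl exec ... -- restic version` command shown above rather than hard-coded.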

And the e2e test seems to succeed against backend MinIO:

Running e2e tests:
ginkgo -r --vv -race --progress --trace --flake-attempts=2 --timeout=4h test -- --docker-registry=stashed --image-tag=bump-to-restic-0-15-1_linux_amd64
E0304 14:14:38.003504      69 options_test.go:72] no such flag -logtostderr
Running Suite: e2e Suite - /src/test/e2e
[stash] 
========================================
Random Seed: 1677939275

Will run 116 of 116 specs
------------------------------
[BeforeSuite] 
/src/test/e2e/e2e_suite_test.go:66
[BeforeSuite] TOP-LEVEL
  /src/test/e2e/e2e_suite_test.go:66
STEP: Using test namespace test-stash-xifp4i 03/04/23 14:14:38.27
STEP: Deploy TLS secured Minio Server 03/04/23 14:14:38.288
------------------------------
[BeforeSuite] PASSED [5.606 seconds]
[BeforeSuite] 
/src/test/e2e/e2e_suite_test.go:66

  Begin Captured GinkgoWriter Output >>
    [BeforeSuite] TOP-LEVEL
      /src/test/e2e/e2e_suite_test.go:66
    STEP: Using test namespace test-stash-xifp4i 03/04/23 14:14:38.27
    STEP: Deploy TLS secured Minio Server 03/04/23 14:14:38.288
  << End Captured GinkgoWriter Output
------------------------------
Snapshot Tests Backend Minio
  should successfully perform Snapshot operations
  /src/test/e2e/misc/snapshots.go:179
[BeforeEach] Snapshot Tests
  /src/test/e2e/misc/snapshots.go:43
[It] should successfully perform Snapshot operations
  /src/test/e2e/misc/snapshots.go:179
STEP: Creating PVC: test-stash-xifp4i/source-volume-stash-e2e-rh7kyx 03/04/23 14:14:43.622
STEP: Deploying Deployment: source-dp-stash-e2e-rh7kyx-ih01w7 03/04/23 14:14:43.632
STEP: Waiting for Deployment to be ready 03/04/23 14:14:43.705
STEP: Generating sample data inside workload pods 03/04/23 14:14:49.575
STEP: Verifying that sample data has been generated 03/04/23 14:14:49.666
STEP: Deploying StatefulSet: source-ss-stash-e2e-rh7kyx 03/04/23 14:14:49.745
STEP: Waiting for StatefulSet to be ready 03/04/23 14:14:49.767
STEP: Generating sample data inside workload pods 03/04/23 14:15:03.254
STEP: Verifying that sample data has been generated 03/04/23 14:15:03.462
STEP: Creating Storage Secret 03/04/23 14:15:03.667
STEP: Creating Repository 03/04/23 14:15:03.7
STEP: Creating Storage Secret 03/04/23 14:15:03.9
STEP: Creating Repository 03/04/23 14:15:03.978
STEP: Creating BackupConfiguration for the workloads 03/04/23 14:15:04.135
STEP: Verifying that sidecar has been injected 03/04/23 14:15:04.721
STEP: Waiting for Deployment to be ready with sidecar 03/04/23 14:15:09.256
STEP: Verifying that sidecar has been injected 03/04/23 14:15:49.469
STEP: Waiting for StatefulSet to be ready with sidecar 03/04/23 14:15:49.475
STEP: Triggering an instant backup for the workloads 03/04/23 14:16:27.658
STEP: Waiting for the backup processes to complete 03/04/23 14:16:27.766
STEP: Verifying that the backup process has succeeded for the workloads 03/04/23 14:16:51.927
STEP: Listing all snapshots 03/04/23 14:16:51.934
STEP: Get a particular snapshot 03/04/23 14:16:54.081
STEP: Filter by repository name 03/04/23 14:16:57.326
STEP: Filter by hostname 03/04/23 14:16:58.47
STEP: Filter by negated selector 03/04/23 14:17:00.484
STEP: Filter by set based selector 03/04/23 14:17:01.571
STEP: Deleting snapshot minio-stash-e2e-rh7kyx-cqwcer-1a66b4c9 03/04/23 14:17:05.753
STEP: Checking deleted snapshot not exist 03/04/23 14:17:08.629
[JustAfterEach] Snapshot Tests
  /src/test/e2e/misc/snapshots.go:46
[AfterEach] Snapshot Tests
  /src/test/e2e/misc/snapshots.go:49
STEP: Cleaning Test Resources 03/04/23 14:17:10.285
------------------------------
• [SLOW TEST] [184.612 seconds]
Snapshot Tests
/src/test/e2e/misc/snapshots.go:40
  Backend
  /src/test/e2e/misc/snapshots.go:177
    Minio
    /src/test/e2e/misc/snapshots.go:178
      should successfully perform Snapshot operations
      /src/test/e2e/misc/snapshots.go:179

  Begin Captured GinkgoWriter Output >>
    [BeforeEach] Snapshot Tests
      /src/test/e2e/misc/snapshots.go:43
    [It] should successfully perform Snapshot operations
      /src/test/e2e/misc/snapshots.go:179
    STEP: Creating PVC: test-stash-xifp4i/source-volume-stash-e2e-rh7kyx 03/04/23 14:14:43.622
    STEP: Deploying Deployment: source-dp-stash-e2e-rh7kyx-ih01w7 03/04/23 14:14:43.632
    STEP: Waiting for Deployment to be ready 03/04/23 14:14:43.705
    STEP: Generating sample data inside workload pods 03/04/23 14:14:49.575
    STEP: Verifying that sample data has been generated 03/04/23 14:14:49.666
    STEP: Deploying StatefulSet: source-ss-stash-e2e-rh7kyx 03/04/23 14:14:49.745
    STEP: Waiting for StatefulSet to be ready 03/04/23 14:14:49.767
    STEP: Generating sample data inside workload pods 03/04/23 14:15:03.254
    STEP: Verifying that sample data has been generated 03/04/23 14:15:03.462
    STEP: Creating Storage Secret 03/04/23 14:15:03.667
    STEP: Creating Repository 03/04/23 14:15:03.7
    STEP: Creating Storage Secret 03/04/23 14:15:03.9
    STEP: Creating Repository 03/04/23 14:15:03.978
    STEP: Creating BackupConfiguration for the workloads 03/04/23 14:15:04.135
    STEP: Verifying that sidecar has been injected 03/04/23 14:15:04.721
    STEP: Waiting for Deployment to be ready with sidecar 03/04/23 14:15:09.256
    STEP: Verifying that sidecar has been injected 03/04/23 14:15:49.469
    STEP: Waiting for StatefulSet to be ready with sidecar 03/04/23 14:15:49.475
    STEP: Triggering an instant backup for the workloads 03/04/23 14:16:27.658
    STEP: Waiting for the backup processes to complete 03/04/23 14:16:27.766
    STEP: Verifying that the backup process has succeeded for the workloads 03/04/23 14:16:51.927
    STEP: Listing all snapshots 03/04/23 14:16:51.934
    STEP: Get a particular snapshot 03/04/23 14:16:54.081
    STEP: Filter by repository name 03/04/23 14:16:57.326
    STEP: Filter by hostname 03/04/23 14:16:58.47
    STEP: Filter by negated selector 03/04/23 14:17:00.484
    STEP: Filter by set based selector 03/04/23 14:17:01.571
    STEP: Deleting snapshot minio-stash-e2e-rh7kyx-cqwcer-1a66b4c9 03/04/23 14:17:05.753
    STEP: Checking deleted snapshot not exist 03/04/23 14:17:08.629
    [JustAfterEach] Snapshot Tests
      /src/test/e2e/misc/snapshots.go:46
    [AfterEach] Snapshot Tests
      /src/test/e2e/misc/snapshots.go:49
    STEP: Cleaning Test Resources 03/04/23 14:17:10.285
  << End Captured GinkgoWriter Output
------------------------------

sgielen commented 1 year ago

After seeing that the end-to-end test completed against MinIO, I replaced the restic binary in my test environment with v0.15.1. With that binary, a backup that previously took a long time and used a lot of memory, occasionally even failing with an out-of-memory error, now succeeded quite quickly. So it appears that upgrading to Restic v0.15.1 is indeed a great improvement for this issue.
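
No numbers were recorded for this comparison. One way to put a figure on memory use would be to run the backup under GNU time and read the peak resident set size from its report. This is a sketch under that assumption: `peak_rss_kb` is a hypothetical helper, and the log file name and backup path in the comments are placeholders.

```shell
# Print the peak resident set size in kB from a GNU `time -v` report file.
peak_rss_kb() {
  awk -F': ' '/Maximum resident set size/ { print $2 }' "$1"
}

# Intended usage (not executed here; needs GNU time and a configured restic
# repository, e.g. via RESTIC_REPOSITORY and RESTIC_PASSWORD):
#   /usr/bin/time -v -o restic-time.log restic backup /path/to/data
#   peak_rss_kb restic-time.log
```

Running the same backup once with the old binary and once with v0.15.1 would give two values that can be compared directly.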

hmsayem commented 1 year ago

Hello @sgielen. Thank you for reporting the issue. We need to migrate our existing repositories to the latest repository version because of the new compression support added in restic 0.14.0. This migration may take some additional work before we can upgrade the restic version.
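
For reference, with the plain restic CLI (0.14.0 or later) the repository format upgrade is done with `restic migrate upgrade_repo_v2`, optionally followed by a prune that repacks existing data so it gets compressed too. The sketch below only shows those CLI steps. How Stash itself would carry out the migration is not stated in this thread, and `run` and `migrate_repo_to_v2` are hypothetical names.

```shell
# Hypothetical wrapper: print the command by default, execute it when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "0" ]; then
    "$@"
  else
    printf '%s\n' "$*"
  fi
}

# Assumes the repository is configured through the environment,
# e.g. RESTIC_REPOSITORY and RESTIC_PASSWORD.
migrate_repo_to_v2() {
  # Upgrade the repository format from version 1 to version 2.
  run restic migrate upgrade_repo_v2
  # Optionally rewrite existing uncompressed data so that it is compressed as well.
  run restic prune --repack-uncompressed
}

migrate_repo_to_v2
```

By default this only prints the two commands. A repository upgraded this way can no longer be read by restic versions older than 0.14.0, so every client that accesses it has to be upgraded first.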