stashed / stash

🛅 Backup your Kubernetes Stateful Applications
https://stash.run

Stash randomly fails with `already locked` #1332

Open cwrau opened 3 years ago

cwrau commented 3 years ago

We regularly get the following error during our backups:

/bin/restic check --cache-dir /tmp/restic-cache
unable to create lock in backend: repository is already locked by PID 26219 on stash-stash-community-c7cc7fd6d-wr4k4
lock was created at 2021-03-23 00:05:52 (311.046091ms ago)

When I check for locks, there aren't any.

We don't access Stash manually, certainly not at that time, and never with anything that creates restic locks, so the only thing that can be creating these locks is Stash itself.
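
To make the error above easier to compare across nights, here is a minimal Python sketch that pulls the lock holder and the lock age out of the two quoted lines. The function names (`parse_lock_error`, `go_duration_seconds`) are hypothetical, and the patterns are modelled only on the output quoted in this issue; other restic versions may word the message differently.

```python
import re
from typing import Optional

# Patterns modelled on the two error lines quoted in this issue (an assumption,
# not a documented restic output format).
HOLDER_RE = re.compile(r"already locked by PID (?P<pid>\d+) on (?P<host>\S+)")
CREATED_RE = re.compile(
    r"lock was created at (?P<created>\S+ \S+) \((?P<age>\S+) ago\)"
)
# Go-style duration pieces such as "311.046091ms" or "1m30.5s".
DURATION_RE = re.compile(r"(\d+(?:\.\d+)?)(ns|us|µs|ms|s|m|h)")
UNIT_SECONDS = {
    "ns": 1e-9, "us": 1e-6, "µs": 1e-6, "ms": 1e-3,
    "s": 1.0, "m": 60.0, "h": 3600.0,
}


def go_duration_seconds(text: str) -> float:
    """Convert a Go-style duration string to seconds."""
    return sum(float(v) * UNIT_SECONDS[u] for v, u in DURATION_RE.findall(text))


def parse_lock_error(output: str) -> Optional[dict]:
    """Return the lock holder and, if present, the lock age; None if no match."""
    holder = HOLDER_RE.search(output)
    if holder is None:
        return None
    info = {"pid": int(holder["pid"]), "host": holder["host"]}
    created = CREATED_RE.search(output)
    if created is not None:
        info["created"] = created["created"]
        info["age_seconds"] = go_duration_seconds(created["age"])
    return info
```

A lock that is only a fraction of a second old and is held by the operator pod, as in the output above, points to a concurrent process rather than a stale lock left behind by a crashed one.
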

hossainemruz commented 3 years ago

Please check this wiki page to understand the possible reasons and workarounds: https://github.com/stashed/project/wiki/Repository-Get-Locked

cwrau commented 3 years ago

I looked at that, but the pods weren't restarted during the backup, and I don't think the temp dir is too slow: we have 10 Jobs running concurrently, and only occasionally does a single, random job fail.

cwrau commented 3 years ago

This happened again today for another backup job. Can I give you something to help analyse this?

hossainemruz commented 3 years ago

Can you please share the log from the backup job?

cwrau commented 3 years ago

Sure; backup.log

hossainemruz commented 3 years ago

Please use the `--all-containers` flag to get the full log. For example: `kubectl logs -n <namespace> <pod name> --all-containers`.

cwrau commented 3 years ago

I'm sorry, totally forgot about that, here it is; backup.log

hossainemruz commented 3 years ago

This line shows that the Repository has been locked by the Stash operator pod. Is there any possibility that someone was trying to list Snapshots while the backup was running?

[pod/stash-backup-customer-4ap-prod-4allportal-assets-161654408nkzh/update-status-1] 2021-03-24T00:00:38.445701659Z unable to create lock in backend: repository is already locked by PID 27700 on stash-stash-community-c7cc7fd6d-wr4k4 by  (UID 0, GID 0)
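
The reading of that line can be made mechanical. The following is a minimal Python sketch that splits such a prefixed log line into the pod and container that reported the error and the host that holds the lock. The function name `describe_lock_line` is hypothetical, and the `[pod/<pod>/<container>] <timestamp> <message>` layout is assumed from the single line quoted above.

```python
import re
from typing import Optional

# Layout assumed from the one line quoted in this issue:
#   [pod/<pod>/<container>] <timestamp> <message>
PREFIX_RE = re.compile(
    r"^\[pod/(?P<pod>[^/\]]+)/(?P<container>[^\]]+)\]\s+"
    r"(?P<timestamp>\S+)\s+(?P<message>.*)$"
)
HOLDER_RE = re.compile(r"already locked by PID (?P<pid>\d+) on (?P<host>\S+)")


def describe_lock_line(line: str) -> Optional[dict]:
    """Return who reported the lock error and who holds the lock, or None."""
    prefix = PREFIX_RE.match(line.strip())
    if prefix is None:
        return None
    holder = HOLDER_RE.search(prefix["message"])
    if holder is None:
        return None
    return {
        "reporting_pod": prefix["pod"],
        "reporting_container": prefix["container"],
        "timestamp": prefix["timestamp"],
        "holder_pid": int(holder["pid"]),
        "holder_host": holder["host"],
    }
```

For the line above, the reporting container is `update-status-1` in the backup Job's pod, while the holder host is the operator pod `stash-stash-community-c7cc7fd6d-wr4k4`.
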

cwrau commented 3 years ago

Except for stash itself, no

hossainemruz commented 3 years ago

Can you please share the YAML for the BackupConfiguration and Repository?

cwrau commented 3 years ago

Sure; r.txt bc.txt

hossainemruz commented 3 years ago

Those YAMLs look good. @cwrau What version of Stash are you using?

cwrau commented 3 years ago

We're using appscode/stash:v0.12.0 via version 2021.03.17 of the Helm Chart

hossainemruz commented 3 years ago

Can you share the YAML of the backup job?

cwrau commented 3 years ago

Do you mean the Kubernetes Job itself? That Job has already been deleted 😕

But I can send you the next failing one

hossainemruz commented 3 years ago

So this issue only happens sometimes, not always?

cwrau commented 3 years ago

Yes, not every night and not the same job every time

hossainemruz commented 3 years ago

That's interesting. In that case, the issue should not be related to any misconfiguration. It might be something else.

cwrau commented 3 years ago

Today another job failed. Here is its YAML: job.txt

hossainemruz commented 3 years ago

Can you please share the log from the Stash operator?

cwrau commented 3 years ago

Stash logs wayyy too much, can I filter for something?

Just 20 minutes of logs come to 4 MiB and more than 5000 lines.

cwrau commented 3 years ago

It happened again today; the logs and YAMLs are similar.

Can I filter the operator log for something?
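
One possible way to cut the operator log down, sketched in Python: keep only the lines that mention one of a few keywords. The function name `filter_lines` and the keyword choice (for example `lock` or the Repository name) are assumptions for illustration, not guidance from the Stash maintainers about what the operator actually logs.

```python
from typing import Iterable, Iterator, Sequence


def filter_lines(lines: Iterable[str], keywords: Sequence[str]) -> Iterator[str]:
    """Yield only the lines containing at least one keyword (case-insensitive)."""
    lowered = [keyword.lower() for keyword in keywords]
    for line in lines:
        text = line.lower()
        if any(keyword in text for keyword in lowered):
            yield line


# Example: read a saved operator log and print the matching lines.
# with open("operator.log") as handle:          # hypothetical file name
#     for line in filter_lines(handle, ["lock", "<repository name>"]):
#         print(line, end="")
```

Because the function accepts any iterable of lines, it works on a saved log file as well as on `sys.stdin` when the output of `kubectl logs -n <namespace> <pod name>` is piped into a script.
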

cwrau commented 3 years ago

Today another two backups failed. How can I assist you with this issue?

cwrau commented 3 years ago

This is still happening, again today with two of our backups

cwrau commented 3 years ago

Happened again today, what can I do to help?