Issues with Pausing and Resuming Backup

scylladb / scylla-manager

The Scylla Manager

https://manager.docs.scylladb.com/stable/

Issues with Pausing and Resuming Backup #3881

Closed Surrina-ki closed 2 weeks ago

Surrina-ki commented 3 weeks ago

I encountered some issues while testing the pause and resume functionality of Scylla Manager's backup operations. First, I started a backup task using the command. Then, I paused the task using the stop command. When I attempted to resume the task using the start command, the status quickly changed to DONE, but the progress did not reach 100%.

I then tried using suspend and resume, but the resume did not take effect.

Michal-Leszczynski commented 2 weeks ago

Hi @Surrina-ki, thanks for reporting!

The strange thing is that at the MOVE MANIFEST stage the manager should have already uploaded all files, so every table should show either 100% progress or some errors. So this looks like either a progress display issue or a size calculation issue.

Could you send the manager logs so that I can verify what happened?

Surrina-ki commented 2 weeks ago

> The strange thing is that at the MOVE MANIFEST stage the manager should have already uploaded all files, so every table should show either 100% progress or some errors. So this looks like either a progress display issue or a size calculation issue.

It's not just a display issue. Although the task shows "DONE", restoring data from that backup reports an error.

> Could you send the manager logs so that I can verify what happened?

Here's the log that appeared when I executed the following commands:

sctool backup -c test -L 's3:test' -K 'test_keyspace2' --rate-limit 100
sctool -c test stop backup/1f2c7378-fd23-46c1-82b3-6528b5f27f86
sctool -c test progress backup/1f2c7378-fd23-46c1-82b3-6528b5f27f86

Then I executed the start command:

Michal-Leszczynski commented 2 weeks ago

What SM version are you using? If it's older than 3.2.6, then I suspect that this is caused by https://github.com/scylladb/scylla-manager/issues/3729, which was fixed in SM 3.2.6. If that's the case, please update SM to the newest version and check whether errors during the upload stage are reported correctly.
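The "older than 3.2.6" check above is easy to get wrong with plain string comparison, so here is a minimal sketch of a numeric comparison. The helper names `parse_version` and `predates_fix` are made up for illustration and are not part of Scylla Manager or sctool. It assumes the version string has the plain `MAJOR.MINOR.PATCH` form, optionally with a leading `v` or a `-suffix`.

```python
def parse_version(version: str) -> tuple:
    """Turn '3.2.6' (or 'v3.2.6-something') into (3, 2, 6).

    Assumption: the core of the version is dot-separated integers.
    """
    core = version.strip().lstrip("v").split("-", 1)[0]
    return tuple(int(part) for part in core.split("."))


# Release that, per the comment above, contains the fix for issue #3729.
FIXED_IN = parse_version("3.2.6")


def predates_fix(version: str) -> bool:
    """Return True if the given SM version is older than 3.2.6."""
    return parse_version(version) < FIXED_IN


print(predates_fix("3.2.3"))  # True
print(predates_fix("3.2.8"))  # False
```

Comparing tuples of integers avoids the pitfall where "3.2.10" sorts before "3.2.6" as a string.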

Surrina-ki commented 2 weeks ago

> What SM version are you using? If it's older than 3.2.6, then I suspect that this is caused by #3729, which was fixed in SM 3.2.6. If that's the case, please update SM to the newest version and check whether errors during the upload stage are reported correctly.

Thank you. I was using version 3.2.3. After switching to version 3.2.8, I was able to resume backups normally after stopping and starting.

Surrina-ki commented 2 weeks ago

@Michal-Leszczynski I also want to ask about the agent logs, which contain many entries like the ones below. Could they cause any issues?

{"L":"INFO","T":"2024-06-13T06:52:42.008Z","M":"http: TLS handshake error from 192.168.100.4:50596: EOF"}
{"L":"INFO","T":"2024-06-13T06:52:42.008Z","M":"http: TLS handshake error from 192.168.100.4:62548: read tcp 192.168.100.100:10001->192.168.100.4:62548: read: connection reset by peer"}

Michal-Leszczynski commented 2 weeks ago

These messages can show up in parts of the logs from time to time. They are probably caused by temporary connectivity or infrastructure issues. You don't need to worry about them: if they broke something, it would also be visible in other, more specific error messages.
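To judge whether these entries are occasional noise or a steady stream from one host, it can help to count them per peer. The sketch below is a hypothetical helper, not a Scylla Manager tool. It assumes the agent log is one JSON object per line with the message in the `M` field, as in the two lines quoted above, and that peers are IPv4 addresses.

```python
import json
from collections import Counter

# The two agent log lines quoted earlier in this thread.
SAMPLE = [
    '{"L":"INFO","T":"2024-06-13T06:52:42.008Z","M":"http: TLS handshake error from 192.168.100.4:50596: EOF"}',
    '{"L":"INFO","T":"2024-06-13T06:52:42.008Z","M":"http: TLS handshake error from 192.168.100.4:62548: read tcp 192.168.100.100:10001->192.168.100.4:62548: read: connection reset by peer"}',
]

PREFIX = "http: TLS handshake error from "


def count_handshake_errors(lines):
    """Count TLS handshake error entries per peer IP address."""
    counts = Counter()
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not JSON
        if not isinstance(entry, dict):
            continue
        message = entry.get("M")
        if isinstance(message, str) and message.startswith(PREFIX):
            # "192.168.100.4:50596: EOF" -> "192.168.100.4" (IPv4 assumed)
            peer = message[len(PREFIX):].split(":", 1)[0]
            counts[peer] += 1
    return counts


print(dict(count_handshake_errors(SAMPLE)))  # {'192.168.100.4': 2}
```

To run it against a real log, pass an open file object instead of `SAMPLE`. A high count from a single address would point at whatever runs on that host, for example a health check or port scanner that opens connections without completing the TLS handshake.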