Open mikliapko opened 1 week ago
In general, the restore task no longer fails directly on insufficient disk space, because we implemented batch and host retries some time ago. A node that fails is excluded from the rest of the restore, and its failed batch is restored by other nodes. The problem is that the node without disk space still causes problems for the other nodes, so the restore fails with std::runtime_error (send_meta_data: got error code=-1 from node=10.138.0.37).
We should improve error handling to reduce the number of retry-related errors in the logs. We could also improve the error message itself. However, neither change will be included in the 3.4 release, and they are not that urgent in general.
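The batch/host retry behavior described above can be sketched roughly as follows. This is a simplified illustration, not Scylla Manager's actual code; the function and parameter names (restore_with_retries, download) are hypothetical:

```python
# Simplified sketch of batch/host retries during restore.
# Not Scylla Manager's actual implementation; all names are hypothetical.

def restore_with_retries(batches, hosts, download):
    """Try each batch on the available hosts. A host that fails is
    excluded from the rest of the restore, and its batch is retried
    on another host."""
    failed_hosts = set()
    for batch in batches:
        restored = False
        for host in hosts:
            if host in failed_hosts:
                continue  # an excluded node no longer participates
            try:
                download(host, batch)
                restored = True
                break
            except OSError:
                # e.g. ENOSPC on this node: exclude it and retry
                # the batch on the next host
                failed_hosts.add(host)
        if not restored:
            raise RuntimeError(f"no host could restore batch {batch}")
```

Note that even with this scheme, the excluded node can still disrupt in-flight work on the healthy nodes (the send_meta_data error above), which is why the retry logs get noisy.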
Alright, I will disable the failing error message validation until we improve the logging approach.
Trying to run a restore operation when one of the cluster nodes has reached ENOSPC.
For the previous Manager 3.3, sctool progress returned a direct error indication:

create run progress: validate free disk space: not enough disk space

The latest Manager (3.4 release candidate) returns a somewhat different cause:

failed to restore sstables from location gcs:manager-backup-tests-sct-project-1-us-east1 table keyspace1.standard1 (993831798 bytes). See logs for more info

which fails one of the SCT tests: Argus run.
@Michal-Leszczynski I suppose the new error message looks good as well. I just want to make sure it was an intentional change rather than an unexpected refactoring side effect. Please let me know if the change is intentional, and I'll adjust the SCT test accordingly.