scylladb / scylla-manager

The Scylla Manager
https://manager.docs.scylladb.com/stable/
Other
52 stars 34 forks source link

backup consistently failing with HTTP 404 #2893

Closed fgelcer closed 1 year ago

fgelcer commented 3 years ago

when running on region eu-north-1 and the bucket is on us-east-1. test where seeing it now is on longevity-200gb-48h and the backup nemesis ran twice (and twice failed with the same error).

from the manager logs:

Oct 18 14:31:41 longevity-200gb-48h-verify-limited--monitor-node-5db72bed-1 scylla-manager[18920]: {"L":"INFO","T":"2021-10-18T14:31:41.864Z","N":"backup.upload","M":"Uploading table snapshot","host":"10.0.1.167","keyspace":"system_schema","table":"types","location":"s3:manager-backup-tests-us-east-1","_trace_id":"lhKFZ5ZuTfmZuJ4jUbBbyg"}
Oct 18 14:31:41 longevity-200gb-48h-verify-limited--monitor-node-5db72bed-1 scylla-manager[18920]: {"L":"INFO","T":"2021-10-18T14:31:41.871Z","N":"backup.upload","M":"Uploading table snapshot","host":"10.0.1.167","keyspace":"system_schema","table":"triggers","location":"s3:manager-backup-tests-us-east-1","_trace_id":"lhKFZ5ZuTfmZuJ4jUbBbyg"}
Oct 18 14:31:41 longevity-200gb-48h-verify-limited--monitor-node-5db72bed-1 scylla-manager[18920]: {"L":"INFO","T":"2021-10-18T14:31:41.873Z","N":"backup.upload","M":"Uploading table snapshot","host":"10.0.1.167","keyspace":"system_schema","table":"computed_columns","location":"s3:manager-backup-tests-us-east-1","_trace_id":"lhKFZ5ZuTfmZuJ4jUbBbyg"}
Oct 18 14:31:41 longevity-200gb-48h-verify-limited--monitor-node-5db72bed-1 scylla-manager[18920]: {"L":"ERROR","T":"2021-10-18T14:31:41.874Z","N":"backup.upload","M":"Uploading snapshot files failed on host","host":"10.0.1.167","error":"keyspace1.standard1: upload snapshot: copy \"data:keyspace1/standard1-4b742780301611ec8e296a5ec176e97a/snapshots/sm_20211018143125UTC\" to \"s3:manager-backup-tests-us-east-1/backup/sst/cluster/e1277202-c653-4c34-aa49-ae64f0fcc11f/dc/eu-north/node/04976835-2be2-4635-8264-db8b05e9a033/keyspace/keyspace1/table/standard1/4b742780301611ec8e296a5ec176e97a\": agent [HTTP 404] Not found; system_schema.columns: upload snapshot: copy \"data:system_schema/columns-24101c25a2ae3af787c1b40ee1aca33f/snapshots/sm_20211018143125UTC\" to \"s3:manager-backup-tests-us-east-1/backup/sst/cluster/e1277202-c653-4c34-aa49-ae64f0fcc11f/dc/eu-north/node/04976835-2be2-4635-8264-db8b05e9a033/keyspace/system_schema/table/columns/24101c25a2ae3af787c1b40ee1aca33f\": agent [HTTP 404] Not found; system_schema.tables: upload snapshot: copy \"data:system_schema/tables-afddfb9dbc1e30688056eed6c302ba09/snapshots/sm_20211018143125UTC\" to \"s3:manager-backup-tests-us-east-1/backup/sst/cluster/e1277202-c653-4c34-aa49-ae64f0fcc11f/dc/eu-north/node/04976835-2be2-4635-8264-db8b05e9a033/keyspace/system_schema/table/tables/afddfb9dbc1e30688056eed6c302ba09\": agent [HTTP 404] Not found; system_schema.scylla_tables: upload snapshot: copy \"data:system_schema/scylla_tables-5d912ff1f7593665b2c88042ab5103dd/snapshots/sm_20211018143125UTC\" to \"s3:manager-backup-tests-us-east-1/backup/sst/cluster/e1277202-c653-4c34-aa49-ae64f0fcc11f/dc/eu-north/node/04976835-2be2-4635-8264-db8b05e9a033/keyspace/system_schema/table/scylla_tables/5d912ff1f7593665b2c88042ab5103dd\": agent [HTTP 404] Not found; system_schema.keyspaces: upload snapshot: copy \"data:system_schema/keyspaces-abac5682dea631c5b535b3d6cffd0fb6/snapshots/sm_20211018143125UTC\" to \"s3:manager-backup-tests-us-east-1/backup/sst/cluster/e1277202-c653-4c34-aa49-ae64f0fcc11f/dc/eu-north/node/04976835-2be2-4635-8264-db8b05e9a033/keyspace/system_schema/table/keyspaces/abac5682dea631c5b535b3d6cffd0fb6\": agent [HTTP 404] Not found; system_schema.views: upload snapshot: copy \"data:system_schema/views-9786ac1cdd583201a7cdad556410c985/snapshots/sm_20211018143125UTC\" to \"s3:manager-backup-tests-us-east-1/backup/sst/cluster/e1277202-c653-4c34-aa49-ae64f0fcc11f/dc/eu-north/node/04976835-2be2-4635-8264-db8b05e9a033/keyspace/system_schema/table/views/9786ac1cdd583201a7cdad556410c985\": agent [HTTP 404] Not found; system_schema.dropped_columns: upload snapshot: copy \"data:system_schema/dropped_columns-5e7583b5f3f43af19a39b7e1d6f5f11f/snapshots/sm_20211018143125UTC\" to \"s3:manager-backup-tests-us-east-1/backup/sst/cluster/e1277202-c653-4c34-aa49-ae64f0fcc11f/dc/eu-north/node/04976835-2be2-4635-8264-db8b05e9a033/keyspace/system_schema/table/dropped_columns/5e7583b5f3f43af19a39b7e1d6f5f11f\": agent [HTTP 404] Not found; system_schema.functions: upload snapshot: copy \"data:system_schema/functions-96489b7980be3e14a70166a0b9159450/snapshots/sm_20211018143125UTC\" to \"s3:manager-backup-tests-us-east-1/backup/sst/cluster/e1277202-c653-4c34-aa49-ae64f0fcc11f/dc/eu-north/node/04976835-2be2-4635-8264-db8b05e9a033/keyspace/system_schema/table/functions/96489b7980be3e14a70166a0b9159450\": agent [HTTP 404] Not found; system_schema.aggregates: upload snapshot: copy \"data:system_schema/aggregates-924c55872e3a345bb10c12f37c1ba895/snapshots/sm_20211018143125UTC\" to \"s3:manager-backup-tests-us-east-1/backup/sst/cluster/e1277202-c653-4c34-aa49-ae64f0fcc11f/dc/eu-north/node/04976835-2be2-4635-8264-db8b05e9a033/keyspace/system_schema/table/aggregates/924c55872e3a345bb10c12f37c1ba895\": agent [HTTP 404] Not found; system_schema.indexes: upload snapshot: copy \"data:system_schema/indexes-0feb57ac311f382fba6d9024d305702f/snapshots/sm_20211018143125UTC\" to \"s3:manager-backup-tests-us-east-1/backup/sst/cluster/e1277202-c653-4c34-aa49-ae64f0fcc11f/dc/eu-north/node/04976835-2be2-4635-8264-db8b05e9a033/keyspace/system_schema/table/indexes/0feb57ac311f382fba6d9024d305702f\": agent [HTTP 404] Not found; system_schema.view_virtual_columns: upload snapshot: copy \"data:system_schema/view_virtual_columns-08843b6345dc3be29798a0418295cfaa/snapshots/sm_20211018143125UTC\" to \"s3:manager-backup-tests-us-east-1/backup/sst/cluster/e1277202-c653-4c34-aa49-ae64f0fcc11f/dc/eu-north/node/04976835-2be2-4635-8264-db8b05e9a033/keyspace/system_schema/table/view_virtual_columns/08843b6345dc3be29798a0418295cfaa\": agent [HTTP 404] Not found; system_schema.types: upload snapshot: copy \"data:system_schema/types-5a8b1ca866023f77a0459273d308917a/snapshots/sm_20211018143125UTC\" to \"s3:manager-backup-tests-us-east-1/backup/sst/cluster/e1277202-c653-4c34-aa49-ae64f0fcc11f/dc/eu-north/node/04976835-2be2-4635-8264-db8b05e9a033/keyspace/system_schema/table/types/5a8b1ca866023f77a0459273d308917a\": agent [HTTP 404] Not found; system_schema.triggers: upload snapshot: copy \"data:system_schema/triggers-4df70b666b05325195a132b54005fd48/snapshots/sm_20211018143125UTC\" to \"s3:manager-backup-tests-us-east-1/backup/sst/cluster/e1277202-c653-4c34-aa49-ae64f0fcc11f/dc/eu-north/node/04976835-2be2-4635-8264-db8b05e9a033/keyspace/system_schema/table/triggers/4df70b666b05325195a132b54005fd48\": agent [HTTP 404] Not found; system_schema.computed_columns: upload snapshot: copy \"data:system_schema/computed_columns-cc7c7069374033c192a4c3de78dbd2c4/snapshots/sm_20211018143125UTC\" to \"s3:manager-backup-tests-us-east-1/backup/sst/cluster/e1277202-c653-4c34-aa49-ae64f0fcc11f/dc/eu-north/node/04976835-2be2-4635-8264-db8b05e9a033/keyspace/system_schema/table/computed_columns/cc7c7069374033c192a4c3de78dbd2c4\": agent [HTTP 404] Not found","_trace_id":"lhKFZ5ZuTfmZuJ4jUbBbyg","S":"github.com/scylladb/go-log.Logger.log\n\tgithub.com/scylladb/go-log@v0.0.6/logger.go:101\ngithub.com/scylladb/go-log.Logger.Error\n\tgithub.com/scylladb/go-log@v0.0.6/logger.go:84\ngithub.com/scylladb/scylla-manager/pkg/service/backup.(*worker).Upload.func2\n\tgithub.com/scylladb/scylla-manager/pkg/service/backup/worker_upload.go:29\ngithub.com/scylladb/scylla-manager/pkg/service/backup.hostsInParallel.func1\n\tgithub.com/scylladb/scylla-manager/pkg/service/backup/parallel.go:80\ngithub.com/scylladb/scylla-manager/pkg/util/parallel.Run.func1\n\tgithub.com/scylladb/scylla-manager/pkg/util/parallel/parallel.go:69"}
Oct 18 14:31:41 longevity-200gb-48h-verify-limited--monitor-node-5db72bed-1 scylla-manager[18920]: {"L":"ERROR","T":"2021-10-18T14:31:41.874Z","N":"backup.upload","M":"Uploading snapshot files failed see exact errors above","duration":"32.367476ms","_trace_id":"lhKFZ5ZuTfmZuJ4jUbBbyg","S":"github.com/scylladb/go-log.Logger.log\n\tgithub.com/scylladb/go-log@v0.0.6/logger.go:101\ngithub.com/scylladb/go-log.Logger.Error\n\tgithub.com/scylladb/go-log@v0.0.6/logger.go:84\ngithub.com/scylladb/scylla-manager/pkg/service/backup.(*worker).Upload.func1\n\tgithub.com/scylladb/scylla-manager/pkg/service/backup/worker_upload.go:20\ngithub.com/scylladb/scylla-manager/pkg/service/backup.(*worker).Upload\n\tgithub.com/scylladb/scylla-manager/pkg/service/backup/worker_upload.go:26\ngithub.com/scylladb/scylla-manager/pkg/service/backup.(*Service).Backup.func8\n\tgithub.com/scylladb/scylla-manager/pkg/service/backup/service.go:764\ngithub.com/scylladb/scylla-manager/pkg/service/backup.(*Service).Backup.func11\n\tgithub.com/scylladb/scylla-manager/pkg/service/backup/service.go:808\ngithub.com/scylladb/scylla-manager/pkg/service/backup.(*Service).Backup\n\tgithub.com/scylladb/scylla-manager/pkg/service/backup/service.go:812\ngithub.com/scylladb/scylla-manager/pkg/service/backup.Runner.Run\n\tgithub.com/scylladb/scylla-manager/pkg/service/backup/runner.go:26\ngithub.com/scylladb/scylla-manager/pkg/service/scheduler.PolicyRunner.Run\n\tgithub.com/scylladb/scylla-manager/pkg/service/scheduler/policy.go:32\ngithub.com/scylladb/scylla-manager/pkg/service/scheduler.(*Service).run.func2\n\tgithub.com/scylladb/scylla-manager/pkg/service/scheduler/service.go:386"}

full logs can be found here: db-cluster - https://cloudius-jenkins-test.s3.amazonaws.com/5db72bed-ee4c-4071-842d-5ba37f9c73a4/20211020_145519/db-cluster-5db72bed.tar.gz monitor-set - https://cloudius-jenkins-test.s3.amazonaws.com/5db72bed-ee4c-4071-842d-5ba37f9c73a4/20211020_145519/monitor-set-5db72bed.tar.gz

fgelcer commented 3 years ago

seen in other jobs as well, but so far, all of them on eu-north-1

fgelcer commented 3 years ago

@ShlomiBalalis FYI

fgelcer commented 3 years ago

also, i'm trying to reproduce this issue, using the sct_feature job copied from manager-master here

fgelcer commented 3 years ago

@mmatczuk , any idea on why it is happening?

mmatczuk commented 3 years ago

From what I can see in the logs the files are being uploaded and then the 404 occurs. One possibility is that the files were moved, is that possible for snapshots to be moved? The other possibility is that there is a bug in listing, I see something similar here in unit tests I will take a look.

mmatczuk commented 3 years ago

The unit test thing is stable I ran stress test on it locally with 200 runs no failures. Adding table columns or removing the table seems to have no effect on the snapshot (confirmed by @denesb). It looks the only possibility here is either

mmatczuk commented 3 years ago

@fgelcer one thing I notice is that you are running mismatched versions

Oct 18 13:20:17 longevity-200gb-48h-verify-limited--monitor-node-5db72bed-1 scylla-manager[18920]: {"L":"INFO","T":"2021-10-18T13:20:17.521Z","M":"Scylla Manager Server","version":"2.6-dev-0.20211017.463bec783d0","pid":18920,"_trace_id":"y1Kph9jSSUCw3QgPs-os-Q"}

vs

2021-10-18T13:07:09+00:00  longevity-200gb-48h-verify-limited--db-node-5db72bed-3 !INFO    | scylla-manager-agent[15802]: {"L":"INFO","T":"2021-10-18T13:07:09.498Z","M":"Scylla Manager Agent","version":"666.dev-0.20210518.f85e9296","pid":15802,"_trace_id":"nk5Rt-WJSM2_kvWhIwt0Kg"}

which results in some invalid calls

2021-10-18T14:31:41+00:00  longevity-200gb-48h-verify-limited--db-node-5db72bed-3 !INFO    | scylla-manager-agent[15965]: {"L":"ERROR","T":"2021-10-18T14:31:41.864Z","N":"rclone","M":"SECURITY call to unexported endpoint [path=sync/move
dir, ip=10.0.2.14:44928]","S":"github.com/scylladb/go-log.Logger.log\n\tgithub.com/scylladb/go-log@v0.0.6/logger.go:101\ngithub.com/scylladb/go-log.Logger.Error\n\tgithub.com/scylladb/go-log@v0.0.6/logger.go:84\ngithub.com/scylladb/scyl
la-manager/pkg/rclone.RedirectLogPrint.func1\n\tgithub.com/scylladb/scylla-manager/pkg/rclone/logger.go:19\ngithub.com/rclone/rclone/fs.LogPrintf\n\tgithub.com/rclone/rclone@v1.51.0/fs/log.go:152\ngithub.com/rclone/rclone/fs.Errorf\n\tg
ithub.com/rclone/rclone@v1.51.0/fs/log.go:167\ngithub.com/scylladb/scylla-manager/pkg/rclone/rcserver.Server.ServeHTTP\n\tgithub.com/scylladb/scylla-manager/pkg/rclone/rcserver/rcserver.go:214\nnet/http.StripPrefix.func1\n\tnet/http/ser
ver.go:2112\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2069\ngithub.com/go-chi/chi/v5.(*Mux).Mount.func1\n\tgithub.com/go-chi/chi/v5@v5.0.0/mux.go:311\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2069\ngithub.com/go-
chi/chi/v5.(*Mux).routeHTTP\n\tgithub.com/go-chi/chi/v5@v5.0.0/mux.go:436\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2069\ngithub.com/go-chi/chi/v5.(*Mux).ServeHTTP\n\tgithub.com/go-chi/chi/v5@v5.0.0/mux.go:70\ngithub.com/go-
chi/chi/v5.(*Mux).Mount.func1\n\tgithub.com/go-chi/chi/v5@v5.0.0/mux.go:311\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2069\ngithub.com/scylladb/scylla-manager/pkg/auth.ValidateToken.func1.1\n\tgithub.com/scylladb/scylla-mana
ger/pkg/auth/auth.go:49\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2069\ngithub.com/go-chi/chi/v5.(*ChainHandler).ServeHTTP\n\tgithub.com/go-chi/chi/v5@v5.0.0/chain.go:31\ngithub.com/go-chi/chi/v5.(*Mux).routeHTTP\n\tgithub.c
om/go-chi/chi/v5@v5.0.0/mux.go:436\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2069\ngithub.com/go-chi/chi/v5/middleware.RequestLogger.func1.1\n\tgithub.com/go-chi/chi/v5@v5.0.0/middleware/logger.go:57\nnet/http.HandlerFunc.Se
rveHTTP\n\tnet/http/server.go:2069\ngithub.com/go-chi/chi/v5.(*Mux).ServeHTTP\n\tgithub.com/go-chi/chi/v5@v5.0.0/mux.go:87\nnet/http.serverHandler.ServeHTTP\n\tnet/http/server.go:2887\nnet/http.(*conn).serve\n\tnet/http/server.go:1952"}
2021-10-18T14:31:41+00:00  longevity-200gb-48h-verify-limited--db-node-5db72bed-3 !INFO    | scylla-manager-agent[15965]: {"L":"INFO","T":"2021-10-18T14:31:41.865Z","N":"http","M":"POST /agent/rclone/sync/movedir?_async=true","from":"10
.0.2.14:44928","status":404,"bytes":37,"duration":"0ms"}
mmatczuk commented 3 years ago

@fgelcer it seems that it IS the issue

{"L":"ERROR","T":"2021-10-18T14:51:26.280Z","N":"rclone","M":"SECURITY call to unexported endpoint [path=sync/movedir, ip=10.0.2.14:48614]","S":"github.com/scylladb/go-log.Logger.log\n\tgithub.com/scylladb/go-log@v0.0.6/logger.go:1>
{"L":"INFO","T":"2021-10-18T14:51:26.280Z","N":"http","M":"POST /agent/rclone/sync/movedir?_async=true","from":"10.0.2.14:48614","status":404,"bytes":37,"duration":"0ms"}
fgelcer commented 3 years ago

@ShlomiBalalis , please find why we have a mismatch in the versions between SM and the agents

ShlomiBalalis commented 3 years ago

The problem we don't use manager unified build, but the old build path, which wasn't updated for a while. I was about to change it to the unified build, but I have a small issue with it - https://github.com/scylladb/scylla-pkg/issues/2489

Once this issue closes I'll make the change

fgelcer commented 2 years ago

The problem we don't use manager unified build, but the old build path, which wasn't updated for a while. I was about to change it to the unified build, but I have a small issue with it - scylladb/scylla-pkg#2489

Once this issue closes I'll make the change

@mmatczuk @ShlomiBalalis @yaronkaikov , the issue mentioned here was closed some 12 days ago, and a test by @soyacz was started some 4 days ago, and it has similar issues to this issue here...

@soyacz , can you please attach here logs, so they can check if there are other possible issues around it?

soyacz commented 2 years ago

Yes, my errors was similar, but my branch is older than 12 days and I test against 4.5.1. Logs: db-cluster - https://cloudius-jenkins-test.s3.amazonaws.com/4506f7c8-dbd4-4f2c-a880-9f467c5929af/20211125_190738/db-cluster-4506f7c8.tar.gz loader-set - https://cloudius-jenkins-test.s3.amazonaws.com/4506f7c8-dbd4-4f2c-a880-9f467c5929af/20211125_190738/loader-set-4506f7c8.tar.gz monitor-set - https://cloudius-jenkins-test.s3.amazonaws.com/4506f7c8-dbd4-4f2c-a880-9f467c5929af/20211125_190738/monitor-set-4506f7c8.tar.gz sct-runner - https://cloudius-jenkins-test.s3.amazonaws.com/4506f7c8-dbd4-4f2c-a880-9f467c5929af/20211125_190738/sct-runner-4506f7c8.tar.gz

versions found in logs: l:742 c:sdcm.mgmt.cli p:INFO > Initiating Scylla-Manager, version: [['Client version: 2.6-dev-0.20211115.6fbf87ea9e8-SNAPSHOT-6fbf87ea9e8'], ['Server version: 2.6-dev-0.20211115.6fbf87ea9e8-SNAPSHOT-6fbf87ea9e8']] p:INFO > Node longevity-many-clients-4h-longevit-db-node-4506f7c8-3 [18.202.55.68 | 10.0.1.33] (seed: False): node longevity-many-clients-4h-longevit-db-node-4506f7c8-3 has scylla-manager-agent version 666.dev-0.20210518.f85e9296

and log line with the issue: 10.0.2.236: keyspace1.standard1: upload snapshot: copy "data:keyspace1/standard1-d37a65704deb11ec98f1474b5a6a2571/snapshots/sm_20211125125501UTC" to "s3:manager-backup-tests-us-east-1/backup/sst/cluster/8a2bf3e0-5da7-4e4a-876e-576fd013b27d/dc/eu-west/node/cc1eee08-b353-4020-a874-b4db60aafd47/keyspace/keyspace1/table/standard1/d37a65704deb11ec98f1474b5a6a2571": agent [HTTP 404] Not found;

mmatczuk commented 2 years ago

Last time you had a version mismatch and you were hitting not existing endpoints thus 404.

fgelcer commented 2 years ago

The problem we don't use manager unified build, but the old build path, which wasn't updated for a while. I was about to change it to the unified build, but I have a small issue with it - scylladb/scylla-pkg#2489

Once this issue closes I'll make the change

@ShlomiBalalis , seems like something is still broken, could you check please?

mmatczuk commented 2 years ago

@fgelcer from the NEW logs, check the DATE.

2021-11-25T12:18:20+00:00  longevity-many-clients-4h-longevit-db-node-4506f7c8-2 !INFO    | scylla-manager-agent[20283]: {"L":"INFO","T":"2021-11-25T12:18:20.521Z","M":"Scylla Manager Agent","version":"666.dev-0.20210518.f85e9296","pid"
:20283,"_trace_id":"025oCeQISw6zN4jjvTvuZw"}
fgelcer commented 2 years ago

@mmatczuk , what is there with the date? it ran 5 days ago, and when it ran, the fix on scylladb/scylla-pkg#2489 was supposed to be fixed already, no?

mmatczuk commented 2 years ago

The version is 666.dev-0.20210518.f85e9296

ShlomiBalalis commented 2 years ago

@soyacz Are you rebased with master? The fix was not backported: https://github.com/scylladb/scylla-cluster-tests/pull/4099/

ShlomiBalalis commented 2 years ago

@soyacz was not rebased. I believe we can close this issue.