seen in other jobs as well, but so far, all of them on eu-north-1
@ShlomiBalalis FYI
also, I'm trying to reproduce this issue using the sct_feature job copied from manager-master here
@mmatczuk , any idea why this is happening?
From what I can see in the logs, the files are being uploaded and then the 404 occurs. One possibility is that the files were moved; is it possible for snapshots to be moved? The other possibility is that there is a bug in listing. I see something similar here in unit tests; I will take a look.
The unit test is stable; I ran a stress test on it locally with 200 runs and no failures. Adding table columns or removing the table seems to have no effect on the snapshot (confirmed by @denesb). It looks like the only possibility here is one of the two mentioned above.
@fgelcer one thing I notice is that you are running mismatched versions
Oct 18 13:20:17 longevity-200gb-48h-verify-limited--monitor-node-5db72bed-1 scylla-manager[18920]: {"L":"INFO","T":"2021-10-18T13:20:17.521Z","M":"Scylla Manager Server","version":"2.6-dev-0.20211017.463bec783d0","pid":18920,"_trace_id":"y1Kph9jSSUCw3QgPs-os-Q"}
vs
2021-10-18T13:07:09+00:00 longevity-200gb-48h-verify-limited--db-node-5db72bed-3 !INFO | scylla-manager-agent[15802]: {"L":"INFO","T":"2021-10-18T13:07:09.498Z","M":"Scylla Manager Agent","version":"666.dev-0.20210518.f85e9296","pid":15802,"_trace_id":"nk5Rt-WJSM2_kvWhIwt0Kg"}
which results in some invalid calls:
2021-10-18T14:31:41+00:00 longevity-200gb-48h-verify-limited--db-node-5db72bed-3 !INFO | scylla-manager-agent[15965]: {"L":"ERROR","T":"2021-10-18T14:31:41.864Z","N":"rclone","M":"SECURITY call to unexported endpoint [path=sync/move
dir, ip=10.0.2.14:44928]","S":"github.com/scylladb/go-log.Logger.log\n\tgithub.com/scylladb/go-log@v0.0.6/logger.go:101\ngithub.com/scylladb/go-log.Logger.Error\n\tgithub.com/scylladb/go-log@v0.0.6/logger.go:84\ngithub.com/scylladb/scyl
la-manager/pkg/rclone.RedirectLogPrint.func1\n\tgithub.com/scylladb/scylla-manager/pkg/rclone/logger.go:19\ngithub.com/rclone/rclone/fs.LogPrintf\n\tgithub.com/rclone/rclone@v1.51.0/fs/log.go:152\ngithub.com/rclone/rclone/fs.Errorf\n\tg
ithub.com/rclone/rclone@v1.51.0/fs/log.go:167\ngithub.com/scylladb/scylla-manager/pkg/rclone/rcserver.Server.ServeHTTP\n\tgithub.com/scylladb/scylla-manager/pkg/rclone/rcserver/rcserver.go:214\nnet/http.StripPrefix.func1\n\tnet/http/ser
ver.go:2112\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2069\ngithub.com/go-chi/chi/v5.(*Mux).Mount.func1\n\tgithub.com/go-chi/chi/v5@v5.0.0/mux.go:311\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2069\ngithub.com/go-
chi/chi/v5.(*Mux).routeHTTP\n\tgithub.com/go-chi/chi/v5@v5.0.0/mux.go:436\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2069\ngithub.com/go-chi/chi/v5.(*Mux).ServeHTTP\n\tgithub.com/go-chi/chi/v5@v5.0.0/mux.go:70\ngithub.com/go-
chi/chi/v5.(*Mux).Mount.func1\n\tgithub.com/go-chi/chi/v5@v5.0.0/mux.go:311\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2069\ngithub.com/scylladb/scylla-manager/pkg/auth.ValidateToken.func1.1\n\tgithub.com/scylladb/scylla-mana
ger/pkg/auth/auth.go:49\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2069\ngithub.com/go-chi/chi/v5.(*ChainHandler).ServeHTTP\n\tgithub.com/go-chi/chi/v5@v5.0.0/chain.go:31\ngithub.com/go-chi/chi/v5.(*Mux).routeHTTP\n\tgithub.c
om/go-chi/chi/v5@v5.0.0/mux.go:436\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2069\ngithub.com/go-chi/chi/v5/middleware.RequestLogger.func1.1\n\tgithub.com/go-chi/chi/v5@v5.0.0/middleware/logger.go:57\nnet/http.HandlerFunc.Se
rveHTTP\n\tnet/http/server.go:2069\ngithub.com/go-chi/chi/v5.(*Mux).ServeHTTP\n\tgithub.com/go-chi/chi/v5@v5.0.0/mux.go:87\nnet/http.serverHandler.ServeHTTP\n\tnet/http/server.go:2887\nnet/http.(*conn).serve\n\tnet/http/server.go:1952"}
2021-10-18T14:31:41+00:00 longevity-200gb-48h-verify-limited--db-node-5db72bed-3 !INFO | scylla-manager-agent[15965]: {"L":"INFO","T":"2021-10-18T14:31:41.865Z","N":"http","M":"POST /agent/rclone/sync/movedir?_async=true","from":"10
.0.2.14:44928","status":404,"bytes":37,"duration":"0ms"}
@fgelcer it seems that it IS the issue
{"L":"ERROR","T":"2021-10-18T14:51:26.280Z","N":"rclone","M":"SECURITY call to unexported endpoint [path=sync/movedir, ip=10.0.2.14:48614]","S":"github.com/scylladb/go-log.Logger.log\n\tgithub.com/scylladb/go-log@v0.0.6/logger.go:1>
{"L":"INFO","T":"2021-10-18T14:51:26.280Z","N":"http","M":"POST /agent/rclone/sync/movedir?_async=true","from":"10.0.2.14:48614","status":404,"bytes":37,"duration":"0ms"}
@ShlomiBalalis , please find out why we have a version mismatch between SM and the agents
The problem is that we don't use the manager unified build, but the old build path, which hasn't been updated for a while. I was about to change it to the unified build, but I have a small issue with it - https://github.com/scylladb/scylla-pkg/issues/2489
Once that issue is closed, I'll make the change
@mmatczuk @ShlomiBalalis @yaronkaikov , the issue mentioned here was closed some 12 days ago, a test by @soyacz was started some 4 days ago, and it hit failures similar to the ones reported here...
@soyacz , can you please attach the logs here, so they can check whether there are other possible issues around it?
Yes, my errors were similar, but my branch is older than 12 days and I test against 4.5.1. Logs:
db-cluster - https://cloudius-jenkins-test.s3.amazonaws.com/4506f7c8-dbd4-4f2c-a880-9f467c5929af/20211125_190738/db-cluster-4506f7c8.tar.gz
loader-set - https://cloudius-jenkins-test.s3.amazonaws.com/4506f7c8-dbd4-4f2c-a880-9f467c5929af/20211125_190738/loader-set-4506f7c8.tar.gz
monitor-set - https://cloudius-jenkins-test.s3.amazonaws.com/4506f7c8-dbd4-4f2c-a880-9f467c5929af/20211125_190738/monitor-set-4506f7c8.tar.gz
sct-runner - https://cloudius-jenkins-test.s3.amazonaws.com/4506f7c8-dbd4-4f2c-a880-9f467c5929af/20211125_190738/sct-runner-4506f7c8.tar.gz
versions found in logs:
l:742 c:sdcm.mgmt.cli p:INFO > Initiating Scylla-Manager, version: [['Client version: 2.6-dev-0.20211115.6fbf87ea9e8-SNAPSHOT-6fbf87ea9e8'], ['Server version: 2.6-dev-0.20211115.6fbf87ea9e8-SNAPSHOT-6fbf87ea9e8']]
p:INFO > Node longevity-many-clients-4h-longevit-db-node-4506f7c8-3 [18.202.55.68 | 10.0.1.33] (seed: False): node longevity-many-clients-4h-longevit-db-node-4506f7c8-3 has scylla-manager-agent version 666.dev-0.20210518.f85e9296
and the log line with the issue:
10.0.2.236: keyspace1.standard1: upload snapshot: copy "data:keyspace1/standard1-d37a65704deb11ec98f1474b5a6a2571/snapshots/sm_20211125125501UTC" to "s3:manager-backup-tests-us-east-1/backup/sst/cluster/8a2bf3e0-5da7-4e4a-876e-576fd013b27d/dc/eu-west/node/cc1eee08-b353-4020-a874-b4db60aafd47/keyspace/keyspace1/table/standard1/d37a65704deb11ec98f1474b5a6a2571": agent [HTTP 404] Not found;
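In this case the 404 comes from the agent endpoint, not from S3, but it can still be useful to confirm whether the SSTables actually reached the bucket. A sketch that lists the prefix taken from the error above, assuming boto3 and AWS credentials with read access to the bucket:

```python
import boto3

# Bucket and prefix copied from the "s3:..." remote path in the error above.
BUCKET = "manager-backup-tests-us-east-1"
PREFIX = ("backup/sst/cluster/8a2bf3e0-5da7-4e4a-876e-576fd013b27d/"
          "dc/eu-west/node/cc1eee08-b353-4020-a874-b4db60aafd47/"
          "keyspace/keyspace1/table/standard1/d37a65704deb11ec98f1474b5a6a2571/")

s3 = boto3.client("s3")
resp = s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX, MaxKeys=10)
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])
print("objects found:", resp.get("KeyCount", 0))
```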
Last time you had a version mismatch and you were hitting non-existent endpoints, hence the 404.
The problem is that we don't use the manager unified build, but the old build path, which hasn't been updated for a while. I was about to change it to the unified build, but I have a small issue with it - scylladb/scylla-pkg#2489
Once that issue is closed, I'll make the change
@ShlomiBalalis , it seems like something is still broken, could you check please?
@fgelcer from the NEW logs, check the DATE.
2021-11-25T12:18:20+00:00 longevity-many-clients-4h-longevit-db-node-4506f7c8-2 !INFO | scylla-manager-agent[20283]: {"L":"INFO","T":"2021-11-25T12:18:20.521Z","M":"Scylla Manager Agent","version":"666.dev-0.20210518.f85e9296","pid"
:20283,"_trace_id":"025oCeQISw6zN4jjvTvuZw"}
@mmatczuk , what about the date? It ran 5 days ago, and by then the fix for scylladb/scylla-pkg#2489 was supposed to be in already, no?
The version is 666.dev-0.20210518.f85e9296
@soyacz Are you rebased with master? The fix was not backported: https://github.com/scylladb/scylla-cluster-tests/pull/4099/
@soyacz was not rebased. I believe we can close this issue.
When running on region eu-north-1 and the bucket is on us-east-1. The test where we're seeing it now is longevity-200gb-48h, and the backup nemesis ran twice (and failed twice with the same error). From the manager logs:
Full logs can be found here:
db-cluster - https://cloudius-jenkins-test.s3.amazonaws.com/5db72bed-ee4c-4071-842d-5ba37f9c73a4/20211020_145519/db-cluster-5db72bed.tar.gz
monitor-set - https://cloudius-jenkins-test.s3.amazonaws.com/5db72bed-ee4c-4071-842d-5ba37f9c73a4/20211020_145519/monitor-set-5db72bed.tar.gz