Alternator new keyspace names is breaking all kind of nodetool/manager operations

fruch commented 4 years ago

Installation details
Scylla version (or git commit hash): 666.development-0.20200222.4e95b67501c
Cluster size: 6
OS (RHEL/CentOS/Ubuntu/AWS AMI): ami-012fcd7ac5f0e1471

Summary

Since the Alternator code creates a keyspace per table, with a name like a#usertable, all kinds of nodetool operations have started failing like this:

DisruptionEvent Severity.ERROR): type=end name=ShowTopPartitions node=Node alternator-3h-alternat-db-node-c76021e2-5 [52.212.138.124 | 10.0.234.176] (seed: False) duration=1 error=Encountered a bad command exit code!

Command: 'nodetool  cfstats '

Exit code: 1

Stdout:

nodetool: Scylla API server HTTP GET to URL '/column_family/metrics/write_latency/moving_average_histogram/a#usertable:usertable' failed: Column family 'a%23usertable:usertable' not found
See 'nodetool help' or 'nodetool help <command>'.

I think the hash is a bit problematic for a keyspace name, and isn't handled correctly across the board.

On nodetool cfstats

DisruptionEvent Severity.ERROR): type=end name=ShowTopPartitions node=Node alternator-3h-alternat-db-node-c76021e2-5 [52.212.138.124 | 10.0.234.176] (seed: False) duration=1 error=Encountered a bad command exit code!

Command: 'nodetool cfstats '

Exit code: 1

Stdout:

nodetool: Scylla API server HTTP GET to URL '/column_family/metrics/write_latency/moving_average_histogram/a#usertable:usertable' failed: Column family 'a%23usertable:usertable' not found
See 'nodetool help' or 'nodetool help <command>'.

On nodetool cleanup

DisruptionEvent Severity.ERROR): type=end name=Decommission node=Node alternator-3h-alternat-db-node-c76021e2-2 [3.248.212.66 | 10.0.220.251] (seed: False) duration=460 error=Encountered a bad command exit code!

Command: 'nodetool  cleanup a#usertable'

Exit code: 2

Stdout:

Stderr:

    at sun.rmi.transport.Transport.serviceCall(Transport.java:196)
    at sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:573)
    at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:834)
    at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.lambda$run$0(TCPTransport.java:688)
    at java.security.AccessController.doPrivileged(Native Method)
    at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:687)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

On manager backup call

14:17:05  < t:2020-02-20 12:17:04,773 f:sct_events.py   l:714  c:sdcm.sct_events      p:INFO  > Command: 'sudo sctool backup -c cbadbfcf-0375-4f68-836c-657ca136f1cb --location s3:manager-backup-tests-eu-west-1'
14:17:05  < t:2020-02-20 12:17:04,773 f:sct_events.py   l:714  c:sdcm.sct_events      p:INFO  > 
14:17:05  < t:2020-02-20 12:17:04,773 f:sct_events.py   l:714  c:sdcm.sct_events      p:INFO  > Exit code: 1
14:17:05  < t:2020-02-20 12:17:04,773 f:sct_events.py   l:714  c:sdcm.sct_events      p:INFO  > 
14:17:05  < t:2020-02-20 12:17:04,773 f:sct_events.py   l:714  c:sdcm.sct_events      p:INFO  > Stdout:
14:17:05  < t:2020-02-20 12:17:04,773 f:sct_events.py   l:714  c:sdcm.sct_events      p:INFO  > 
14:17:05  < t:2020-02-20 12:17:04,773 f:sct_events.py   l:714  c:sdcm.sct_events      p:INFO  > 
14:17:05  < t:2020-02-20 12:17:04,773 f:sct_events.py   l:714  c:sdcm.sct_events      p:INFO  > 
14:17:05  < t:2020-02-20 12:17:04,773 f:sct_events.py   l:714  c:sdcm.sct_events      p:INFO  > Stderr:
14:17:05  < t:2020-02-20 12:17:04,773 f:sct_events.py   l:714  c:sdcm.sct_events      p:INFO  > 
14:17:05  < t:2020-02-20 12:17:04,773 f:sct_events.py   l:714  c:sdcm.sct_events      p:INFO  > Error: failed to create backup target: keyspace a#usertable: get ring description: agent [HTTP 400] Keyspace a%23usertable Does not exist
14:17:05  < t:2020-02-20 12:17:04,773 f:sct_events.py   l:714  c:sdcm.sct_events      p:INFO  > Trace ID: 3saGYMhzTRipb1yF2M8vbg (grep in scylla-manager logs)

psarna commented 4 years ago

My original idea was that the # character is not strictly allowed by CQL, so CQL users wouldn't be able to create Alternator keyspaces, but that idea is flawed. Tomorrow I'll prepare a patch that changes the prefix to something standard, e.g. "alternator_". If a CQL user creates a keyspace with a conflicting name, the conflict will be resolved like any other: by refusing to create a keyspace that already exists.
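As a rough illustration of why a plain prefix sidesteps the problem, the sketch below compares the two naming schemes. It is not Alternator's actual code: `keyspace_name` and `needs_percent_encoding` are hypothetical helpers, and the only library call is Python's standard `urllib.parse.quote`.

```python
from urllib.parse import quote


def keyspace_name(table: str, prefix: str) -> str:
    """Hypothetical helper: build the keyspace name for an Alternator table."""
    return prefix + table


def needs_percent_encoding(name: str) -> bool:
    """Return True if the name changes when used as a URL path component."""
    # safe="" makes quote() escape every character outside the unreserved set
    # (letters, digits, "_", ".", "-", "~").
    return quote(name, safe="") != name


old = keyspace_name("usertable", "a#")
new = keyspace_name("usertable", "alternator_")

print(old, needs_percent_encoding(old))
print(new, needs_percent_encoding(new))
```

Under these assumptions, only the "a#" form has to be percent-encoded by an HTTP client, so only that form depends on the server decoding it correctly.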

fruch commented 4 years ago

Do we also want to lock some other operations on those tables, like altering their schema or adding/removing columns?

nyh commented 4 years ago

There are two aspects to this issue:

First, there appears to be a bug in the REST API code (CC: @amnonh), that deserves a separate issue in the Seastar bug tracker. The "%23" is URL encoding, which a conforming HTTP server is supposed to decode in the appropriate places between URL components (see https://tools.ietf.org/html/rfc3986#section-2.4). The HTTP server should have translated "%23" back to "#" (# is 0x23 ASCII), and the operation should have worked.

Second, given that there are bugs in this area, we should consider again if we really need this special character. I think we don't. I'm in favor of @psarna's patch to change the "a#" prefix to "alternator_". Any prefix whatsoever is already good enough to prevent a DynamoDB API user from colliding with Scylla's system tables and other reserved keyspace names. I don't think it's important to prevent CQL users from reaching DynamoDB tables, especially given that many of our tools are CQL-based.

Finally, even if we change the prefix, note that we also use the ":" and "!" characters in Alternator's GSI tables (materialized views), so if we don't fix the percent-decoding bug in Seastar's HTTP server, we'll probably have problems with those tables too.
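The percent-decoding point can be sketched in a few lines. This is only a model of the behavior, not Seastar's or Scylla's code: `TABLES` and `lookup` are hypothetical stand-ins, and the assumption that the client escapes "#" but leaves ":" alone is inferred from the error message in the report.

```python
from urllib.parse import quote, unquote

# Hypothetical stand-in for the server's set of known "keyspace:table" names.
TABLES = {"a#usertable:usertable"}


def lookup(path_component: str, decode: bool) -> bool:
    """Look up a URL path component, optionally percent-decoding it first."""
    name = unquote(path_component) if decode else path_component
    return name in TABLES


# Assumption: the client escapes "#" but leaves ":" as-is, which matches the
# 'a%23usertable:usertable' string seen in the error message.
encoded = quote("a#usertable:usertable", safe=":")

found_without_decoding = lookup(encoded, decode=False)
found_with_decoding = lookup(encoded, decode=True)

# "!" is also outside the unreserved set, so a strict client escapes it too.
encoded_bang = quote("!", safe="")

print(encoded, found_without_decoding, found_with_decoding, encoded_bang)
```

In this model the lookup fails only when the server skips the decoding step, which is consistent with the "not found" errors quoted in the report.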

nyh commented 4 years ago

I opened a Seastar issue - https://github.com/scylladb/seastar/issues/725

nyh commented 4 years ago

Committed @psarna's patch, "https://github.com/scylladb/seastar/issues/725" to next. When it reaches master, this issue will close.

nyh commented 4 years ago

No backports needed - this fixes a temporary situation that never reached any release.

amoskong commented 4 years ago

I can still see the manager backup fail in the latest longevity run:

scylla-4.1.rc0-0.20200520.a1c15f06902.x86_64
job: https://jenkins.scylladb.com/view/scylla-4.1/job/scylla-4.1/job/longevity/job/longevity-2tb-4days-1Dis-2NonDis-Nemesises/2/

ManagementBackup Nemesis fail:

Command: 'sudo sctool backup -c 654d1897-58ee-4cc5-b58f-ca63e4dd6ff3 --location s3:manager-backup-tests-eu-west-1 --rate-limit 30'
Exit code: 1
Stdout:
Stderr:
Error: failed to create backup target: location is not accessible
 10.0.167.169: dial tcp 10.0.167.169:10001: connect: connection refused
Trace ID: s69s_E9RQ2mCqxYURDQPWw (grep in scylla-manager logs)

$ rpm -qa |grep scylla
scylla-conf-4.1.rc0-0.20200520.a1c15f06902.x86_64
scylla-jmx-4.1.rc0-20200520.ee72ec22888.noarch
scylla-debuginfo-4.1.rc0-0.20200520.a1c15f06902.x86_64
scylla-server-4.1.rc0-0.20200520.a1c15f06902.x86_64
scylla-tools-4.1.rc0-20200520.cf56d9273ab.noarch
scylla-machine-image-4.1.rc0-20200520.a790560ae2d.noarch
scylla-python3-3.7.7-0.20200520.a1c15f06902.x86_64
scylla-tools-core-4.1.rc0-20200520.cf56d9273ab.noarch
scylla-4.1.rc0-0.20200520.a1c15f06902.x86_64
scylla-manager-agent-2.0.2-0.20200401.ab6c6b96.x86_64
scylla-kernel-conf-4.1.rc0-0.20200520.a1c15f06902.x86_64

fruch commented 4 years ago

@amoskong why do you think it's related to this issue?

amoskong commented 4 years ago

@amoskong why do you think it's related to this issue?

I saw the management backup fail in comment 0. It seems the error isn't exactly the same, so it should be a different issue.

I will check if it's a new scylla issue.

nyh commented 6 months ago

Committed @psarna's patch, "scylladb/seastar#725" to next. When it reaches master, this issue will close.

I should not have closed this issue. The "fix" didn't fix the Seastar bug (https://github.com/scylladb/seastar/issues/725); it only avoided it by no longer using the "#" character for all Alternator tables. But we still hit the same bug when trying to look at Alternator tables with the "!" character (which we used for LSI).

So I'm reopening this issue. We need to really fix the Seastar bug, and only close this issue with a Seastar update that includes that fix.

nyh commented 6 months ago

Sent Seastar PR to fix the bug: https://github.com/scylladb/seastar/pull/2162
