scylladb / scylla-jmx

Scylla JMX proxy
GNU Affero General Public License v3.0
28 stars 51 forks source link

Some requests break when there is a table whose name has a colon #226

Closed nyh closed 7 months ago

nyh commented 8 months ago

Normally, CQL table names may only have alphanumeric characters or underscores. The DynamoDB API also adds dash and dot as allowed characters. But when Alternator creates a GSI or LSI, the materialized view name will also contain a ":" or "!" characters, respectively.

It turns out that scylla-jmx has a bug when a table contains the colon character: In src/main/java/org/apache/cassandra/db/ColumnFamilyStore.java there is a function doCheck() there which loops over all tables returned by the /column_family REST API. This function calls thegetName() function and croaks when the table name as a colon. What this function does is:

    private static ObjectName getName(String type, String keyspace, String name) throws MalformedObjectNameException {
        return new ObjectName(
                "org.apache.cassandra.db:type=" + type + ",keyspace=" + keyspace + ",columnfamily=" + name);
    }

This is wrong if the given name may contain a colon. As https://docs.oracle.com/javase/8/docs/api/javax/management/ObjectName.html explains:

Each value associated with a key is a string of characters that is either unquoted or quoted. An unquoted value is a possibly empty string of characters which may not contain any of the characters comma, equals, colon, or quote.

So we need to quote the column family name in the above string, according to the quoting rules explained in https://docs.oracle.com/javase/8/docs/api/javax/management/ObjectName.html if it may contain any of these characters. Currently, only the colon is actually possible.

This issue cause a bug for nodetool users: Whenever there was an Alternator GSI somewhere in the database, nodetool info and nodetool tablestats stopped working, because both enumerate all the tables and come across this bug in their JMX request.

nyh commented 8 months ago

Unfortunately, this opened a Pandora's box :-(

The "obvious" change:

     private static ObjectName getName(String type, String keyspace, String name) throws MalformedObjectNameException {
+        // The quote() call below is necessary for the table name because it might contain
+        // a colon (see https://github.com/scylladb/scylla-jmx/issues/226), but better be
+        // safe than sorry, so we use it on all values in the ObjectName string below,
+        // as required by the ObjectName documentation. We don't bother to quote "type"
+        // because we know it came from our code.
         return new ObjectName(
-                "org.apache.cassandra.db:type=" + type + ",keyspace=" + keyspace + ",columnfamily=" + name);
+                "org.apache.cassandra.db:type=" + type + ",keyspace=" + ObjectName.quote(keyspace) + ",columnfamily=" + ObjectName.quote(name));
     }

Doesn't work on its own because it turns out that when ObjectName receives a quoted string, it keeps it quoted and ObjectName::getKeyProperty() then returns it with the quotes, which confuses this code. So we also need to unquote it:

     public ColumnFamilyStore(APIClient client, ObjectName name) {
-        this(client, name.getKeyProperty("type"), name.getKeyProperty("keyspace"), name.getKeyProperty("columnfamily"));
+        // The keyspace and table values were quote()ed when creating the Objectname (see
+        // getName()), and strangely name.getKeyProperty() returns them still quoted, so
+        // we need to unquote them back here.
+        this(client, name.getKeyProperty("type"), ObjectName.unquote(name.getKeyProperty("keyspace")), ObjectName.unquote(name.getKeyProperty("columnfamily")));
     }

After both changes, "nodetool info" works will with or without GSI. But "nodetool tablestats" stops working, even without GSI. nodetool (not JMX) now fails with:

E               stderr: error: org.apache.cassandra.metrics:type=Table,keyspace="alternator_some_long_table_name_abc",scope=some_long_table_name_abc,name=WriteLatency

Apparently something in this patch causes nodetool to receive the list of keyspace names (but not table names!) quoted, and when it passes it quoted to JMX, it doesn't work because of the broken handling of quoted strings in ObjectName where a quoted string is different from unquoted. I tried to figure out how to fix it, but haven't managed to, yet.

This is turning out to be a complicated project, investigating ugly areas of Java that I wish I never have encountered ;-)

nyh commented 8 months ago

Quoting only the table name, not the keyspace name, makes the two tests I'm trying (for nodetool info and notetool tablestats) work as expected, but I don't know if it won't break other things in a similar way. Even more sadly, it now appears that after the JMX problems are fixed, there is now also a Nodetool problem, when Nodetool attempts to access a table with a colon in its name:

E       javax.management.MalformedObjectNameException: Invalid character  ':' in value part of property
E               at java.management/javax.management.ObjectName.construct(ObjectName.java:622)
E               at java.management/javax.management.ObjectName.<init>(ObjectName.java:1407)
E               at org.apache.cassandra.tools.NodeProbe.getColumnFamilyMetric(NodeProbe.java:1463)

(again, note that this is a nodetool error, not JMX, so it is different from what is being discussed in this JMX issue).

elcallio commented 8 months ago

If keyspace/table names contain ":", does that not break the scylla API? It literally uses ":" as keyspace:table separator?

nyh commented 8 months ago

@elcallio indeed, CQL does not allow ":" in a table name. A colon is allowed in a quoted identifier, but such identifier cannot be used to name a table - CQL will refuse it. However, to make a long story short, Alternator uses ":" in the name of materialized views (maybe this wasn't a good idea - considering we have just one table in each keyspace we had better options - but that's what we did). You never access those views via CQL, so it doesn't matter that they have this weird character name. Internally, when storing sstables and what not, it's fine to have this ":" in the table name.

What's frustrating in this issue is that as soon as you have one Alternator GSI, some general stuff like "nodetool info" or "nodetool tablestats" stop working. It's worse then having problems when you just try to do something with the specific GSI.

elcallio commented 8 months ago

I don't mean CQL API. I mean the http REST API, which uses ks:cf as table identifier (e.g. /column_family/compaction_strategy/<ks>:<cf>). Obviously, as long as only table name, not keyspace name contains ":", the code splitting that parses this probably sort of works, but it is still a little fishy...

nyh commented 8 months ago

I don't mean CQL API. I mean the http REST API, which uses ks:cf as table identifier (e.g. /column_family/compaction_strategy/<ks>:<cf>). Obviously, as long as only table name, not keyspace name contains ":", the code splitting that parses this probably sort of works, but it is still a little fishy...

Yes, this is my understanding as well: The REST API which uses a colon like you said (I don't know why, but I guess for historic reasons) splits ks and cf on the first colon, so as long as the keyspace name can't contain a colon (and currently, it can't) we're safe on that front.