Internal reusable CQL statements should be prepared

psarna commented 2 years ago

Our driver issues a bunch of internal CQL statements periodically, e.g. to fetch topology and schema information. These statements are currently not prepared, which forces the database to meticulously parse their CQL over and over, which is a waste of CPU. The solution is to rewrite all internal statements so that they're prepared once, and then executed. We have a helper function used mainly for internal queries - query_all. Ideally, all call sites of that function should be translated to execute_all.

Example: https://github.com/scylladb/scylla-rust-driver/blob/2cf503bdcdc67a9e7a25fcd2c572737530750628/scylla/src/transport/topology.rs#L322

Ref: https://github.com/scylladb/scylla/issues/10225

Ponewor commented 2 years ago

It is not as easy as it may seem. The simplest approach would go like this:

Refactor all the functions doing queries into methods (this step applies only to schema and peers fetching functions from topology.rs).
For each query executed in a method:
1. add a respective field for a PreparedStatement in the parenting struct,
2. get a hand on anything than can actually do the preparing part of the job in the constructor,
3. prepare it in the constructor,
4. modify the method to use this statement.

Here I'm going to discuss some issues with this approach but I believe that at least some of them will still be relevant even if someone more clever than me comes out with a better solution. For the sake of clarity the parenting structs we will be dealing with are MetadataReader, Session, Connection. They may not be the only ones involved (but probably are) - my search was not too thorough, but for now let us focus our attention on them.

There are some issues with step 2.ii.:

For Session and Connection the only reasonable objects than can do the preparing are Session and Connection themselves respectively. But of course we cannot use their respective prepare methods before they get constructed which cannot be done without providing prepared values etc. etc.

So how to get out of this mess?

One way to do it would obviously be to store the statements as Option<PreparedStatement> and fill the value sometime later. Even if we ignore the glaring question - "when is sometime later?" and the obvious bloatiness of a code filled with Options there's still another, maybe a bit philosophical problem - aren't the structures that are not able to perform their internal queries so highly disfunctional that we cannot let them exist? What is the actual meaning of such an object with its internal queries equal to None? Maybe I am wrong and these are not problems at all but for me it feels like such a hack is unacceptable.

There is also another idea. We can simply delegate the preparing to some other object that we can access in the constructor by some refactor. This works great with Session. We can move the preparing logic from Session::prepare to the newly made Cluster::prepare and just call the latter from the former (which is probably a worthy refactor by its own right). Then we can simply do the preparing in the constructor using the Cluster object we have access to there. Very well. But this cannot be done for the Connection - there is nothing we can delegate this to (actually the only idea I see at the moment is to do some disgusting duplication of the Connection::prepare logic in the constructor).

Ok, so with that out of the way there's still the MetadataReader. In the constructor we have a NodeConnectionPool so it gives us a natural candidate for the preparing job - we can simply pick a random Connection from the pool and we're done. The thing is that this pool is empty, at least at the beginning - the tests are complaining. So this way we probably get back to the daunting idea of storing statements as Option<PreparedStatement> but if so then I would need answers to some questions like: when is MetadataReader's control_connection_pool filled? how can we make sure it is not empty?

There is an issue with step 2.iv:

The method that executes the queries is Session is try_getting_tracing_info and it modifies queries' consistencies. So this would require passing self as a &mut if we want to change queries into statements held in the Session. In result public methods such as get_tracing_info and get_tracing_info_custom would also need to get &mut self and this is an API breaking change. Even if we had no problem with that, it is still simply weird and counterintuitive that such an innocent-looking call like session.get_tracing_info() actually mutates the session.

piodul commented 2 years ago

It is not as easy as it may seem. The simplest approach would go like this:

Refactor all the functions doing queries into methods (this step applies only to schema and peers fetching functions from topology.rs).

For each query executed in a method:

add a respective field for a PreparedStatement in the parenting struct,

get a hand on anything than can actually do the preparing part of the job in the constructor,

prepare it in the constructor,

modify the method to use this statement.

Here I'm going to discuss some issues with this approach but I believe that at least some of them will still be relevant even if someone more clever than me comes out with a better solution. For the sake of clarity the parenting structs we will be dealing with are MetadataReader, Session, Connection. They may not be the only ones involved (but probably are) - my search was not too thorough, but for now let us focus our attention on them.

There are some issues with step 2.ii.:

For Session and Connection the only reasonable objects than can do the preparing are Session and Connection themselves respectively. But of course we cannot use their respective prepare methods before they get constructed which cannot be done without providing prepared values etc. etc.

So how to get out of this mess?

One way to do it would obviously be to store the statements as Option<PreparedStatement> and fill the value sometime later. Even if we ignore the glaring question - "when is sometime later?" and the obvious bloatiness of a code filled with Options there's still another, maybe a bit philosophical problem - aren't the structures that are not able to perform their internal queries so highly disfunctional that we cannot let them exist? What is the actual meaning of such an object with its internal queries equal to None? Maybe I am wrong and these are not problems at all but for me it feels like such a hack is unacceptable.

You can prepare the statements needed for fetching tracing info in Session::connect, right after the session object is initialized. We are already calling Session::use_keyspace there before returning. You can both prepare and use_keyspace in parallel.

There is also another idea. We can simply delegate the preparing to some other object that we can access in the constructor by some refactor. This works great with Session. We can move the preparing logic from Session::prepare to the newly made Cluster::prepare and just call the latter from the former (which is probably a worthy refactor by its own right). Then we can simply do the preparing in the constructor using the Cluster object we have access to there. Very well. But this cannot be done for the Connection - there is nothing we can delegate this to (actually the only idea I see at the moment is to do some disgusting duplication of the Connection::prepare logic in the constructor).

I think we should keep the logic of prepare in Session. This method doesn't sound like Cluster's responsibility, which just maintains connections and the metadata.

Ok, so with that out of the way there's still the MetadataReader. In the constructor we have a NodeConnectionPool so it gives us a natural candidate for the preparing job - we can simply pick a random Connection from the pool and we're done. The thing is that this pool is empty, at least at the beginning - the tests are complaining. So this way we probably get back to the daunting idea of storing statements as Option<PreparedStatement> but if so then I would need answers to some questions like: when is MetadataReader's control_connection_pool filled? how can we make sure it is not empty?

I think that preparing the statements right after (re)initializing control_connection_pool is the way to go for internal queries in the metadata reader. You can use .wait_until_initialized() to make sure that the pool has attempted to establish the connection - if the node is down/unreachable the pool will be empty and it is OK to fail and check another node, otherwise the pool should have a live connection.

There is an issue with step 2.iv:

The method that executes the queries is Session is try_getting_tracing_info and it modifies queries' consistencies. So this would require passing self as a &mut if we want to change queries into statements held in the Session. In result public methods such as get_tracing_info and get_tracing_info_custom would also need to get &mut self and this is an API breaking change. Even if we had no problem with that, it is still simply weird and counterintuitive that such an innocent-looking call like session.get_tracing_info() actually mutates the session.

All methods of Session must pass self as immutable reference, otherwise they will not be very useful. The Session is supposed to be shared between many tasks/threads. If you used &mut self then one would have to use RwLock<Session> which would suck because getting tracing info would prevent other threads from using the session during that time.

It's better to clone the prepared statement and set the consistency for the new statement. It's not perfect because it requires cloning, but doesn't require mutual exclusion, so it scales better.

akoshchiy commented 1 year ago

@psarna @Ponewor @piodul Hi! I'm looking for some rust practice and want to try to fix it. I've found, that there is the CachingSession, which internally use a cache for prepared statements (although I haven't found any usages of it, i guess it's using for client requests). Maybe is it better to extract the cache and use same struct for a both client and internal requests? Instances of the client and internal cache could be different.

piodul commented 1 year ago

Hi @akoshchiy,

Maybe is it better to extract the cache and use same struct for a both client and internal requests?

It would make sense to do it for the queries sent on the control connection. The slight overhead is not important there and it will be trivial to adapt the existing code that sends internal unprepared queries.

You could try to extract some code from the CachingSession and introduce a new CachingConnection that would be used in topology.rs file. It would only have to support execute and execute_iter, no need for batch.

Note that Connection::query_all mentioned in the issue description has been renamed to Connection::query_iter.

Instances of the client and internal cache could be different.

There are not too many types of internal queries, so it's perfectly fine and will probably simplify the implementation.

However, it's also worth mentioning that, while the issue's description was mainly about topology.rs, there are some other queries that are sent internally which should also be adapted - for example, in Session::try_getting_tracing_info. Those queries may potentially be sent very frequently, so for those I'd rather avoid having a hashmap lookup. For those, another abstraction would be useful - something like a single-element cache that prepares the statement on first use.

scylladb / scylla-rust-driver

Internal reusable CQL statements should be prepared #417