Trilogy is a client library for MySQL-compatible database servers, designed for performance, flexibility, and ease of embedding.

[Ruby] Add a Pipelining API? #189

Open casperisfine opened 3 months ago

casperisfine commented 3 months ago

Context

Since the MySQL server is generally hosted on a different machine than the application, an important performance factor is the number of network roundtrips you have to perform.

e.g. if you have a 1ms round trip time and 3 queries to perform in a transaction, you'll have to do 5 full roundtrips (BEGIN, the 3 queries, then COMMIT) for a total of 5ms. Sometimes there is no way around this because you need the result of the previous query to perform the next one, but sometimes you don't.

The classic technique for minimizing the cost of roundtrips is to send multiple queries in a row without immediately waiting for each response, a.k.a. pipelining.


Existing Solution with Trilogy

Right now some people are essentially doing poor man's pipelining by enabling MULTI_STATEMENTS and concatenating their queries together. It works, but it's wonky and a bit unsafe: MULTI_STATEMENTS is off by default for a good reason, as keeping it off prevents various injection vulnerabilities.
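
For illustration, a minimal sketch of that workaround, assuming Trilogy's multi_statement connection option and its more_results_exist? / next_result methods:

require "trilogy"

conn = Trilogy.new(host: "127.0.0.1", port: 3306, username: "root", multi_statement: true)

# Both statements travel to the server in a single roundtrip.
result_1 = conn.query("SELECT 1; SELECT 2")
result_2 = conn.next_result if conn.more_results_exist?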

Existing Solution with other DB clients

Mysql2

The mysql2 gem has some limited support for pipelining, which they refer to as async: https://github.com/brianmario/mysql2?tab=readme-ov-file#async

It's a bit wonky because it only lets you queue a single query and then block on async_result, but I suppose it works.
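
Roughly, per the mysql2 README linked above:

require "mysql2"

client = Mysql2::Client.new(host: "127.0.0.1", username: "root")

# Send the query without waiting for the response...
client.query("SELECT SLEEP(1)", async: true)
# ...optionally do other work here, then block until the result arrives.
result = client.async_result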

PG

The pg gem has had some support for a pipeline API since PostgreSQL 14 (2021). I haven't tried it, and it seems a bit wonky, but it's the same idea: you start a pipeline with #enter_pipeline_mode, queue some queries in that pipeline, and can flush the pipelined queries whenever you want so they are all sent at once.

Here's the spec for it in the gem: https://github.com/ged/ruby-pg/blob/d072b21852865ecb84e6345df11d68eed50702bb/spec/pg/connection_spec.rb#L2009-L2122
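
Based on that spec, a minimal sketch (assuming libpq >= 14 and a recent pg gem):

require "pg"

conn = PG.connect(dbname: "test")

conn.enter_pipeline_mode
conn.send_query_params("SELECT 1", [])
conn.send_query_params("SELECT 2", [])
conn.pipeline_sync            # flush: both queries go out together

result_1 = conn.get_result    # result of the first query
conn.get_result               # nil, marks the end of the first result set
result_2 = conn.get_result    # result of the second query
conn.get_result               # nil
conn.get_result               # PGRES_PIPELINE_SYNC marker
conn.exit_pipeline_mode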

Other databases

While pipelining isn't very commonly used with RDBMSes, it has been used extensively for years with Redis, via redis-rb and redis-client.

redis-client only offers a simple block-based API, and returns all the results at once in an Array:

result_1, result_2 = client.pipelined do |pipeline|
  pipeline.call("COMMAND 1")
  pipeline.call("COMMAND 2")
end

redis-rb offers some more convenience by also having pipelined queries return a Future object:

future_1, future_2 = nil
result_1, result_2 = client.pipelined do |pipeline|
  future_1 = pipeline.command1
  future_2 = pipeline.command2
end
future_1.value == result_1
future_2.value == result_2

Some older versions of redis-rb also had a "queue" API, which I deprecated:

client.queue(:command_1)
client.queue(:command_2)
result_1, result_2 = client.commit

Use case

As mentioned in the introduction, the use case for such an API would be to optimize some specific cases, particularly around transactions. If your queries aren't in a transaction, you may simply want to perform them concurrently, each on its own connection, but as soon as transactions are involved you can't do this anymore.

One specific place where I'd like to use this is in Active Record's transaction code. Right now Active Record's BEGIN queries are lazy: they are only sent when the first query of the transaction is triggered.

If Trilogy had such a pipelining API, we could send the BEGIN and the first query of the transaction in a pipeline, and save one roundtrip per transaction.
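
To illustrate the shape this could take, here's a purely hypothetical block-based API in the style of redis-client (the pipelined method below does not exist in Trilogy):

# Hypothetical API sketch — not an existing Trilogy method.
results = conn.pipelined do |pipeline|
  pipeline.query("BEGIN")
  pipeline.query("SELECT * FROM posts WHERE id = 1")
end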

FYI @kirs @matthewd @jhawthorn @composerinteralia @adrianna-chang-shopify @eileencodes, any opinions / concerns about introducing such a capability? I'm not set on any specific API design, and I'm happy to explore alternatives.

casperisfine commented 4 weeks ago

I opened https://github.com/trilogy-libraries/trilogy/pull/197 to demonstrate how it's possible to pipeline queries. I went with the simplest possible API, but it might be worth offering something more flexible. As it stands, it would already be useful as a better alternative to multi-statements.

gmac commented 2 weeks ago

This would be hugely beneficial for certain types of GraphQL batching. GraphQL fields are commonly resolved in bulk across a set of records, and sometimes require a unique query per field that goes beyond the capabilities of simple key batching (i.e. select * from my_things where id IN (1, 2, 3)).

As mentioned in the issue, the ideal solution would be concurrency. However, that assumes the infrastructure can deliver warmed connections quickly and can handle the onslaught of every request process opening additional database connections. Realistically we have to use what we have, which is one connection per request, at which point being able to pipeline batches of queries through that one connection and save hundreds of round trips would provide a considerable advantage.