livebook-dev / kino_db

Database integrations for Livebook
Apache License 2.0

Support query composition #24

Open · josevalim opened this issue 2 years ago

josevalim commented 2 years ago

We can make it so a query is "lazy", so it returns a query construct that can be interpolated into other queries in order to build CTEs.

greg-rychlewski commented 1 year ago

I was curious about trying this out. I'm a complete Livebook newb, so I apologize in advance :).

I'm not sure how much you've already planned/decided, but these are some of the questions I was considering:

  1. Do you want to make a new cell or use the existing SQL cell? If using the existing cell, should the user be able to choose whether they save just the result, just the query, or both?
  2. I think this can also be used for subqueries and not just CTEs? I guess it depends on what the UI is (see the next point).
  3. Were you thinking of letting the user interpolate the query into the SQL text, or did you want to have controls at the top saying the query starts with certain CTEs? If you allow interpolation, then you could have subqueries; but with controls at the top, I think it only makes sense for CTEs.
  4. Should the query construct only be accessible from SQL cells?
greg-rychlewski commented 1 year ago

Oh, also one more thing: did you already consider allowing %Ecto.Query{} in the SQL cell?

I'm not totally clear on how this works yet, but I thought maybe the cell could evaluate to a query struct, or the struct could be supplied in the UI as a variable.

josevalim commented 1 year ago

Hi @greg-rychlewski! 👋 Here is what I was thinking:

  1. In the Smart cell, in the database select, we will be able to choose "Composable SQL (CTE)"
  2. If you choose this option, the result of the query will be a Kino.DB.CTE struct with two fields: query and params
  3. Now, if you interpolate a Kino.DB.CTE in another smart cell, we will emit slightly different code. For example, if you do:
SELECT * FROM foo JOIN {{sample_cte}} AS s ON s.id = foo.id WHERE foo.status = {{var}}

It will become something like:

query(
  """
  WITH "sample_cte" AS (#{sample_cte.query})
  SELECT * FROM foo JOIN sample_cte AS s ON s.id = foo.id WHERE foo.status = $2
  """,
  sample_cte.params ++ [var]
)

Some notes:

  1. If someone changes the sample_cte query to add new parameters, then we need to change the parameter position from $2 to $3 and so on. This will be fine while using the smart cell, but if someone converts it to code, they will need to adjust it manually (which is fine IMO)

  2. Each Kino.DB adapter will have to say if they support CTEs or not

  3. We should implement String.Chars for Kino.DB.CTE to raise a clear error message in case someone tries to interpolate it in an unexpected place or in a way we cannot parse upfront (see the sketch below)
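
A minimal sketch of what that could look like, assuming the Kino.DB.CTE struct above (the message wording is illustrative):

defimpl String.Chars, for: Kino.DB.CTE do
  def to_string(_cte) do
    # Interpolating a CTE with #{...} would lose track of its parameters,
    # so fail eagerly with an actionable message.
    raise ArgumentError,
          "cannot convert a Kino.DB.CTE to a string, " <>
            "composable queries can only be referenced as {{name}} inside an SQL smart cell"
  end
end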

@greg-rychlewski does this answer your questions? :)

greg-rychlewski commented 1 year ago

Makes sense, thank you! I'll give it a shot.

greg-rychlewski commented 1 year ago

I started playing around with this. One thing that came up is that each database has different placeholders for parameters: $n for Postgres, ? for MySQL, and ?n for SQLite. This means there is no single way to save the CTE with parameters that won't need modification at query time.

Some ways I can think to handle it:

  1. Do the parameterization at query time. I don't think this would be so great because, if the variable used as the parameter changes after the CTE is saved, it may have unexpected results.
  2. Choose a simple placeholder like ? and then substitute it for the correct one at query time
  3. Make the user choose which database the CTE is for.

I think (3) makes it easier for the dev but is worse UX. (2) seems the best to me, as long as there are no databases that use ? as an operator. Or if that's a concern, then maybe {{ ? }}. Maybe ? is fine though, since it's used for fragments in Ecto.
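
For option 2, the substitution itself could be a simple split on the placeholder. A rough sketch, where Kino.DB.Placeholders and rewrite/2 are hypothetical names (a real version would also need to skip ? inside quoted string literals):

defmodule Kino.DB.Placeholders do
  # Rewrites generic `?` placeholders into the target database's syntax,
  # e.g. rewrite("name = ? limit ?", :postgres) #=> "name = $1 limit $2"
  def rewrite(sql, :postgres), do: substitute(sql, fn i -> "$#{i}" end)
  def rewrite(sql, :sqlite), do: substitute(sql, fn i -> "?#{i}" end)
  # MySQL already uses bare `?`, so the saved query passes through unchanged.
  def rewrite(sql, :mysql), do: sql

  defp substitute(sql, placeholder) do
    sql
    |> String.split("?")
    |> Enum.with_index()
    |> Enum.map_join(fn
      {part, 0} -> part
      {part, i} -> placeholder.(i) <> part
    end)
  end
end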

josevalim commented 1 year ago

@greg-rychlewski We can do option 3 for now and then, instead of having an option in the select to be lazy, we can have a checkbox that makes the query lazy. :)

josevalim commented 1 year ago

@greg-rychlewski actually, ignore me. Even if we use $1, $2, $3 for Postgres, we will have to rewrite those to $3, $4, $5 depending on the number of queries we have, so going with (3) won't make life easier for Postgrex.

josevalim commented 1 year ago

I think we will need to go one level of abstraction higher. For example, we could emit this code:

{[sample_cte_query], params} = Kino.DB.to_ctes([sample_cte])

query(
  """
  WITH "sample_cte" AS (#{sample_cte_query})
  SELECT * FROM foo JOIN sample_cte AS s ON s.id = foo.id WHERE foo.status = $2
  """,
  params ++ [var]
)

or this code:

{ctes, cte_params} = Kino.DB.to_ctes([sample_cte: sample_cte])

query(
  """
  WITH #{ctes}
  SELECT * FROM foo JOIN sample_cte AS s ON s.id = foo.id WHERE foo.status = $2
  """,
  cte_params ++ [var]
)

And then for PG we have a slightly different version called to_indexed_ctes. It depends on how much of the process we want to expose. The internal representation can be simple and mirror Ecto fragments (i.e. a list of {:query | :param, value} tuples), as sketched below.
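
As a rough sketch of that representation, a non-indexed and an indexed variant could look like this (Kino.DB.CTE, to_ctes/1, and to_indexed_ctes/2 are names from this discussion, not an existing API; the parts shape is the {:query | :param, value} list described above):

defmodule Kino.DB.CTE do
  defstruct parts: []
end

defmodule Kino.DB do
  # For `?`-placeholder databases (MySQL, SQLite): emit the body of the
  # WITH clause and collect the parameters in order.
  def to_ctes(named_ctes) do
    {clauses, params} =
      Enum.map_reduce(named_ctes, [], fn {name, %Kino.DB.CTE{parts: parts}}, acc ->
        sql =
          Enum.map_join(parts, fn
            {:query, binary} -> binary
            {:param, _value} -> "?"
          end)

        values = for {:param, value} <- parts, do: value
        {~s["#{name}" AS (#{sql})], acc ++ values}
      end)

    {Enum.join(clauses, ", "), params}
  end

  # For Postgres: same traversal, but placeholders are numbered, so we
  # thread an index through it (offset allows starting above $1).
  def to_indexed_ctes(named_ctes, offset \\ 0) do
    {clauses, {params, _index}} =
      Enum.map_reduce(named_ctes, {[], offset}, fn {name, %Kino.DB.CTE{parts: parts}}, acc ->
        {sql_parts, acc} =
          Enum.map_reduce(parts, acc, fn
            {:query, binary}, acc -> {binary, acc}
            {:param, value}, {params, index} -> {"$#{index + 1}", {params ++ [value], index + 1}}
          end)

        {~s["#{name}" AS (#{Enum.join(sql_parts)})], acc}
      end)

    {Enum.join(clauses, ", "), params}
  end
end

With this shape, the smart cell only needs to pick the right variant per adapter at code-generation time.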

Thoughts?

greg-rychlewski commented 1 year ago

Sorry for the delayed response @josevalim. I think what you proposed is a very nice solution. Do I understand correctly:

Someone creates 2 CTEs (I'm putting the variable declarations together with the queries just to make it easier to write, but I understand they will be in different cells):

name1 = "bob"
name2 = "joe"

select * from users where name = {{ name1 }} limit 1
select * from users where name = {{ name2 }} limit 1

These will get converted into the following structs:

cte1 = %Kino.DB.CTE{parts: [query: "select * from users where name = ", param: "bob", query: " limit 1"]}
cte2 = %Kino.DB.CTE{parts: [query: "select * from users where name = ", param: "joe", query: " limit 1"]}

Then, when interpolated into a new query:

salary_threshold = 100

select * from {{ cte1 }} join {{ cte2 }} where cte1.salary > {{ salary_threshold }}

the %Kino.DB.CTE{} interpolations will be separated out, their queries will be parsed, and the parameters collected (indexed or not, depending on the db):

{ctes, cte_params} = Kino.DB.to_ctes([cte1: cte1, cte2: cte2])

# ctes = """
#   WITH cte1 AS (
#     select * from users where name = $1 limit 1
#   ), cte2 AS (
#    select * from users where name = $2 limit 1
#  )
#
# cte_params = ["bob", "joe"]

query(
  """ 
  #{ctes}
  SELECT * from cte1 join cte2 where cte1.salary > $3,
  """,
  ["bob", "joe", 100]
)

If there is no misunderstanding above, the only other situation I think we need to consider is nested CTEs, because it's pretty common to have one CTE reference another.

I don't think we want to create the WITH ... clause as part of the CTE struct, right? Because it would just have to be rewritten when it's used in an actual query. If that's the case, then I think we need to add a :name field to the Kino.DB.CTE struct, so that when we collect CTEs while parsing the query we have a name to replace them with.

And then there is the situation of duplicate names. It could be the case that a nested CTE is saved under some variable name, and later that variable name is reused for a different CTE, and then both of these are used in the same query, i.e.:

cte1 = select * from users where name = "bob" limit 1
cte2 = select * from {{ cte1 }} limit 1
cte1 = select * from users where name = "dan" limit 2
query = select * from {{ cte2 }} join {{ cte1 }}

So it's a bit of a weird situation, because it's fine to reuse the name as long as the query is the same. I can think of 2 choices:

  1. Caution the user that variable names are used to create the CTE names so be careful
  2. If a duplicate name is found, compare the queries: if they are equal, then it's ok; if they are not equal, then raise? Or keep a tally of how many times the CTE name was used and append it to the name to create a new one? I'm not sure this situation is worth adding that complexity to the code; if it is worth handling at all, raising is probably the better option.
josevalim commented 1 year ago

Great points about nested CTEs! The name is defined in an input of the smart cell, so we should be able to forbid redefinitions or error/warn in said cases!

josevalim commented 1 year ago

About nested CTEs, your proposed structure needs to change a tiny bit but otherwise it is spot on. Instead of this:

cte1 = %Kino.DB.CTE{parts: [query: "select * from users where name = ", param: "bob", query: " limit 1"]}

We will need to have something like:

query: binary | param: {value, maybe_cte_name}

Where maybe_cte_name is the variable name of the interpolation in case it might be a CTE.

Then:

{ctes, cte_params} = Kino.DB.to_ctes([cte1: cte1, cte2: cte2])

can recursively traverse and handle any nested CTE.
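
A hedged sketch of what that recursive traversal could look like, assuming the {value, maybe_cte_name} parts shape above and folding in the duplicate-name check from the earlier comment (all module and function names here are hypothetical):

defmodule Kino.DB.CTE do
  defstruct [:name, parts: []]
end

defmodule Kino.DB do
  # Depth-first collection: a CTE referenced by another CTE is emitted
  # before the CTE that references it, and each name is emitted once.
  def collect_ctes(named_ctes) do
    named_ctes
    |> Enum.reduce([], fn {name, cte}, acc -> collect_cte(name, cte, acc) end)
    |> Enum.reverse()
  end

  defp collect_cte(name, %Kino.DB.CTE{parts: parts}, acc) do
    # Hoist any nested CTEs referenced by this one first.
    acc =
      Enum.reduce(parts, acc, fn
        {:param, {%Kino.DB.CTE{} = nested, nested_name}}, acc ->
          collect_cte(nested_name, nested, acc)

        _other, acc ->
          acc
      end)

    {sql, params} = render(parts)

    case List.keyfind(acc, name, 0) do
      nil -> [{name, sql, params} | acc]
      # Same name with the same query is fine; anything else raises.
      {^name, ^sql, ^params} -> acc
      _conflict -> raise ArgumentError, "duplicate CTE name: #{name}"
    end
  end

  defp render(parts) do
    sql =
      Enum.map_join(parts, fn
        {:query, binary} -> binary
        # A nested CTE is referenced by name; its body lives in the WITH clause.
        {:param, {%Kino.DB.CTE{}, nested_name}} -> to_string(nested_name)
        {:param, _value} -> "?"
      end)

    params =
      for {:param, {value, _name}} <- parts,
          not match?(%Kino.DB.CTE{}, value),
          do: value

    {sql, params}
  end
end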

greg-rychlewski commented 1 year ago

Oh, and sorry, I had a newb question somewhat related to this. Are all the database adapters contained in this repo, or is there some way for people to create, say, a Cassandra adapter and plug it into this? If there is no way currently, is that something you think is doable/desirable?

josevalim commented 1 year ago

Everything is contained in this repo. But anyone can create their own smart cells.

josevalim commented 1 year ago

But we will be glad to support other databases too!

greg-rychlewski commented 1 year ago

Sorry, you might have already said no to this, I'm just not 100% sure :P.

Would you be interested in a PR to create a Kino.DB.Adapter behaviour and refactor the current adapters to use it? Then, if someone wants to create an adapter, they can just implement the behaviour and configure it to be used with this smart cell.

I'm not even sure it's possible. I'm just enjoying working on this repo and trying to think if there's anything I can contribute after the CTE stuff.
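
For concreteness, the proposed behaviour might look something like this (purely illustrative; these callbacks are not an existing kino_db API):

defmodule Kino.DB.Adapter do
  @moduledoc "Hypothetical behaviour for pluggable database adapters."

  @callback connect(keyword()) :: {:ok, conn :: term()} | {:error, term()}
  @callback query(conn :: term(), sql :: String.t(), params :: list()) ::
              {:ok, result :: term()} | {:error, term()}
  # Mirrors the earlier note that each adapter must declare CTE support.
  @callback supports_ctes?() :: boolean()
end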

josevalim commented 1 year ago

@greg-rychlewski I am not sure, because it is not only about the adapter behaviour, but also the generated code and the interface, and I am not sure all of those can be encapsulated behind an adapter without adding potentially a lot more complexity (such as encapsulating the forms themselves). :)

I am also not even sure this package will fully exist as-is in the long term. Explorer is definitely a better tool for data exploration/processing and, if this becomes a reality, then it will be the best tool for data processing/exploration available. So I would take baby steps for now until we have a complete vision.

WDYT?

greg-rychlewski commented 1 year ago

Oh yes, I see what you're saying about Explorer. I'm a bit out of touch with this space, but that makes a lot of sense. And I hadn't considered what you said about the interface/generated code, but that's very true.

I'll stick to this PR :).