Open josevalim opened 2 years ago
I was curious about trying this out. I'm a complete Live Book newb so I apologize in advance :).
Not sure how much you already planned/decided...these are some of the questions I was considering
Oh also one more thing, did you already consider allowing %Ecto.Query{}
in the SQL cell?
I'm not totally clear how this works yet but I thought maybe if the cell evaluates to a query struct or maybe can be supplied in the UI as a variable.
Hi @greg-rychlewski! 👋 Here is what I was thinking:
Kino.DB.CTE
struct with two fields: query and paramsSELECT * FROM foo FROM {{sample_cte}} AS s join s.id == id and status == {{var}}
It will become something like:
query(
"""
WITH "sample_cte" AS (#{sample_cte.query})
SELECT * FROM foo FROM sample_cte AS s join s.id == id and status == $2
""",
sample_cte.params ++ [status]
)
Some notes:
If someone changes the sample_cte
query to add new parameters, then we need to change the parameter position from $2 to $3 and so on. This will be fine while using the smart cell but if someone converts it to code they will need to adjust it manually (which is fine IMO)
Each Kino.DB adapter will have to say if they support CTEs or not
We should implement String.Chars to Kino.DB.CTE
to raise a clear error message in case someone tries to interpolate it in an unexpected place or in a way we cannot parse upfront
@greg-rychlewski does this answer your questions? :)
Makes sense, thank you! I'll give it a shot.
I started playing around with this. One thing that came up is that each database will have different placeholders for the parameters: $n
for postgres, ?
for mysql, ?n
for sqlite. So this means there is no single way to save the CTE with parameters that won't need modification at query time.
Some ways I can think to handle it:
?
and then substitute it for the correct one at query timeI think (3) makes it easier for the dev but worse UX. 2 seems the best to me as long as there are no databases that use ?
as an operator. Or if that's a concern then maybe using {{ ? }}
. Maybe ?
is fine though since it's used for fragments in Ecto.
@greg-rychlewski We can do option 3 for now and then, instead of having an option in the select to be lazy, we can have a checkbox that makes the query as lazy. :)
@greg-rychlewski actually, ignore me. Even if we use $1, $2, $3 for Postgres, we will have to rewrite those to $3, $4, $5 depending on the number of queries we have, so going down with 3 won't make life easier for Postgrex.
I think we will need to go one level of abstraction higher. For example, we could emit this code:
{[sample_cte_query], params} = Kino.DB.to_ctes([sample_cte])
query(
"""
WITH "sample_cte" AS (#{sample_cte_query})
SELECT * FROM foo FROM sample_cte AS s join s.id == id and status == $2
""",
params ++ [status]
)
or this code:
{ctes, cte_params} = Kino.DB.to_ctes([sample_cte: sample_cte])
query(
"""
WITH #{ctes}
SELECT * FROM foo FROM sample_cte AS s join s.id == id and status == $2
""",
cte_params ++ [status]
)
And then for PG we have a slightly different version which is called to_indexed_ctes
. It depends how much of the process we want to expose. The internal representation can be simple and mirror Ecto fragments (i.e. a list of {:query | :param, value} tuples).
Thoughts?
Sorry for the delayed response @josevalim. If I'm understanding correctly, then I think what you proposed is a very nice solution. Do I understand correctly:
Someone creates 2 CTEs (i'm putting the variable declaration together with the queries just to make it easier to write, but I understand they will be in different cells):
name1 = "bob"
name2 = "joe"
select * from users where name = {{ name1 }} limit 1
select * from users where name = {{ name2 }} limit 1
These will get converted into the following structs:
cte1 = %Kino.DB.CTE{parts: [query: "select * from users where id = ", param: "bob", query: " limit 1"]}
cte2 = %Kino.DB.CTE{parts: [query: "select * from users where id = ", param: "joe", query: " limit 1"]}
Then when interpolated into a new query
salary_threshold = 100
select * from {{ cte1 }} join {{ cte2 }} where cte1.salary > {{ salary_threshold }}
the %Kino.DB.CTE{}
interpolations will be separated out and their queries will be parsed and parameters collected (indexed or not depending on the db):
{ctes, cte_params} = Kino.DB.to_ctes([cte1: cte1, cte2: cte2])
# ctes = """
# WITH cte1 AS (
# select * from users where name = $1 limit 1
# ), cte2 AS (
# select * from users where name = $2 limit 1
# )
#
# cte_params = ["bob", "joe"]
query(
"""
#{ctes}
SELECT * from cte1 join cte2 where cte1.salary > $3,
""",
["bob", "joe", 100]
)
If no misunderstanding above, the only other situation I think we need to consider is nested CTEs. Because it's pretty common to have one CTE reference another.
I don't think we want to create the with ...
stuff as part of the CTE struct, right? Because it will just have to be rewritten when it's used in an actual query. If this is the case then I think we need to add a :name
field to the Kino.DB.CTE
struct? That way when we collect CTEs while parsing the query we have a name to replace it with.
And then there is the situation of duplicate names. It could be the case that a nested CTE is saved with some variable name. And then later on that variable name is reused for a different CTE. And then both of these are used in the same query. i.e:
cte1 = select * from users where name = "bob" limit 1
cte2 = select * from {{ cte1 }} limit 1
cte1 = select * from users where name = "dan" limit 2
query = select * from {{ cte2 }} join {{ cte1 }}
So it's a bit of a weird situation because it's fine to re-use the name as long as the query is the same. I can think of 2 choices:
Great points about nested CTEs! The name is defined in an input of the smart cell, so we should be able to forbid redefinitions or error/warn in said cases!
About nested CTEs, your proposed structure needs to change a tiny bit but otherwise it is spot on. Instead of this:
cte1 = %Kino.DB.CTE{parts: [query: "select * from users where id = ", param: "bob", query: " limit 1"]}
We will need to have something like:
query: binary | param: {value, maybe_cte_name}
Where maybe_cte_name
is the variable name of the interpolation in case it might be a CTE.
Then:
{ctes, cte_params} = Kino.DB.to_ctes([cte1: cte1, cte2: cte2])
can recursively traverse and handle any nested CTE.
Oh and sorry I had a newb question somewhat related to this. Are all the database adapters contained in this repo or is there some way for people to create, say, a Cassandra adapter and plug it into this? If no way currently is that something you think is doable/desirable?
Everything is contained in this repo. But anyone can create they own smart cells.
But we will be glad to support other databases too!
Sorry you might have already said no to this, just not 100% sure :P.
Would you be interested in a PR to create a Kino.DB.Adapter
behaviour and refactor the current ones into using it? And then if someone wants to create an adapter they can just implement the behaviour and configure it to be used with this smart cell.
Not even sure if possible. Just enjoying working on this repo and trying to think if there's anything I can contribute after the CTE stuff.
@greg-rychlewski I am not sure because it is not only about the adapter behaviour, but also the generated code, and the interface, and I am not sure all of those can be encapsulated behind an adapter with adding potentially a lot more complexity (such as encapsulating the forms themselves). :)
I am also not even sure this package will fully exist as is in the long term. Explorer is definitely a better tool for data exploration/processing and, if this becomes a reality, then it will be the best tool for data processing/exploration available. So I would go baby steps for now until we have a complete vision.
WDYT?
Oh yes I see what you're saying about Explorer. I'm a bit out of touch with this space but that makes a lot of sense. And I didn't consider what you said about the interface/generated code but that's very true.
I'll stick to this PR :).
We can make it so a query is "lazy", so it returns a query construct that can be interpolated into other queries in order to build CTEs.