jupyter-xeus / xeus-sql

Jupyter kernel for SQL databases
https://xeus-sql.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License
163 stars 22 forks source link

WITH clause doesn't work with SQLite #50

Open n87 opened 3 years ago

n87 commented 3 years ago

Tested on these queries in the cloud demo [1]:

with t as (select * from employees) select * from t
with t as (select 1) select 1

It doesn't output anything, but also doesn't show errors

[1] https://mybinder.org/v2/gh/jupyter-xeus/xeus-sql/stable?urlpath=lab/tree/examples/XVega%20operations.ipynb

marimeireles commented 3 years ago

Hey @n87, thank you for opening the issue :) That's true. I don't understand why it's not being picked up by SOCI though. We treat this case in the following lines:

            else
            {
                if (this->sql)
                {
                    /* Shows rich output for tables */
                    if (xv_bindings::case_insentive_equals("SELECT", tokenized_input[0]) ||
                        xv_bindings::case_insentive_equals("DESC", tokenized_input[0]) ||
                        xv_bindings::case_insentive_equals("DESCRIBE", tokenized_input[0]) ||
                        xv_bindings::case_insentive_equals("SHOW", tokenized_input[0]))
                    {
                        nl::json data = process_SQL_input(code, xv_sql_df);

                        publish_execution_result(execution_counter,
                                                 std::move(data),
                                                 nl::json::object());

                    }
                    /* Execute all SQL commands that don't output tables */
                    else
                    {
                        *this->sql << code;
                    }
                }

https://github.com/jupyter-xeus/xeus-sql/blob/master/src/xeus_sql_interpreter.cpp#L295-L317

Here are some things I'm thinking could be the culprit for this issue:

I'm not sure, these are just some hints. If you or anyone want to tackle this issue I'm here to help. Unfortunately I can't dedicate so much time to this project right now. (Maybe in the future!) But thanks again for opening this and please feel free to ask questions.

n87 commented 3 years ago

@marimeireles it looks like your snippet is just enough to explain the issue. You check that first token is one of (SELECT, DESC, DESCRIBE, SHOW). But SELECT queries can also start with WITH or with VALUES: https://sqlite.org/lang_select.html

Note that WITH can also precede DELETE, INSERT etc., so I'm not sure what the fix should be.

marimeireles commented 3 years ago

Hum! This is where my lack of knowledge in SQL attacks again! Sorry for this, I'm mostly a C++/Python dev and just know the very basics of SQL.

Yeah, seems a bit complicated. If I understand correctly this WITH clause is like a "variable(? did I get that right?)" that you can store the values of the query? And then you have the RECURSIVE modifier, even.

I can't tell if SOCI offers support for it, because they have no entries about WITH in their docs. And looking further, they have no WITH tests nor examples, maybe they don't support it. So first step as I said, is making sure they do. Maybe one could even open an issue upstream asking about it and how to use it.

Afterwards... I'd say the steps to fix this would be:

n87 commented 3 years ago

A more googlable term for WITH clause is Common Table Expression. It is like defining variables, but more closer equivalent is let clause in functional languages, e.g. Haskell:

let x = 2 in x * x

Somewhat contrived SQL(ite) equivalent:

> with t as (select 2 as x) select x * x from t;
x * x
----------
4
darkdreamingdan commented 2 years ago

Just came across this issue. Wanted to use Jupyter to teach Common Table Expressions, but seems it doesnt work :(.

Is it just a matter of whitelisting the missing WITH and VALUES as the first token in the parser? That feels like it could be a straightforward fix? Even if there are edge cases where the result does not need to output, it's still better than current behaviour which does not output at all.

Please could we just add this as a quick fix? 🙏

marimeireles commented 2 years ago

Hey @darkdreamingdan, this is not a simple issue, unfortunately. This is a couple of my hours of work, for sure and I can't invoice this project currently, that's why I left it open for the community. But I just thought on an interesting way of implementing it and I'll give it a try on my free time.

My explanation in the previous comment is a bit confusing but I think there are two parts of this issue:

  1. treating queries as a variable and saving them -> this is what I'm going to work on first, it seems relatively simple, from what I'm thinking...
  2. Implementing recursive behavior -> that's a bit more complicated and I still haven't thought on a way to do it

I'll try to tackle it this week. Will keep you updated of any progress.

Another thing you could try is opening an issue in SOCI and asking them about the possibility of including this feature. If they support it then I'll add it here, as it's very easy to :)

stevemarin commented 2 years ago

Seeing same issue, so I came here to see if this was previously reported. And you're already working it! Thank you very, very much for these wonderful tools!

:tracking:

scheb commented 1 year ago

Sorry to bring up this topic again. I'm having pretty much the same use case as @darkdreamingdan. I'd love to use xeus-sql to teach students how to use SQL. It's pretty much perfect for my use-case, except for this issue. We'd like to teach them the benefit of structuring queries using WITH for better readability.

I'm not having much knowledge about C++, but it seems to me, these lines here determine which kind of queries are displaying a result in the notebook, based on the first keyword. Syntactically the WITH statement has to come first in a SELECT query. Adding the WITH keyword there, could potentially fix it. But that's just a guess.

https://github.com/jupyter-xeus/xeus-sql/blob/315057db8bc81b71d13c9377782df162dcc8663c/src/xeus_sql_interpreter.cpp#L312-L315

Let me know what you think!

camelusminimus commented 2 months ago

If I may drop into this issue after such a long time, I am a large fan of CTE myself, and thus would be very happy if they could be taught early using jupyter...

I fear my understanding of the problem, and the way @marimeireles described it are a bit at odds. The following is my understanding:

CTE do not modify anything outside of the current query, they only provide a named binding to a (sub) query that (afaik) is limited to varying types of SELECT queries. But this binding is without consequence outside of the current query. Thus the Queries:

WITH
inner_query_a as (
   SELECT
       1 AS id,
       5 AS value
),
inner_query_b as (
   SELECT
       1 AS id,
       2 AS value
)

SELECT
    inner_query_a.value * inner_query_b.value
FROM inner_query_a, inner_query_b
WHERE inner_query_a.id=inner_query_b.id;

and

SELECT
    inner_query_a.value * inner_query_b.value
FROM
    (SELECT 1 AS id, 5 AS value) AS inner_query_a,
    (SELECT 1 AS id, 2 AS value) AS inner_query_b
WHERE
    inner_query_a.id = inner_query_b.id;

are completely identical. This is only a tool to better structure your query, thus there should not need to be any additional support from the SOCI/xeus-sql side, beside accepting that a query that starts with "WITH" can return a table.

The only potential proplem I see is that it does not have to return a table. From a high-level "What does the sql query do?" Point of view, the CTE does not really matter. Thus the simple "first token" approach to checking if the query returns some kind of result set is not valid. And fixing it to correctly ignore the CTE amounts to writing a more or less complete grammar of SQL...as understood by the various backends...yikes...

A question I would pose is, is this distinction needed after all? Due to the way "first line comments" are now allowed (as in commit adacbae525b526c072799a5d647c8c79725cbfeb ) this whitelisting approach is not really consistent anymore, or do I missunderstand anything?

Would it perhaps be possible to instead change to a blacklisting approach, i.e. define some queries that are sure not to return a result set from scanning the first line, and handle that, and show the resultset of all others, even if that resultset might be empty?