ploomber / jupysql

Better SQL in Jupyter. 📊
https://jupysql.ploomber.io
Apache License 2.0
723 stars 76 forks source link

SqlMagic.autolimit doesn't add a limit clause to the query #1006

Open takikadiri opened 7 months ago

takikadiri commented 7 months ago

What happens?

We're using jupysql to query a trino cluster through jupyter notebook. I want to autolimit the query results that we query to the cluster, as it's intended to be used for an interactive analytics workload.

I expect the SqlMagic.autolimit to add a LIMIT clause to the query, but it doesn't, i can see the original query in the Trino Admin UI.

To Reproduce

OS:

Linux

JupySQL Version:

0.10.10

Full Name:

Takieddine KADIRI

Affiliation:

Société Générale Assurances

edublancas commented 7 months ago

yeah, the effect of autolimit is not to add a LIMIT to the statement but to send the query and only fetch the top X results, are you seeing performance issues?

you can see the relevant code here.

the main reason not to add LIMIT, is that some DBs don't support it so we'd have to figure out a way to make it portable. I guess it wouldn't be too hard using sqlglot, but given that we support many databases, adding stuff like this ends up breaking features for some users.

perhaps a more reasonable solution would be to add LIMIT only to the dbs that we're sure they support it, but even detecting which db the user is connected to is challenging and we've encountered edge cases.

bottom line is that we've decided not to modify user-submitted queries to prevent these issues. if you have time, feel free to open a PR, perhaps there is another approach that I'm overlooking

takikadiri commented 7 months ago

Ah i see, The doc mislead me. Thank you for the quick feedback.

For the interactive analytics workload, we're also using Apache Superset, that manage to systematically LIMIT the query. That's why i was looking for this in jupysql

Maybe i could use jupysql for a slighly different use case, where the desired behavior is to collect the entire query results for further analysis with pandas or for doing joins with another data sources. I'll tests the user bahaviors and the cluster performance with jupysql and autopandas=True and let you know if the LIMIT clause is really needed here.

edublancas commented 7 months ago

sounds good, thanks for your feedback!