jupyter-xeus / xeus-sql

Jupyter kernel for SQL databases
https://xeus-sql.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License
164 stars 22 forks source link

paging for large result sets? #59

Open jameshowison opened 2 years ago

jameshowison commented 2 years ago

I'm running into responsiveness issues with queries with many results (~100,000 order of magnitude).

If it was just me then I'd add LIMIT statements, remembering to remove them later in the query buildup. But I'm working with students and asking doesn't make it so.

Any chance that the result sets could be paged, with the interface only handling those currently being looked at? This is a common approach in things like phpmyadmin or other sql clients, but I don't know whether SOCI makes that easy or not?

marimeireles commented 2 years ago

hey @jameshowison, thanks for opening the issue. I was reading the discussion you had on the issue linked. Even if this is not a xeus-sql issue per-se is still an interesting question.

Any chance that the result sets could be paged, with the interface only handling those currently being looked at? This is a common approach in things like phpmyadmin or other sql clients, but I don't know whether SOCI makes that easy or not?

Do you know how this is implemented under the hood? If it's being paged it means they're running the statement again with a LIMIT set by them?

I don't know how we could implement something sensitive to being watched by the user, but we could think on some kind of pagination.

jameshowison commented 2 years ago

Yeah, I think phpmyadmin and others just use LIMIT and OFFSET. Although I'm far from sure (it's what phpmyadmin shows as you browse). I found this note about using resultset/cursor and postgres specific FETCH. Perhaps something like that is implemented with SOCI?

https://www.citusdata.com/blog/2016/03/30/five-ways-to-paginate/

Perhaps some of the impact can be resolved using something like https://pypi.org/project/jupyterlab-limit-output/

marimeireles commented 2 years ago

Perhaps some of the impact can be resolved using something like https://pypi.org/project/jupyterlab-limit-output/

Yeah, that seems like something that should work out of the box, tbh.

jameshowison commented 2 years ago

Thinking about this more, perhaps an across platform approach that protects against the most common error cases without the hassle of pagination would be to add LIMIT 100 to each query (unless another LIMIT was there) and include a link at the bottom to "Show All results" which removes the LIMIT 100?

I think needed to interactively look at more than 100 results would be exceedingly rare (I mean 100 is too many to look through anyway :)

jameshowison commented 2 years ago

Just a note that I ran into this today with Jupyter becoming unresponsive with a 38MB ipynb file (resulting from large results from SELECT * FROM table queries.

jameshowison commented 2 years ago

Just a note here (in case others find this in future) that https://github.com/kynan/nbstripout is able to resurrect files that have become unresponsive due to containing very large amounts of output.

jameshowison commented 2 years ago

What about adding a parameter to the LOAD, something like limit_output = 100 would be a good explicit way to handle this. While adding a LIMIT 100 to the query enroute to the server would be one approach, another could be simply to truncate the results to never fill up output to a level that can make the page unresponsive. So something truncating rows around xeus_sql_interpreter.cpp:93?

marimeireles commented 2 years ago

Thinking about this more, perhaps an across platform approach that protects against the most common error cases without the hassle of pagination would be to add LIMIT 100 to each query (unless another LIMIT was there) and include a link at the bottom to "Show All results" which removes the LIMIT 100?

I think needed to interactively look at more than 100 results would be exceedingly rare (I mean 100 is too many to look through anyway :)

This would also work.

I'm not sure I understand your last comment. How would that differ from using LIMIT?

jameshowison commented 2 years ago

Last comment meant that I almost never manually look at results above 100 rows, I load larger results into a data analysis or visualization program. But I think having this configurable using limit_output = in the load cell is a great option.