kdheepak opened this issue 1 month ago
For an arbitrary query, computing the number of rows isn't cheap. For some data sources, like text files, computing the number of rows can be extremely expensive.
Polars can only show you this information if you're using its DataFrame structure, not LazyFrames.
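The eager/lazy split is the crux here: an eager frame already holds its rows, so the count is free bookkeeping, while a lazy pipeline is closer to a Python generator, where counting forces full evaluation. A stdlib-only analogy (not polars itself), to illustrate why a LazyFrame can't show its row count for free:

```python
# Eager: data is materialized, so len() is O(1) bookkeeping.
eager_rows = [1, 2, 3]
print(len(eager_rows))  # 3

# Lazy: rows are produced on demand; counting consumes the whole pipeline.
lazy_rows = (x * x for x in range(3))
count = sum(1 for _ in lazy_rows)  # forces evaluation of every row
print(count)  # 3
print(list(lazy_rows))  # [] -- the pipeline is now exhausted
```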
Adding the number of columns to the repr seems reasonable.
In interactive mode though I’m not really concerned about performance, right?
Why wouldn't performance be relevant in interactive mode?
I figure because in interactive mode I'm exploring the data and printing the dataframe often. After exploring and building the sequence of expressions, I move the code into a function or method in a package, and only then do I really care about performance.
In interactive mode, I'm usually cleaning or validating data, and seeing the number of rows can speed up that process for me. For example, when working with timeseries data, seeing that there are 8760 rows tells me I have hourly data for 365 days of the year. Seeing 8784 would tell me it is a leap year, and 8736 would tell me a day is missing. Not having it printed out by default means I have to explicitly look for these issues.
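The row counts above follow directly from hours-per-day arithmetic, as a quick sanity check shows:

```python
# Hourly timeseries row counts for a calendar year.
HOURS_PER_DAY = 24

full_year = 365 * HOURS_PER_DAY    # 8760: ordinary year, complete
leap_year = 366 * HOURS_PER_DAY    # 8784: leap year, complete
missing_day = 364 * HOURS_PER_DAY  # 8736: one full day of readings missing

print(full_year, leap_year, missing_day)  # 8760 8784 8736
```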
It also gives me a sense of how long a computation might take or how long it might take to write the data to disk. I'm not sure if this directly applies here, but when I've used pandas in the past, if I have a few thousand rows I can use df.apply with a Python lambda and get away with a hacky data-cleaning workflow. If I have a few million rows, then I know I have to look for a more performant alternative.
I'm also assuming that people working with CSV files don't have so many rows that counting them becomes an issue. You definitely have a much better sense of how performance varies across ibis's different backends, so I'd obviously defer to your judgement.
That said, an alternative to printing the shape would be printing the first N/2 rows and last N/2 rows, instead of printing the first N rows as it does right now (if I'm reading the source correctly here). Usually there's an id column or a timestamp column that provides the information I'm after.
Or alternatively, maybe the repr could show the number of rows only for backends that support computing it quickly?
I agree with @kdheepak. Maybe this could become another mode, interactive_detailed or something, that would give you both the row and column counts.
I am +1 on showing the row count by default in interactive mode. If your query is slow, then it's easy: just add a .head() call after it, and you are back to the current performance. I think we could make this discoverable as well: if computing the row count takes <5 seconds, we just show the row count in the repr. If it takes more, then we still show the row count, but emit a warning, e.g. "computing the row count was slow. You may be able to avoid this by only showing the first few rows using table.head()". I suggest a warning, and not just an addendum in the repr, because consider: I make a preview call, and it is taking forever, so I cancel the whole operation, and therefore I never actually see the final suggestion. The suggestion should get shown at ~5 seconds, regardless of whether the query completes.
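The "hint at ~5 seconds, even if the user cancels" behavior could be sketched with a timer thread. This is a minimal sketch, not ibis code: `count_rows` is a hypothetical callable standing in for the backend's count query, and the 5-second threshold comes from the comment above.

```python
import threading
import warnings

SLOW_COUNT_SECONDS = 5  # threshold suggested in the discussion above

def _slow_count_hint():
    warnings.warn(
        "Computing the row count was slow. You may be able to avoid this "
        "by only showing the first few rows using table.head()."
    )

def count_with_hint(count_rows):
    """Run count_rows(), emitting a hint if it runs longer than the threshold.

    Because the hint fires on a separate timer thread, the user sees it at
    the 5-second mark even if they later cancel the count itself.
    """
    timer = threading.Timer(SLOW_COUNT_SECONDS, _slow_count_hint)
    timer.start()
    try:
        return count_rows()
    finally:
        timer.cancel()  # fast counts produce no hint

print(count_with_hint(lambda: 42))  # 42, with no warning emitted
```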
Alternatively, this could be what the .preview() method is for. If you don't want the row count, you call that.
I do have sympathy for new users though. I would know how to mitigate a slow preview, but I'm not sure new users would, and we need to remove the sharp edges for them.
Usually there's an id column or a timestamp column that provides the information I'm after.
I don't think we should assume this, this is not true in the data I am working with.
first N/2 rows and last N/2 rows
-1 on this too, I would rather just have the first N rows and see the row count.
OK, now I'm going much broader than just this PR, but maybe useful to think about as context:
I think it's good to note that interactive mode can already be slow. In particular this comes up if there is some streaming-blocking operation in the pipeline, like a sort, groupby, or window function. In these cases, previewing is going to be slow either way, regardless of whether we execute the row count or not. So this suggested change only affects performance in certain circumstances.
Both the issue I describe above and the change suggested here really present the same story: I am a user in interactive mode, and my previews are slow. What can ibis do to make this experience better? Perhaps if an interactive query takes more than ~10 seconds we emit a warning or print or log or add a note in the repr that links to a troubleshooting page in the docs? In that doc we could suggest the solutions of .cache() or .head(), and point out the pros/cons of each.
Perhaps it wasn't clear from what I said before, but computing count(*) of query and computing select * from query limit 11 will, in the worst case, require computing query twice. That doesn't seem like an acceptable trade-off.
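The double-evaluation concern can be made concrete with plain SQL. This sketch uses stdlib sqlite3 as a stand-in backend: the same subquery text appears in both statements, and without a materialized intermediate result each statement evaluates it from scratch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE raw (x INTEGER);
    INSERT INTO raw VALUES (1), (2), (3), (4), (5);
""")

# Stand-in for an arbitrary, possibly expensive user query.
query = "SELECT x * x AS y FROM raw WHERE x > 1"

# Showing both a row count and a preview means running the query twice:
(count,) = conn.execute(f"SELECT count(*) FROM ({query})").fetchone()
preview = conn.execute(f"SELECT * FROM ({query}) LIMIT 11").fetchall()

print(count)    # 4
print(preview)  # [(4,), (9,), (16,), (25,)]
```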
Tracking the duration of the query would almost certainly require spinning up a thread to time the count query, which also adds more complexity.
Perhaps the question is: why is the additional complexity, maintenance burden, discussion and consideration of edge cases required to show the row count worth taking on?
first N/2 rows and last N/2 rows
The first and last rows of a table in most of our backends will be completely arbitrary, because a table in a SQL database has no built in notion of row order, so this isn't feasible.
What do you think about having a user opt into printing row counts? e.g.
ibis.options.show_row_count_in_interactive_mode = True # todo: choose better name
Even a naive non-performant solution would be sufficient in this case.
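A rough sketch of how such an opt-in could behave. The Options class, option name, and Table class here are hypothetical stand-ins, not ibis's actual API; the point is just that the repr branches on a global, off-by-default flag.

```python
from dataclasses import dataclass

@dataclass
class Options:
    # Hypothetical option name; off by default so existing previews stay fast.
    show_row_count_in_interactive_mode: bool = False

options = Options()

class Table:
    """Toy stand-in for an ibis table, holding rows as dicts."""

    def __init__(self, rows):
        self._rows = rows

    def count(self):
        return len(self._rows)  # naive; a real backend may be slow here

    def __repr__(self):
        preview = f"Table with columns {list(self._rows[0])}"
        if options.show_row_count_in_interactive_mode:
            preview += f"\n({self.count()} rows)"
        return preview

t = Table([{"a": 1}, {"a": 2}])
options.show_row_count_in_interactive_mode = True
print(repr(t))  # the repr now includes "(2 rows)"
```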
I would love an opt-in way to automatically show row counts. I get that it shouldn't be the default, but I'm using ibis quite a bit with smallish tables, and not being able to quickly glance at lengths makes it much harder to work with than other dataframe libraries. Adding this feature would drastically improve the package as an alternative to pandas, or dplyr in R, for things like interactive debugging.
Not entirely opposed to an option for this but it requires someone else to do the work, and I'll review the PR/help get it over the finish line.
A trade-off to consider here is that this feature is likely to generate some new bug reports about performance, so there'll be extra work to do fielding those reports.
I have an idea for how this API could work, to unify it with .preview(): in Table.__repr__, we first try to use a per-instance config. If it is None, we fall back to the ibis-wide one. @cpcloud any thoughts on this API? It might be confusing; I can make a PR to make it concrete.
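The per-instance-with-global-fallback lookup described above is a common configuration pattern. A sketch under the assumption that None means "defer to the ibis-wide option" (all names here are hypothetical):

```python
# Global default used when a table has no per-instance setting.
GLOBAL_SHOW_ROW_COUNT = False

class Table:
    def __init__(self, show_row_count=None):
        # None means "defer to the ibis-wide option".
        self._show_row_count = show_row_count

    def _effective_show_row_count(self):
        if self._show_row_count is not None:
            return self._show_row_count  # explicit per-instance choice wins
        return GLOBAL_SHOW_ROW_COUNT     # otherwise fall back to the global

print(Table()._effective_show_row_count())                     # False
print(Table(show_row_count=True)._effective_show_row_count())  # True
```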
@kdheepak @contang0 @lboller-pwbm you all chimed in so far, we would love your feedback on #10518
Looks very good, I have nothing to add apart from what @lboller-pwbm already mentioned. The row count should ideally be visible without needing to scroll horizontally if a table is wide, i.e. it should be left-justified.
By the way, sometimes I wish the backend name was also somehow indicated when inspecting a table. Sometimes I work with multiple backends in the same project and I have no idea whether I can join two tables before I try. I know it's probably a very niche problem, but something to keep in mind if more people have the same issue.
Do y'all have strong opinions on what the default behavior should be (I imagine since you are in this issue you are going to be a non-representative sample of the ibis user community)? Can you comment at all on what sort of workload you are typically doing, one of the below, or a different one?
@cpcloud I think that PR is ready for your thoughts whenever you get the chance, thank you!
One of the major reasons I like using ibis is that the API is cleaner than pandas / polars etc. I can usually write a lot less code and get the same things done. Most of the data I work with on a day-to-day basis is in the "few thousand rows" magnitude, and I use the duckdb backend. I expect I'll have to scale in the future on at least one of my projects, and I'm hoping ibis will make that trivial by changing the backend.
I am more in the fifth bucket, currently - but I'm more hesitant to make the default something that would slow down interactive mode a lot at scale. I think an option like the one suggested in https://github.com/ibis-project/ibis/issues/10231#issuecomment-2387951607 would be good and/or letting the user set an environment variable to toggle between a fast or detailed version of interactive mode as the default depending on the project.
Do y'all have strong opinions on what the default behavior should be (I imagine since you are in this issue you are going to be a non-representative sample of the ibis user community)? Can you comment at all on what sort of workload you are typically doing, one of the below, or a different one?
- ibis.duckdb.connect("mydb.db").table("my_table") (I expect ~0 difference)
- ibis.duckdb.connect().read_parquet("t.pq") (I expect ~0 difference)
- ibis.duckdb.connect().read_csv("thousand_rows.csv") (I expect a small difference)
- ibis.duckdb.connect().read_csv("billion_rows.csv") (I expect a large difference)
- ibis.duckdb.connect().read_csv("thousand_rows.csv").some_expensive_computation() (I expect a difference; size depends on the semantics of the function)
environment variable
@lboller-pwbm I've wanted this too. I started #10523 for any discussion of that.
The way I access my data is usually similar to 1. I tend to start with a connection to Impala, and persist data to memory (duckdb) as soon as it's reasonably small. This is why seeing backend name would also be useful for me. Sometimes I get lost at which step I persisted the data to a memtable.
Is your feature request related to a problem?
When interactively exploring data, it's always an additional step to get the "shape" of the data. This is what the display looks like right now:
What is the motivation behind your request?
I find myself typing the following often:
Describe the solution you'd like
I'd like for the default display to print the shape of the dataframe (along with the name of the table too would be nice). Polars prints the shape of the data as the first line and it is very convenient:
What version of ibis are you running?
version 9.5.0
What backend(s) are you using, if any?
DuckDB