ibis-project / ibis

the portable Python dataframe library
https://ibis-project.org
Apache License 2.0

feat: show `df.count(), len(df.schema())` when displaying `interactive = True` tables #10231

Open kdheepak opened 1 month ago

kdheepak commented 1 month ago

Is your feature request related to a problem?

When interactively exploring data, it's always an additional step to get the "shape" of the data. This is what the display looks like right now:

[screenshot of the current interactive table display]

What is the motivation behind your request?

I find myself typing the following often:

print(df.count(), len(df.schema()))

Describe the solution you'd like

I'd like for the default display to print the shape of the dataframe (along with the name of the table too would be nice). Polars prints the shape of the data as the first line and it is very convenient:

[screenshot: Polars prints `shape: (rows, columns)` as the first line of its repr]

What version of ibis are you running?

version 9.5.0


What backend(s) are you using, if any?

DuckDB


cpcloud commented 1 month ago

For an arbitrary query, computing the number of rows isn't cheap. For some data sources, like text files, computing the number of rows can be extremely expensive.
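To illustrate the point about text files, here is a minimal stdlib-only sketch (not ibis code): a CSV carries no row-count metadata, so the only way to learn its length is to scan every byte of the file.

```python
# Sketch: counting rows in a CSV requires a full scan of the file,
# because the format stores no row-count metadata.
import csv
import io

# A stand-in for a CSV file on disk.
data = io.StringIO("id,value\n1,a\n2,b\n3,c\n")

rows = list(csv.reader(data))
n_rows = len(rows) - 1  # subtract the header row; the whole file was read
```

For a multi-gigabyte file, that full scan is exactly the cost being discussed.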

Polars can only show you this information if you're using its DataFrame structure, not LazyFrames.

Adding the number of columns to the repr seems reasonable.

kdheepak commented 1 month ago

In interactive mode though I’m not really concerned about performance, right?

cpcloud commented 1 month ago

Why wouldn't performance be relevant in interactive mode?

kdheepak commented 1 month ago

I figure because in interactive mode I'm exploring the data and printing the dataframe often. After exploring and building up the sequence of expressions, I move the code into a function or method in a package, and only then do I really care about performance.

In interactive mode, I'm usually cleaning data or validating data and seeing the number of rows can speed up the process for me. For example, when working with timeseries data, seeing that there are 8760 rows tells me I have hourly data for 365 days of the year. Seeing 8784 would tell me it is a leap year or 8736 would tell me a day is missing. Not having it printed out by default means I have to explicitly look for these issues.
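The arithmetic behind those row counts, spelled out:

```python
# Hourly-timeseries row counts from the comment above.
HOURS_PER_DAY = 24

hours_normal_year = 365 * HOURS_PER_DAY              # 8760: full non-leap year
hours_leap_year = 366 * HOURS_PER_DAY                # 8784: leap year
hours_missing_day = hours_normal_year - HOURS_PER_DAY  # 8736: one day absent
```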

It also gives me a sense of how long a computation might take or how long it might take to write the data to disk. I'm not sure if this directly applies here, but when I've used pandas in the past, if I have a few thousand rows I can use df.apply with a Python lambda and get away with a hacky data cleaning workflow. If I have a few million rows, then I know I have to look for a more performant alternative.

I'm also assuming that people working with CSV files don't have so many rows that counting them becomes an issue. You definitely have a much better sense of how performance varies across the different ibis backends, so I'd obviously defer to your judgement.


That said, an alternative to printing the shape would be printing the first N/2 rows and last N/2 rows, instead of printing the first N rows as it does right now (if I'm reading the source correctly here). Usually there's an id column or a timestamp column that provides the information I'm after.

Or alternatively, maybe only show the number of rows for backends that can compute it quickly?

contang0 commented 1 month ago

I agree with @kdheepak. Maybe this could become another mode, interactive_detailed or something similar, that would give you both the row and column counts.

NickCrews commented 1 month ago

I am +1 on showing the row count by default in interactive mode. If your query is slow, it's easy: just add a .head() call after it, and you are back to the current performance.

I think we could make this discoverable as well: if computing the row count takes <5 seconds, we just show the row count in the repr. If it takes longer, we still show the row count, but emit a warning, e.g. "computing the row count was slow. You may be able to avoid this by only showing the first few rows using table.head()".

I suggest a warning, and not just an addendum in the repr, because consider: I make a preview call, it takes forever, so I cancel the whole operation, and therefore I never actually see the final suggestion. The suggestion should get shown at ~5 seconds, regardless of whether the query completes.
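A minimal sketch of the timer-based warning described above. The function name and the 5-second threshold are illustrative only, not an ibis API; a real implementation would hook into the repr machinery.

```python
import threading
import warnings

SLOW_SECONDS = 5.0  # hypothetical threshold from the comment above


def count_with_warning(run_count, threshold=SLOW_SECONDS):
    """Run a row-count callable, warning if it is still running after `threshold` seconds."""
    timer = threading.Timer(
        threshold,
        lambda: warnings.warn(
            "computing the row count was slow; you may be able to avoid "
            "this by only showing the first few rows using table.head()"
        ),
    )
    timer.start()
    try:
        # The timer fires even if this call never returns (e.g. the user
        # cancels), which is the behavior argued for above.
        return run_count()
    finally:
        timer.cancel()  # fast counts never trigger the warning
```

A fast count like `count_with_warning(lambda: 42)` returns immediately with no warning; only a count still running at the threshold triggers it.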

Alternatively, this could be what the .preview() method is for. If you don't want the row count, you call that.

I do have sympathy for new users though. I would know how to mitigate a slow preview, but I'm not sure new users would, and we need to remove the sharp edges for them.

Usually there's an id column or a timestamp column that provides the information I'm after.

I don't think we should assume this, this is not true in the data I am working with.

first N/2 rows and last N/2 rows

-1 on this too, I would rather just have the first N rows and see the row count.

OK, now I'm going much broader than just this PR, but maybe useful to think about as context:

I think it's good to note that interactive mode already can be slow. I think in particular this comes up if there is some streaming-blocking-operation in the pipeline, like a sort, groupby, or window function. In these cases, previewing is gonna be slow either way, regardless of if we execute the row count or not. So this suggested change is only affecting the performance in certain circumstances.

Both the issue I describe above and the suggested change here tell the same story: I am a user in interactive mode, and my previews are slow. What can ibis do to make this experience better? Perhaps if an interactive query takes more than ~10 seconds we emit a warning, print, log, or add a note in the repr that links to a troubleshooting page in the docs. There we could suggest .cache() or .head() as solutions and point out the pros/cons of each.

cpcloud commented 1 month ago

Perhaps it wasn't clear from what I said before, but computing count(*) of a query and computing select * from query limit 11 will, in the worst case, require computing the query twice. That doesn't seem like an acceptable trade-off.
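This point can be made concrete with stdlib sqlite3 as a stand-in for any SQL backend: the preview and the count are two independent statements, so whatever work `query` represents may be evaluated once per statement.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (x INTEGER)")
con.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(100)])

# Preview: the backend can stop producing rows after 11.
preview = con.execute("SELECT * FROM t LIMIT 11").fetchall()

# Row count: a second, separate statement over the same underlying query.
# If `t` were an expensive subquery, it could be recomputed here in full.
(n_rows,) = con.execute("SELECT count(*) FROM t").fetchone()
```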

cpcloud commented 1 month ago

Measuring the duration of the count query would almost certainly require spinning up a thread, which also adds more complexity.

Perhaps the question is: why is the additional complexity, maintenance burden, discussion and consideration of edge cases required to show the row count worth taking on?

cpcloud commented 1 month ago

first N/2 rows and last N/2 rows

The first and last rows of a table in most of our backends will be completely arbitrary, because a table in a SQL database has no built in notion of row order, so this isn't feasible.

kdheepak commented 1 month ago

What do you think about having a user opt into printing row counts? e.g.

ibis.options.show_row_count_in_interactive_mode = True # todo: choose better name

Even a naive non-performant solution would be sufficient in this case.
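A sketch of how such an opt-in flag could gate the count. The option name echoes the placeholder above, and the repr shape is hypothetical, not the actual ibis API:

```python
from dataclasses import dataclass


@dataclass
class Options:
    # Hypothetical flag; off by default, so current behavior is unchanged.
    show_row_count_in_interactive_mode: bool = False


options = Options()


def format_header(table_name, n_cols, count_rows):
    """Build the first line of an interactive repr.

    `count_rows` is a callable so the (possibly expensive) count query
    only runs when the user has opted in.
    """
    header = f"{table_name}: {n_cols} columns"
    if options.show_row_count_in_interactive_mode:
        header = f"{header}, {count_rows()} rows"
    return header
```

With the flag off the count callable is never invoked, which is what makes the naive implementation acceptable as an opt-in.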

lboller-pwbm commented 1 month ago

I would love to get an opt-in way to automatically show row counts - I get that it shouldn't be the default, but I'm using Ibis quite a bit with smallish tables, and not being able to quickly glance at lengths makes it much harder to work with than other dataframe libraries. Adding this feature would drastically improve the package as an alternative to pandas, or to dplyr in R, for things like interactive debugging.

cpcloud commented 3 weeks ago

Not entirely opposed to an option for this but it requires someone else to do the work, and I'll review the PR/help get it over the finish line.

A trade-off to consider here is that this feature is likely to generate some new bug reports about performance, so there'll be extra work to do fielding those reports.

NickCrews commented 2 weeks ago

I have an idea on how this API could work, to unify it with .preview()

@cpcloud any thoughts on this API? Might be confusing, I can make a PR to make it concrete

NickCrews commented 3 days ago

@kdheepak @contang0 @lboller-pwbm you all chimed in so far, we would love your feedback on #10518

contang0 commented 2 days ago

Looks very good, I have nothing to add apart from what @lboller-pwbm already mentioned. The row count should ideally be visible without needing to scroll horizontally if a table is wide, i.e. it should be left-justified.

contang0 commented 2 days ago

By the way, sometimes I wish the backend name was also somehow indicated when inspecting a table. Sometimes I work with multiple backends in the same project and I have no idea whether I can join two tables before I try. I know it's probably a very niche problem, but something to keep in mind if more people have the same issue.

NickCrews commented 2 days ago

Do y'all have strong opinions on what the default behavior should be (I imagine since you are in this issue you are going to be a non-representative sample of the ibis user community)? Can you comment at all on what sort of workload you are typically doing, one of the below, or a different one?

  1. ibis.duckdb.connect("mydb.db").table("my_table") (I expect ~0 difference)
  2. ibis.duckdb.connect().read_parquet("t.pq") (I expect ~0 difference)
  3. ibis.duckdb.connect().read_csv("thousand_rows.csv") (I expect small difference)
  4. ibis.duckdb.connect().read_csv("billion_rows.csv") (I expect large difference)
  5. ibis.duckdb.connect().read_csv("thousand_rows.csv").some_expensive_computation() (I expect a difference, size depends on semantics of the function)

NickCrews commented 2 days ago

@cpcloud I think that PR is ready for your thoughts whenever you get the chance, thank you!

kdheepak commented 2 days ago

One of the major reasons I like using ibis is that its API is cleaner than pandas / polars etc. I can usually write a lot less code and get the same things done. Most of the data I work with on a day-to-day basis is in the "few thousand rows" magnitude, and I use the duckdb backend. I expect I'll have to scale up in the future on at least one of my projects, and I'm hoping ibis will make that trivial by changing the backend.

lboller-pwbm commented 2 days ago

I am more in the fifth bucket, currently - but I'm hesitant to make the default something that would slow down interactive mode a lot at scale. I think an option like the one suggested in https://github.com/ibis-project/ibis/issues/10231#issuecomment-2387951607 would be good, and/or letting the user set an environment variable to toggle between a fast and a detailed version of interactive mode as the default, depending on the project.

Do y'all have strong opinions on what the default behavior should be (I imagine since you are in this issue you are going to be a non-representative sample of the ibis user community)? Can you comment at all on what sort of workload you are typically doing, one of the below, or a different one?

  1. ibis.duckdb.connect("mydb.db").table("my_table") (I expect ~0 difference)
  2. ibis.duckdb.connect().read_parquet("t.pq") (I expect ~0 difference)
  3. ibis.duckdb.connect().read_csv("thousand_rows.csv") (I expect small difference)
  4. ibis.duckdb.connect().read_csv("billion_rows.csv") (I expect large difference)
  5. ibis.duckdb.connect().read_csv("thousand_rows.csv").some_expensive_computation() (I expect a difference, size depends on semantics of the function)

NickCrews commented 2 days ago

environment variable

@lboller-pwbm I've wanted this too. I started #10523 for any discussion of that.

contang0 commented 1 day ago

The way I access my data is usually similar to 1. I tend to start with a connection to Impala, and persist data to memory (duckdb) as soon as it's reasonably small. This is why seeing the backend name would also be useful for me. Sometimes I lose track of the step at which I persisted the data to a memtable.

Do y'all have strong opinions on what the default behavior should be (I imagine since you are in this issue you are going to be a non-representative sample of the ibis user community)? Can you comment at all on what sort of workload you are typically doing, one of the below, or a different one?

  1. ibis.duckdb.connect("mydb.db").table("my_table") (I expect ~0 difference)
  2. ibis.duckdb.connect().read_parquet("t.pq") (I expect ~0 difference)
  3. ibis.duckdb.connect().read_csv("thousand_rows.csv") (I expect small difference)
  4. ibis.duckdb.connect().read_csv("billion_rows.csv") (I expect large difference)
  5. ibis.duckdb.connect().read_csv("thousand_rows.csv").some_expensive_computation() (I expect a difference, size depends on semantics of the function)