ibis-project / ibis

the portable Python dataframe library
https://ibis-project.org
Apache License 2.0
4.54k stars 555 forks source link

feat: `explain` or similar API for inspecting plans #9276

Open NickCrews opened 1 month ago

NickCrews commented 1 month ago

Is your feature request related to a problem?

This is a bit broad, IDK if this should be a discussion.

I want to be able to profile queries. Specifically, which parts of the query are slow. This is tricky, because in ibis computation happens in two stages:

  1. expression building: This is always fast, since we aren't actually processing any data. But this is where I want to be able to assign blame, ie the table.distinct(on="my_id") call is responsible for 90% of the query runtime. Using timeit or other python profiling is useless here.
  2. runtime: This is where the big data is actually processed, and where the time actually happens.

I can currently do A/B testing, by just timing the whole run, but this is annoying to have to think about what to change, and then writing the change.

What is the motivation behind your request?

I am motived both as an end user, but also as a library author.

Describe the solution you'd like

Ideally every Operation could get associated with a run time, so for instance if you plot an expression graph you could see the runtime next to each node.

That seems very difficult though. Still very usable would be if we just got back the raw, non-machine readable output from eg EXPLAIN ANALYZE <QUERY>. This solution I don't think would even need to be implemented in ibis, but it would be ergonomic if it was a simple expr.profile() or ibis.profile(expr).

I'm not sure about backend support. I don't know, but I imagine most of the SQL backends support some sort of profiling. SQlite profiling looks very different from duckdb profiling.

What version of ibis are you running?

main

What backend(s) are you using, if any?

No response

Code of Conduct

cpcloud commented 1 month ago

Hey Nick!

This is a super interesting problem, but I think it's not feasible for us to correlate an ibis expression with a particular duration (or some proxy of that like instructions, or even database plan node) because the thing that the database executes often looks very different from the thing it was asked to execute.

It's one of the trade-offs associated with using SQL and/or a database: you generally get better performance at the cost of debugability.

That said, what do you think about a backend.explain(expr) method? We had this a while ago but it was removed due to lack of implementation across our backends.

This would at least allow you to get some insight into what the database thinks is happening.

NickCrews commented 1 month ago

what do you think about a backend.explain(expr) method?

I think that sounds nice, even if it returns teh raw string. However, to avoid needing to do a private method on a non-explicit-backend expression

e = ibis.read_csv().<sdfgdsfg>
e._find_backend().explain(e)

I would love to either (1 is my favorite)

  1. have ibis.explain(expr) in addition to backend.explain(expr)
  2. just have expr.explain() (though we don't want to fill up that namespace TOO much)
  3. have expr.explain() in addition to backend.explain(expr)

Possible things to support (duckdb for inspiration)