pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs
Other
29.86k stars 1.92k forks source link

Discuss: LazyFrame.plot? #13339

Open MarcoGorelli opened 9 months ago

MarcoGorelli commented 9 months ago

I'm just breaking this off from https://github.com/pola-rs/polars/pull/13238

Here's the comments made so far:

@MarcoGorelli :

Doing .collect within a plotting function sounds potentially like a footgun, and a risk that people re-trigger the same expensive IO function over and over again? I'd suggest, at least to start, to only have .plot on DataFrame, so people are expected to collect before plotting

@MarcSkovMadsen

But if plotting requires running some operations for example if using 'by' argument, making boxplot, histogram or using datashader, would you not want the plotting library to control when things are collected?

(I'm not the expert)

@MarcoGorelli

I'd suggest moving incrementally:

  1. start with DataFrame.plot (based on comments/reactions so far, fairly uncontroversial?)
  2. then, discuss LazyFrame.plot later, as I think that's more complicated (see https://github.com/pola-rs/polars/issues/5508#issuecomment-1857646174 for an example of how it's possible to collect-shoot yourself in the foot). I'd like to see at least some discussion of collect-footguns in the context of plotting before adding LazyFrame.plot. But let's save that for a future PR / discussion if it's OK

@ritchie46

then, discuss LazyFrame.plot later

Yes.. that's a big no-no for me.

@jbednar

I don't have experience with using Polars myself, but .collect sounds like it might be conceptually similar to Dask .compute()? For Dask we keep it lazy all the way through to plotting so that tools like Datashader can do the actual rendering in a fully distributed fashion, with chunks of each plot rendered to an image or array in parallel and only the final rendered chunk needing to cross the wire out of each remote worker. Datashader doesn't currently support Polars, but I'd hope it will some day, and so it would be good to choose an approach here that anticipates having that work smoothly.

@MarcoGorelli

Adding DataFrame.plot now doesn't preclude adding some LazyFrame.plot solution later on, possibly with some safeguards (e.g. a "I know what I'm doing" kind of flag)

@Hoxbro

I would make one suggestion of adding a plot accessor to the LazyFrame, with an error or warning which says plotting is not supported for LazyFrame, please call .collect() beforehand.


I did discuss on API kind of like

(
    pl.scan_parquet(file)
    .plot.line(x='x', y='y', by='by')
    .collect()
)

with Ritchie, so that predicate pushdown can happen and you only read from the parquet file the columns you need to make the plot.

There is a risk though that people will repeatedly make cells like this and so re-trigger the scan_parquet part multiple times, and plots in particular tend to incentivize interactive development.

But, for running jobs which create reports, or for plotting larger-than-memory datasets, this could unlock value.

One to think about. This would require some precursor work in hvPlot before it could be added to Polars

mcrumiller commented 9 months ago

I think this is not worth the effort. If someone wants to create a plot, just materialize and then plot, and it seems like it would be a lot of work for a probably not-common case.

MarcSkovMadsen commented 9 months ago

Remember that the plots are not static. They are interactive if you select Bokeh or Plotly backend. They might even contain widgets to filter the data shown. This enables browsing through data that are larger than can be shown in a browser or even can be hold in memory.

When you use .plot the plots are not materialised, they are just configurations/ definitions. As a user, you might continue changing the color, the kind of plot, filter to a subsection of data.

The plot is not materialized before its the output of a cell in a notebook, .save is run, .show is run etc.

I dont understand why its risky not to materialize explicitly before or after .plot. When .plot is run you are no longer in Polars land. You are in the land of the plot framework. Its should have the chance to control things.

MarcoGorelli commented 9 months ago

Thanks for explaining

What I'm concerned about is the risk that someone does

df = (
    pl.scan_parquet(...)
    .group_by(...)
    .#more expensive operations
)

and then, in the next cell of their notebook,

df.plot.line(x='x', y='y')

The plot is shown, and then they realise that, actually, they want to colour by plant species, and so they do

df.plot.line(x='x', y='y', by='by')

in the next cell.

If within .plot a collect was called, then like this, the user would be re-triggering the whole DAG, including scan_parquet, group_by, and the other expensive operations

Does this make sense / seem like a valid concern? Or would there be a way for hvplot to change the plot to be coloured by plant species once the first plot has already been drawn, without having to re-trigger the whole DAG?

mkleinbort commented 9 months ago

Just to reply to:

What I'm concerned about is the risk that someone does...

This is a common situation, and I think we should trust users.

It is possible that any given user will make this mistake once or twice, but if the computation truly is slow, they will soon figure out they can first collect the frame and then tinker with their plot.

That said, I have no idea how the plotting frameworks could be enabled to use Polars' processing capabilities - this seems like a strong entangling of the libraries that would require updates with every API change, but I remain open minded.

MarcoGorelli commented 9 months ago

Thanks for your input!

I'd like to think that there's a solution somewhere in between

but don't yet know exactly what it would be. Happy to discuss more and have a think, maybe we could have a group call. It's good to take our time with this one

What impresses me about hvplot is that they currently do

self._data.select(columns).collect()

for their hvplot accessor - this is exactly the kind of code which allows for the ~predicate~ projection pushdown to happen (as opposed to collecting upfront and then selecting columns)

philippjfr commented 9 months ago

for their hvplot accessor - this is exactly the kind of code which allows for the predicate pushdown to happen (as opposed to collecting upfront and then selecting columns)

In the long run I'd like to push things even further down, i.e. HoloViews should implement a full Polars interface, that keeps things lazy until the last possible moment. This would allow things like histogram calculations, bar plot aggregations, and ideally datashader support to never materialize the full data in memory.

As for what to do about people accidentally generating plots and are infeasible either because they trigger huge amounts of computation or a silly amount of categories, that's a problem we have to find a more general solution for at the hvPlot level. The unfortunate problem that we've always had is that if the data is huge then even a len(df) or a df[col].unique() call can significantly slow things down and always paying that cost is annoying. In the end it's something that we have to do though and we'll just have to very carefully consider where to add these checks.

stinodego commented 8 months ago

I've thought about this a bit, and it's similar to https://github.com/pola-rs/polars/issues/13928

Materializing might be unnecessarily expensive if we're just trying to show some aggregations in a plot.

With that in mind, adding LazyFrame.plot is probably a good idea, if lazy computation is actually supported by the plotting backend. If the backend just does .collect().plot(...), then I would be against adding this.

MarcoGorelli commented 8 months ago

I think https://github.com/holoviz/holoviews/issues/5939 maybe should be considered a necessary precursor. Because with that, the collect could be done a lot further down (rather than currently where it's done just before converting to pandas, so at best you save reading a few columns. Better than nothing, and nice that hvplot does this, but before being available in polars.LazyFrame itself, the lazy support should probably be complete)