Histogram with large data (vaex instead of pandas)

plotly / plotly_express

Plotly Express - Simple syntax for complex charts. Now integrated into plotly.py!

https://plot.ly/python/plotly-express/

MIT License


Histogram with large data (vaex instead of pandas) #139

Closed maartenbreddels closed 4 years ago

maartenbreddels commented 5 years ago

Hi,

Great project! It was on my wishlist to try it out with vaex (an out-of-core DataFrame alternative to pandas), which I've done here: https://github.com/vaexio/vaex/pull/383

However, when I try the histogram:

h = px.histogram(df, x='VendorId')

I notice that plotly asks vaex for the full column (150 million rows) and adds it to the plotly Histogram object. Trying to send that to the browser fails (it crashes Chrome). I was expecting plotly express to do a groupby (which vaex would then handle instead of pandas) and send only the aggregated data. Is this a bug or a feature, and is it likely to change?
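A minimal sketch of the aggregation described above, assuming the column is categorical (as `VendorId` appears to be): count the rows per distinct value in Python and hand only the counts to the figure. It uses plain Python and a dict-shaped bar trace rather than vaex or `px.histogram`; the function name `aggregated_bar_trace` and the sample values are hypothetical.

```python
from collections import Counter


def aggregated_bar_trace(values):
    """Count occurrences per distinct value and return a bar trace as a plain dict."""
    counts = Counter(values)
    categories = sorted(counts)
    return {
        "type": "bar",
        "x": categories,
        "y": [counts[c] for c in categories],
    }


# Hypothetical sample standing in for the 150-million-row column.
vendor_ids = [1, 2, 2, 1, 2, 4, 1, 2]
trace = aggregated_bar_trace(vendor_ids)
```

The resulting dict holds one entry per distinct value instead of one per row, and could be passed to `plotly.graph_objects.Figure(data=[trace])`. With a dataframe library, the counting step would presumably be a groupby or value-counts call instead of `Counter`.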

Regards,

Maarten

jonmmease commented 5 years ago

Hi @maartenbreddels, thanks for checking out the project!

At the moment, this limitation is by design because the px.histogram function maps directly to the plotly.js histogram trace type, which does all of the binning on the JavaScript side.

I think it would be nice to have some kind of server_side option to perform the binning in Python. In that case, we would display the results using a bar trace. If we implemented all of the bin function options, we might be able to make this the default.
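As a rough illustration of what binning in Python and displaying the result as a bar trace could look like, here is a sketch that accumulates fixed-width bin counts over chunks of data and builds a dict-shaped bar trace. This is not an existing plotly option: `bin_counts`, `histogram_as_bar`, the bin range and the sample chunks are all hypothetical, and the binning convention (out-of-range values dropped, right edge included in the last bin) is an assumption that is not guaranteed to match what plotly.js does.

```python
def bin_counts(chunks, start, stop, nbins):
    """Accumulate fixed-width bin counts over an iterable of chunks."""
    width = (stop - start) / nbins
    counts = [0] * nbins
    for chunk in chunks:
        for v in chunk:
            if v < start or v > stop:
                continue  # assumption: silently drop out-of-range values
            # Assumption: the right edge (v == stop) falls into the last bin.
            idx = min(int((v - start) / width), nbins - 1)
            counts[idx] += 1
    return counts, width


def histogram_as_bar(counts, start, width):
    """Build a bar trace (plain dict) with one bar per bin, centred on the bin."""
    centers = [start + (i + 0.5) * width for i in range(len(counts))]
    return {"type": "bar", "x": centers, "y": counts, "width": width}


# Hypothetical chunks standing in for an out-of-core column read piece by piece.
chunks = [[0.5, 1.5, 2.5], [2.9, 3.0, 9.9], [10.0, -1.0]]
counts, width = bin_counts(chunks, start=0, stop=10, nbins=5)
trace = histogram_as_bar(counts, start=0, width=width)
```

Only `nbins` numbers reach the browser, regardless of how many rows were scanned. Setting the bar `width` to the bin width makes the bars touch, which approximates the look of a histogram trace.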

@nicolaskruchten interested in your thoughts on this when you're back in the office

maartenbreddels commented 5 years ago

Hi Jon!

OK, good to know. So that would only be the exception, I assume?

cheers,

Maarten

nicolaskruchten commented 5 years ago

Hi both, sorry for the delay in responding.

I agree that for this kind of thing, leveraging "server-side" Python is a clear winner. My philosophy with px initially had been that it would do as little work as possible server-side, so as to provide a more coherent wrapper around plotly.js, but this kind of thing is an obvious limitation of that approach. Adding a server_side flag to the aggregating functions (px.histogram, but also px.density_heatmap, px.density_contour and possibly even px.box) has two downsides. First, we would have to implement in Python something that is ideally identical to the behaviour of the underlying JS library, which is a challenge. Second, it's not clear what trace type to use in the output. If we do server-side aggregation in px.histogram, do we then produce a figure composed of bar trace types? That seems a bit weird...

nicolaskruchten commented 4 years ago

Migrated over to the main plotly repo: https://github.com/plotly/plotly.py/issues/2649