plotly / plotly.py

The interactive graphing library for Python :sparkles: This project now includes Plotly Express!
https://plotly.com/python/

A `validate=False` option for `graph_objects` and `px` Figures? #1812

Open · michaelbabyn opened this issue 4 years ago

michaelbabyn commented 4 years ago

There's already an issue outlining the effect that graph_objects validation has on plot generation time. Users can bypass this performance hit by building figures as plain dicts instead of graph_objects and then displaying the plot with `plotly.offline.iplot(fig, validate=False)`, or, if they are creating graphs in Dash, they can forgo the plotly.py library altogether and just use a dict in their Graph component's `figure` argument.
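A minimal sketch of that dict-based workaround, assuming a Jupyter notebook session (the data here is just a placeholder):

```python
import plotly.offline

plotly.offline.init_notebook_mode()

# Build the figure as plain dicts rather than graph_objects, so nothing is
# validated while the figure is constructed.
fig = {
    "data": [{"type": "scatter", "x": [1, 2, 3], "y": [4, 1, 2]}],
    "layout": {"title": {"text": "No construction-time validation"}},
}

# Skip validation at display time as well.
plotly.offline.iplot(fig, validate=False)
```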

This solution can greatly improve the performance of Dash apps, but it means that Dash users with expensive graphs have to choose between px/plotly.py's convenient update methods and optimally fast code.

I wonder if a way to turn off validation, especially in Dash apps, would help Dash users get the best of both worlds.

cc @matthewchan15

emmanuelle commented 4 years ago

Also related to https://community.plot.ly/t/plotting-large-number-of-graphs/35907.

emmanuelle commented 4 years ago

To be checked: can we do this and still keep the magical underscore methods?

Also possible: a half-way point where we would disable validation of data arrays only.

Note that the "import" time is a big part of the lag when developing.
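For reference, the "magic underscore" shorthand mentioned above lets nested properties be set with underscore-separated keyword arguments, and it relies on plotly.py knowing the property schema:

```python
import plotly.graph_objects as go

fig = go.Figure()

# Magic underscore shorthand...
fig.update_layout(title_font_size=24)

# ...is equivalent to the fully nested form:
fig.update_layout(title=dict(font=dict(size=24)))
```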

parksj10 commented 4 years ago

Any update on this? You certainly have my +1. I'm using large data sets with datashader and it's taking seconds to validate. I'll likely have to retrofit my code with the dict approach :(

nicolaskruchten commented 4 years ago

@parksj10 can you confirm you're seeing performance issues with plotly version 4.7 or higher? We made a number of performance improvements in 4.7, so I just want to make sure :)

parksj10 commented 4 years ago

@nicolaskruchten running plotly 4.8.1, I've attached a cProfile below; you can see that half the figure generation time is spent validating. In case you're interested, I've also attached the cProfile `.dat` file. Let me know if I can do anything else to help or provide other information. I think it would be rather difficult to create a low-complexity working example from my app, but perhaps @michaelbabyn's examples could be useful in this regard.

[Screenshot: cProfile output (Screen Shot 2020-06-25 at 7.41.04 PM)]

[Attachment: temp.dat.zip]
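A minimal sketch of how this kind of cProfile can be captured, assuming a plotly express figure stands in for the real (much larger, datashader-backed) data:

```python
import cProfile
import pstats

import numpy as np
import pandas as pd
import plotly.express as px

# Placeholder data; the real app uses far larger datasets.
df = pd.DataFrame({"x": np.random.rand(1_000_000), "y": np.random.rand(1_000_000)})

profiler = cProfile.Profile()
profiler.enable()
fig = px.scatter(df, x="x", y="y")  # validation happens during construction
profiler.disable()

stats = pstats.Stats(profiler).sort_stats("cumulative")
stats.print_stats(20)          # validator calls show up near the top
stats.dump_stats("temp.dat")   # writes a .dat file like the one attached above
```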

nicolaskruchten commented 4 years ago

Thanks! This is something we should fix, and we’d appreciate any help :)

ndrezn commented 1 year ago

I'm running into this, a few years later 🙂. This causes major issues when working with e.g. choropleth maps with large GeoJSON files, where you will end up with giant JSON blobs that certainly do not need to be validated.

I imagine this is a pretty common issue for folks working with charts with many points, and I had no idea this was even a thing until today. It'd be great at least to document this behaviour or make people more aware of it until it's possible to disable validation. Maybe even on https://plotly.com/python/webgl-vs-svg/?

alexcjohnson commented 1 year ago

I like the idea of a three-level approach: full validation (current behavior), top-level validation (don’t dig into data arrays or nested objects like GeoJSON), and no validation.

ndrezn commented 1 year ago

(Want to note as well that I'm seeing roughly 1 second of validation time per MB of object. With GeoJSON, we often see blobs of 60 MB or more, which just destroys your app's performance.)

Having the top-level validation option seems perfect!

nicolaskruchten commented 1 year ago

So independently of the validation issue, if the GeoJSONs are static, you should always load them from assets in a Dash app, for caching purposes. Basically just pass in the URL rather than the GeoJSON blob.
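A minimal sketch of that approach, assuming the file has been saved as `assets/counties.geojson` in the Dash app (Dash serves that folder at `/assets/`) and using the fact that the Choropleth trace's `geojson` property accepts a URL string as well as a dict; the locations and values below are hypothetical:

```python
import plotly.graph_objects as go

fig = go.Figure(
    go.Choropleth(
        geojson="/assets/counties.geojson",  # URL string: the browser fetches and caches it
        locationmode="geojson-id",
        featureidkey="id",
        locations=["06037", "36061"],        # hypothetical feature ids (FIPS codes)
        z=[10, 20],
    )
)
```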

nicolaskruchten commented 1 year ago

> Having the top-level validation option seems perfect!

Yes, of course, although the last time we tried, we were unable to make it work :)

ndrezn commented 1 year ago

@nicolaskruchten -- yes, I'm mostly able to get around this issue by using OperatorTransform from Dash Extensions and combining that with using objects to define my Dash apps. Adding to assets/ would make it even better, though... great idea.

My main concern here is that this isn't intuitive, and it's also not intuitive that you can boost performance of figures in Dash apps with a large number of points just by switching how they are defined (which is why it'd be great to at least see this behaviour documented).

ndrezn commented 1 year ago

(cc @red-patience / @LiamConnors on that last point maybe)

hannahker commented 1 year ago

Throwing my support behind this one! Even if it takes some time to add a `validate=False` param, in the meantime it would be really helpful to have documentation alerting people that this can be a bottleneck in chart performance and that they can work around it by creating the dict directly.

Both this trick and passing data as a static asset URL have massively improved the performance of my graph, and I wouldn't have known to do either of these things if I hadn't been pointed towards this issue.
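For anyone landing here, a minimal sketch of the dict trick in a Dash app (assuming Dash 2.x; the app structure and data are placeholders):

```python
from dash import Dash, dcc, html

# Plain-dict figure: plotly.py's graph_objects validation never runs.
figure = {
    "data": [{"type": "scattergl", "x": list(range(100_000)), "y": list(range(100_000))}],
    "layout": {"title": {"text": "Plain-dict figure"}},
}

app = Dash(__name__)
app.layout = html.Div([dcc.Graph(figure=figure)])

if __name__ == "__main__":
    app.run(debug=True)
```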

cc @red-patience

bmaranville commented 1 year ago

I think I have a related issue affecting `subplots.make_subplots`, where the time to execute increases non-linearly with the number of plots. For a 20x20 grid of plots it takes 14 seconds, and for a 21x21 grid it takes 18 seconds, for example. This is for an empty figure created with `make_subplots`, e.g.

```python
from plotly.subplots import make_subplots

%time fig = make_subplots(rows=20, cols=20)
```

From profiling, it is spending the vast majority of its time in the `_ret` function of `basedatatypes.py`, and all of the time in that function is spent in `find_closest_string`. I think that is because it is pre-calculating an error message for a missing key, which is related to the validation. From what I can see in the profiling, there would be a >90% speedup if validation could be disabled.

EDIT: I think I will make a new issue for this: see #4100

nicolaskruchten commented 1 year ago

Thanks for that profiling! We could probably speed things up by only computing error strings when we know there's an error...
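A purely hypothetical illustration of that idea (not plotly.py's actual internals): defer the closest-string search until a lookup has actually failed, rather than pre-computing the suggestion for every property access.

```python
import difflib

VALID_PROPS = {"x", "y", "mode", "name", "marker", "line"}

def get_prop(props: dict, key: str):
    try:
        return props[key]
    except KeyError:
        # Only pay for the "did you mean ...?" search once we know there is an error.
        matches = difflib.get_close_matches(key, VALID_PROPS, n=1)
        hint = f" Did you mean {matches[0]!r}?" if matches else ""
        raise KeyError(f"Invalid property {key!r}.{hint}") from None
```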