plotly / plotly.py

The interactive graphing library for Python :sparkles: This project now includes Plotly Express!
https://plotly.com/python/
MIT License

Usage of a pandas.Index slows down figure generation of plotly #4250

Closed SergejKr closed 12 hours ago

SergejKr commented 1 year ago

Hello everyone,

I have noticed that generating a plotly figure takes ~20 times longer when using the index of a pandas DataFrame. Here is a minimal working example:

import pandas as pd
import numpy as np
from plotly.subplots import make_subplots
import plotly.graph_objects as go
from datetime import datetime
import time

size = 10000
df = pd.DataFrame({"col1": np.random.random(size),
                   "col2": np.random.random(size),
                   "col3": np.random.random(size),
                   "col4": np.random.random(size)}, index=pd.date_range(datetime(2020, 1, 1), periods=size))

# Preloads some plotly routines, which would otherwise be in the timing.
fig = make_subplots()

def basic1(data):
    start_time = time.perf_counter()
    # fig = go.Figure()
    fig = make_subplots()
    for col in data:
        fig.add_trace(go.Scattergl(x=data.index, y=data[col], name="ist_mpe"))
    # fig.show("browser")
    print(f"Elapsed time: {time.perf_counter() - start_time}")

# Using the pandas DataFrame index
basic1(df)

def basic2(data):
    start_time = time.perf_counter()
    # fig = go.Figure()
    fig = make_subplots()
    for col in data:
        fig.add_trace(go.Scattergl(x=data.index.values, y=data[col], name="ist_mpe"))
    # fig.show("browser")
    print(f"Elapsed time: {time.perf_counter() - start_time}")

# Manually changing the index to a numpy array
basic2(df)

On my machine I get the following output:

Elapsed time: 0.11946500000000004
Elapsed time: 0.0073545000000000416

The second approach, in which I manually convert the pandas Index to a numpy array, is more than 10 times faster. Internally plotly saves data as DataFrames, so it seems weird that using the index has such an impact.

Note that with a tz-aware DatetimeIndex, the above approach will first convert the dates to UTC+0.
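
To illustrate that note, here is a minimal sketch of what `.values` does to a tz-aware index. The time zone, the three-day range, and the variable names are assumptions chosen for the example and are not part of the script above.

```python
import numpy as np
import pandas as pd

# Hypothetical tz-aware index; "Europe/Berlin" is UTC+1 in January.
idx = pd.date_range("2020-01-01", periods=3, freq="D", tz="Europe/Berlin")

# .values drops the time zone and returns the UTC instants as datetime64.
utc_values = idx.values

# Removing the time zone first keeps the local wall-clock times instead.
local_values = idx.tz_localize(None).values

# Difference between the two representations for the first timestamp.
offset = local_values[0] - utc_values[0]

print(utc_values[0])
print(local_values[0])
print(offset / np.timedelta64(1, "h"), "hour(s) apart")
```

If the plot should show local times, stripping the time zone with `tz_localize(None)` before taking `.values` is one way to avoid the shift.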

Used Versions:

Package         Version
--------------- -------
DateTime        5.1
numpy           1.24.3
packaging       23.1
pandas          2.0.2
pip             23.0.1
plotly          5.15.0
python-dateutil 2.8.2
pytz            2023.3
setuptools      67.6.1
six             1.16.0
tenacity        8.2.2
tzdata          2023.3
wheel           0.40.0
zope.interface  6.0

MarcoGorelli commented 1 day ago

I think there's just an initial cost for the first plot you create. If you repeat each 3 times, you see:

# basic1 (with index)
Elapsed time: 0.013026493999859667
Elapsed time: 0.0064328460011893185
Elapsed time: 0.006467880999480258
# basic2 (without index)
Elapsed time: 0.005744940001022769
Elapsed time: 0.006017134999638074
Elapsed time: 0.005981356000120286

I think this can probably be closed then?

SergejKr commented 1 day ago

Hi. In the script above I included a call to reduce the initial cost:

# Preloads some plotly routines, which would otherwise be in the timing.
fig = make_subplots()

I just tried the script again, first by simply executing everything a second time. In that case I get the same behaviour as before:

basic1(df)
basic2(df)
basic1(df)
basic2(df)

Elapsed time: 0.15516949999999952
Elapsed time: 0.012888099999997848
Elapsed time: 0.14383830000002717
Elapsed time: 0.012077400000009675

I also tried %timeit (commented out the print).

%timeit basic1(df)
144 ms ± 5.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit basic2(df)
12 ms ± 1 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

In all cases basic2 is faster than basic1 by about a factor of 10. There must be something else slowing my calculation down, or some difference in our systems, e.g., the OS (I use Win 10).

MarcoGorelli commented 1 day ago

thanks @SergejKr - I just tried again, and it looks like this is solved in the new release

Plotly 5.24.1:

# with index
Elapsed time: 0.07854408899947884
Elapsed time: 0.07771884500107262
# without index
Elapsed time: 0.005710362998797791
Elapsed time: 0.00547197899868479

Plotly 6.0.0rc0:

# with index
Elapsed time: 0.028512741999293212
Elapsed time: 0.006303914000454824
# without index
Elapsed time: 0.007833848998416215
Elapsed time: 0.007394547999865608

I find that if I swap basic2 with basic1, then the first plot to be produced always takes longer. So I think it makes more sense to compare the minimums of several runs of each. If we do that with the pre-release, the issue looks solved.
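
A rough sketch of the "minimum of several runs" comparison, using only the standard library. The helper `best_of` and the two stand-in workloads are hypothetical; in this thread they would be replaced by `lambda: basic1(df)` and `lambda: basic2(df)` with the `print` calls commented out.

```python
import timeit


def best_of(func, repeats=5, number=1):
    """Return the fastest per-call time in seconds over several repeats."""
    timings = timeit.repeat(func, repeat=repeats, number=number)
    return min(timings) / number


# Stand-in workloads (assumptions for illustration only).
def workload_a():
    return sum(range(10_000))


def workload_b():
    return sum(range(1_000))


best_a = best_of(workload_a)
best_b = best_of(workload_b)

print(f"workload_a best of 5: {best_a:.6f} s")
print(f"workload_b best of 5: {best_b:.6f} s")
```

Taking the minimum rather than the mean keeps one-off costs, such as the slower first plot, out of the comparison.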

If you fancy trying out the pre-release, you can install it with pip install -U --pre plotly. If you find and report an issue so it can be fixed ahead of the final release, you're a star ⭐

SergejKr commented 1 day ago

Hello again @MarcoGorelli. I just tried the pre-release. It solves the performance issue for me.

basic1(df)
basic2(df)
basic1(df)
basic2(df)

Elapsed time: 0.021018300000378076
Elapsed time: 0.016133499999796186
Elapsed time: 0.02315120000002935
Elapsed time: 0.020329700000274897

Case closed.

MarcoGorelli commented 1 day ago

yay thanks!

gonna ping @LiamConnors on this one then

LiamConnors commented 12 hours ago

Thanks @MarcoGorelli and @SergejKr! Will close this one out