pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.62k stars 17.91k forks source link

Performance issue when replacing a sting column with integer #26889

Closed rc-eddy closed 2 months ago

rc-eddy commented 5 years ago

Code Sample

import numpy as np
import pandas as pd
from string import ascii_uppercase

df = pd.DataFrame(columns=list(ascii_uppercase),
                  data=pd.util.testing.rands_array(10, (10 ** 6, 26)))
arr = np.random.randint(1, 10, 10 ** 6)
%%time
df['I'] = arr
# Wall time: 2.46 s
%%time
df['J'] = arr.astype(object)
# Wall time: 69 ms

Problem description

For some reason, replacing string columns with integers in large dataframes seems extremely slow.

Expected Output

Same for both cases, e.g. Wall time: 69 ms

Output of pd.show_versions()

INSTALLED VERSIONS ------------------ commit: None python: 3.5.6.final.0 python-bits: 64 OS: Windows OS-release: 10 machine: AMD64 processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel byteorder: little LC_ALL: None LANG: None LOCALE: None.None pandas: 0.23.4 pytest: 3.5.1 pip: 19.0.2 setuptools: 39.1.0 Cython: 0.28.2 numpy: 1.15.2 scipy: 1.1.0 pyarrow: None xarray: None IPython: 6.5.0 sphinx: 1.7.4 patsy: 0.5.0 dateutil: 2.7.3 pytz: 2018.4 blosc: None bottleneck: 1.2.1 tables: 3.4.3 numexpr: 2.6.5 feather: None matplotlib: 3.0.0 openpyxl: 2.5.3 xlrd: 1.1.0 xlwt: 1.3.0 xlsxwriter: 1.0.4 lxml: 4.2.1 bs4: 4.6.0 html5lib: 1.0.1 sqlalchemy: 1.2.7 pymysql: None psycopg2: None jinja2: 2.10 s3fs: None fastparquet: None pandas_gbq: None pandas_datareader: None
TomAugspurger commented 5 years ago

Can you profile the slow path to see where time is spent? I'd recommend snakeviz and line-profiler.

rc-eddy commented 5 years ago

Apparently most time is spent in simply deleting the column before replacing it.

TomAugspurger commented 5 years ago

So the time is spent in NumPy deleting the ndarray?

rc-eddy commented 5 years ago

Apparently so, but simply deleting an ndarray is faster - on my machine, about 50ms simply deleting one generated by pd.util.testing.rands_array(10, 10 ** 6) and about 0.5s for calling np.delete(data, 10, 1) after data = pd.util.testing.rands_array(10, (10 ** 6, 26)).

TomAugspurger commented 5 years ago

Make sure that the deletes are equivalent. IIUC, we're calling np.delete on a (n_columns, n_rows) ndarray, which may not be efficient.

-> self.values = np.delete(self.values, loc, 0)
(Pdb) self.values.shape
(26, 1000000)
(Pdb) loc
array([8])
mroeschke commented 3 years ago

These look fairly equivalent now in terms of performance. Could use an benchmark

In [1]: import numpy as np
   ...: import pandas as pd
   ...: from string import ascii_uppercase
   ...:
   ...: df = pd.DataFrame(columns=list(ascii_uppercase),
   ...:                   data=pd.util.testing.rands_array(10, (10 ** 6, 26)))
   ...: arr = np.random.randint(1, 10, 10 ** 6)
/Users/matthewroeschke/pandas-mroeschke/pandas/util/__init__.py:15: FutureWarning: pandas.util.testing is deprecated. Use the functions in the public API at pandas.testing instead.
  import pandas.util.testing

In [2]: %timeit df['I'] = arr
3.09 ms ± 1.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [3]: %timeit df['J'] = arr.astype(object)
11.2 ms ± 491 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
trevorkask commented 3 years ago

take

trevorkask commented 3 years ago

I have tried to use different methods of replacing the string column but coercing has consistently given me the fastest wall times.

trevorkask commented 3 years ago

I have tried to use different methods of replacing the string column but astype has consistently given me the fastest wall times. Or maybe I'm missing the whole point

import numpy as np import pandas as pd from string import ascii_uppercase

df = pd.DataFrame(columns=list(ascii_uppercase), data=pd.util.testing.rands_array(10, (10 6, 26))) arr = np.random.randint(1, 10, 10 6)

%%time df['I'] = arr

Wall time: 837 ms

%%time df['L'] = df.infer_objects()

Wall time: 5.17 s

%%time df['J'] = arr.astype(object)

Wall time: 171 ms

%%time df['K'] = pd.to_numeric(arr)

Wall time: 1.59 s

%%time pd.to_numeric(df['M'],errors='coerce') df['M']=arr

Wall time: 1.49 s