Closed rc-eddy closed 2 months ago
Can you profile the slow path to see where time is spent? I'd recommend snakeviz and line-profiler.
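For reference, a minimal profiling sketch using the standard-library cProfile (the frame, array, and column name here are illustrative stand-ins, not the reporter's actual data):

```python
import cProfile
import io
import pstats

import numpy as np
import pandas as pd

# Toy stand-ins for the issue's DataFrame and replacement array.
df = pd.DataFrame({"I": ["x" * 10] * 1000})
arr = np.arange(1000)

profiler = cProfile.Profile()
profiler.enable()
df["I"] = arr  # the slow path under investigation
profiler.disable()

# Print the 10 most expensive calls by cumulative time.
stats = pstats.Stats(profiler, stream=io.StringIO())
stats.sort_stats("cumulative").print_stats(10)
```

snakeviz and line-profiler give the same picture interactively and per line, respectively.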
Apparently most time is spent in simply deleting the column before replacing it.
So the time is spent in NumPy deleting the ndarray?
Apparently so, but simply deleting an ndarray is faster: on my machine, deleting one generated by pd.util.testing.rands_array(10, 10 ** 6) takes about 50 ms, while calling np.delete(data, 10, 1) after data = pd.util.testing.rands_array(10, (10 ** 6, 26)) takes about 0.5 s.
Make sure that the deletes are equivalent. IIUC, we're calling np.delete on a (n_columns, n_rows) ndarray, which may not be efficient.
-> self.values = np.delete(self.values, loc, 0)
(Pdb) self.values.shape
(26, 1000000)
(Pdb) loc
array([8])
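A small sketch of what the pdb session above shows: the block is stored as a (n_columns, n_rows) ndarray, so dropping one frame column is np.delete(..., axis=0). Sizes are scaled down from the issue's 10 ** 6 rows so this runs quickly:

```python
import numpy as np

# Scaled-down stand-in for the (26, 1_000_000) block seen in pdb.
n_rows = 100_000
data = np.arange(26 * n_rows).reshape(26, n_rows)

# np.delete never works in place: it allocates a new array and copies
# every surviving element, so the cost is O(data.size) whichever axis
# is used.
out = np.delete(data, 8, axis=0)
assert out.shape == (25, n_rows)

# Row 8 of the result is what used to be row 9.
assert out[8, 0] == data[9, 0]
```

So even an "equivalent" delete pays for a full copy of the remaining ~25 million elements, which is consistent with the timings above.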
These look fairly equivalent now in terms of performance. Could use a benchmark:
In [1]: import numpy as np
...: import pandas as pd
...: from string import ascii_uppercase
...:
...: df = pd.DataFrame(columns=list(ascii_uppercase),
...: data=pd.util.testing.rands_array(10, (10 ** 6, 26)))
...: arr = np.random.randint(1, 10, 10 ** 6)
/Users/matthewroeschke/pandas-mroeschke/pandas/util/__init__.py:15: FutureWarning: pandas.util.testing is deprecated. Use the functions in the public API at pandas.testing instead.
import pandas.util.testing
In [2]: %timeit df['I'] = arr
3.09 ms ± 1.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [3]: %timeit df['J'] = arr.astype(object)
11.2 ms ± 491 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
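The same comparison can be reproduced outside IPython with timeit; this sketch uses a smaller frame than the issue and a plain object array in place of the deprecated pd.util.testing.rands_array helper:

```python
import timeit
from string import ascii_uppercase

import numpy as np
import pandas as pd

# Smaller than the issue's 10 ** 6 rows so the snippet runs quickly.
n = 10_000
df = pd.DataFrame(
    np.full((n, 26), "aaaaaaaaaa", dtype=object),
    columns=list(ascii_uppercase),
)
arr = np.random.randint(1, 10, n)

# Compare assigning an int64 array vs. pre-casting it to object dtype.
t_int = timeit.timeit(lambda: df.__setitem__("I", arr), number=20)
t_obj = timeit.timeit(lambda: df.__setitem__("J", arr.astype(object)), number=20)
print(f"int assignment:    {t_int:.4f}s")
print(f"object assignment: {t_obj:.4f}s")
```

Absolute numbers will differ by machine and pandas version; the point is only the relative cost of the two assignment paths.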
take
I have tried different methods of replacing the string column, but astype has consistently given me the fastest wall times. Or maybe I'm missing the whole point.
import numpy as np
import pandas as pd
from string import ascii_uppercase

df = pd.DataFrame(columns=list(ascii_uppercase),
                  data=pd.util.testing.rands_array(10, (10 ** 6, 26)))
arr = np.random.randint(1, 10, 10 ** 6)

%%time
df['I'] = arr

%%time
df['L'] = df.infer_objects()

%%time
df['J'] = arr.astype(object)

%%time
df['K'] = pd.to_numeric(arr)

%%time
df['M'] = arr
pd.to_numeric(df['M'], errors='coerce')
Code Sample
Problem description
For some reason, replacing string columns with integers in large dataframes seems extremely slow.
Expected Output
Same for both cases, e.g.
Wall time: 69 ms
Output of
pd.show_versions()