octoenergy / timeserio

Better `keras` models for time series and beyond
MIT License

Speed up batch generation for tall data frames #27

Closed ig248 closed 3 years ago

ig248 commented 3 years ago

This can be literally a million times faster for large dataframes, as the `.iloc` row slice is zero-copy, even if we perform the column sub-selection after the `.iloc`.
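In other words, the change amounts to swapping the order of the two indexing steps in the batcher. A minimal sketch of the idea (hypothetical function and argument names, not the actual timeserio code):

import pandas as pd

def iter_batches(df: pd.DataFrame, columns, batch_size: int):
    """Yield `batch_size`-row batches of `columns` from `df`."""
    for start in range(0, len(df), batch_size):
        # old order: df[columns].iloc[start:start + batch_size]
        # copies the selected columns for the *whole* frame on every batch;
        # new order: row-slice first (typically a view), then select columns
        # on the small batch only
        yield df.iloc[start:start + batch_size][columns]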

Update

@ali-tny regarding wide dataframes, I have not managed to find an example where the new change is slower. e.g.:

import numpy as np
import pandas as pd

n_rows, n_cols = (10_000, 100_000)
n_selected_columns = 10_000
df = pd.DataFrame(np.ones((n_rows, n_cols)))

# random subset of column labels to select (may contain repeats)
columns = np.random.choice(range(n_cols), n_selected_columns)

# column selection first: copies the selected columns for all rows
%time df[columns].iloc[:10]
# row slice first: cheap, the column selection then only touches 10 rows
%time df.iloc[:10][columns]

yields something like:

CPU times: user 1.84 s, sys: 5.21 s, total: 7.05 s
Wall time: 8.01 s
CPU times: user 4 ms, sys: 2.87 ms, total: 6.87 ms
Wall time: 6.49 ms
ali-tny commented 3 years ago

It looks like view vs copy in pandas is a bit of a quagmire - the documentation doesn't even seem to guarantee that iloc returns a view:

Outside of simple cases, it’s very hard to predict whether it will return a view or a copy (it depends on the memory layout of the array, about which pandas makes no guarantees)

Playing around, it looks like it depends on whether the dataframe has multiple dtypes in it (which is presumably fairly common).
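One rough way to poke at this (an illustrative check added here for clarity, not from the original discussion, and possibly version-dependent):

import numpy as np
import pandas as pd

homogeneous = pd.DataFrame(np.ones((1_000, 4)))

# single dtype: the row slice typically shares memory with the original,
# i.e. .iloc gave us a view here
print(np.shares_memory(homogeneous.iloc[:10].values, homogeneous.values))

mixed = homogeneous.assign(label="x")  # object column -> multiple blocks

# mixed dtypes: .values itself consolidates into a fresh array, so this
# check is inconclusive about whether the slice was a view
print(np.shares_memory(mixed.iloc[:10].values, mixed.values))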

So: can you confirm this doesn't degrade performance for wide DFs? Or perhaps there's a way we could do this in a single indexing operation. Alternatively, if neither of the above holds, I'd probably be happy to change it anyway, since doing the column selection first was probably a pretty arbitrary decision in the first place.
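For reference, both selections can be done in a single call with positional indexing on both axes; a sketch with made-up example data, not code from this PR:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.ones((1_000, 50)), columns=[f"c{i}" for i in range(50)])
columns = ["c3", "c7", "c11"]

# resolve column labels to positions once, then index rows and columns in one .iloc call
col_positions = df.columns.get_indexer(columns)
batch = df.iloc[0:10, col_positions]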

ig248 commented 3 years ago

This is actually as convoluted as you say :D I am guessing that one of the issues I encountered is that df[df.columns] can be very slow - a common case, since the dataframe often contains only the columns we need in the batches. I will try to add a few benchmarks to the PR.
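If that is indeed the dominant case, one option (a hypothetical guard, not something in this PR) would be to skip the selection when it is a no-op:

import pandas as pd

def select_columns(df: pd.DataFrame, columns) -> pd.DataFrame:
    # skip the potentially expensive df[columns] copy when the frame already
    # contains exactly the requested columns in the requested order
    if list(df.columns) == list(columns):
        return df
    return df[columns]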

ali-tny commented 3 years ago

Also, the use case where rows >>>>> columns is probably vastly more common (especially given that the batching is done by row), so the wide-DF case is unlikely to matter too much.