prutskov commented 4 years ago

System information

OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 18.04
Modin version: 0.8.0
Python version: 3.8.2

Describe the problem

The current implementations of `DataFrame.unstack` and `Series.unstack` are only slightly faster than `default_to_pandas`, and in the measurements below Modin is slower than pandas in every case.
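
For context, here is a minimal sketch of what `unstack` does to a frame with a two-level row index, using plain pandas. The frame and its labels are made up for illustration and are not part of the benchmark.

```python
import pandas

# Tiny frame with a two-level row index, the same kind of input the benchmark builds.
index = pandas.MultiIndex.from_tuples([(0, "a"), (0, "b"), (1, "a"), (1, "b")])
df = pandas.DataFrame({"col0": [1, 2, 3, 4], "col1": [5, 6, 7, 8]}, index=index)

# unstack() moves the innermost row level ("a"/"b") into the columns.
wide = df.unstack()

print(wide.shape)  # 2 rows (outer level) x 4 columns (2 columns x 2 inner labels)
print(list(wide.columns))
```

Each original column is fanned out once per inner label, so the result gets wider as the inner level grows. That is presumably why the wider frames in the measurements are the expensive ones.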

Source code / logs

Measurement results:

| DataFrame shape | Pandas time, s | Modin time, 112 cores, s | Modin time (default to pandas), 112 cores, s | Modin time, 56 cores, s | Modin time (default to pandas), 56 cores, s |
| --- | --- | --- | --- | --- | --- |
| 50000x16, ~2 MB | 0.27 | 0.86 | 1.53 | 0.54 | 1.01 |
| 50000x128, ~21 MB | 0.32 | 1.93 | 8.35 | 1.33 | 4.15 |
| 500000x16, ~29 MB | 0.43 | 4.33 | 3.92 | 2.65 | 2.53 |
| 500000x128, ~211 MB | 1.07 | 15.27 | 21.38 | 10.78 | 11.71 |
| 500000x256, ~419 MB | 1.77 | 27.66 | 39.06 | 17.09 | 21.98 |
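
To make the gap easier to read, the slowdown relative to pandas can be computed from the numbers above. This is just arithmetic on the reported timings (pandas column and the Modin 112-core column); the dictionary layout is an illustrative choice.

```python
# Timings copied from the measurements above, in seconds: (pandas, Modin on 112 cores).
timings = {
    "50000x16": (0.27, 0.86),
    "50000x128": (0.32, 1.93),
    "500000x16": (0.43, 4.33),
    "500000x128": (1.07, 15.27),
    "500000x256": (1.77, 27.66),
}

# Slowdown factor = Modin time / pandas time; values above 1 mean Modin is slower.
slowdown = {
    shape: modin_s / pandas_s for shape, (pandas_s, modin_s) in timings.items()
}

for shape, factor in slowdown.items():
    print(f"{shape}: Modin is {factor:.1f}x slower than pandas")
```

On these numbers the slowdown grows with frame size, from about 3x for the smallest frame to over 15x for the largest.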

Script to measure

```python
import numpy as np
import os
import pandas
from timeit import default_timer as timer

RAND_LOW = -100
RAND_HIGH = 100
N = 50000
M = 128
MULTILINE = False
TEST_FILENAME = os.path.abspath(
    f"int_dataset-{N},{M},{RAND_LOW},{RAND_HIGH},{MULTILINE}.csv"
)


def generate_data_file(filename, row_n, col_n, multiline_rows=False):
    # Random integer columns; with multiline_rows=True every 10th column
    # also contains multiline strings.
    data = {
        f"col{i}": np.concatenate(
            [
                np.concatenate(
                    [
                        ["some\nvery very very\nlong string\nwith many multilines"],
                        np.random.randint(RAND_LOW, RAND_HIGH, 9),
                    ]
                )
                for _ in np.arange(row_n // 10)
            ]
        )
        if (i % 10 == 0 and multiline_rows)
        else np.random.randint(RAND_LOW, RAND_HIGH, row_n)
        for i in np.arange(col_n)
    }
    print("dict generated!")
    df = pandas.DataFrame(data)
    print("dataframe created!")
    df.to_csv(filename)
    print("csv ready!")


def multiIndex_generator(df, axis=0):
    # Replace the row index (axis=0) or the columns with a two-level MultiIndex.
    if axis == 0:
        # 10 outer labels, each paired with len(df.index) / 10 inner labels.
        df.index = pandas.MultiIndex.from_tuples(
            [(j, i) for j in np.arange(10) for i in np.arange(len(df.index) / 10)]
        )
    else:
        df.columns = pandas.MultiIndex.from_tuples(
            [(0, i) for i in np.arange(len(df.columns))]
        )
    return df


if __name__ == "__main__":
    import modin.pandas as pd

    if not os.path.exists(TEST_FILENAME):
        generate_data_file(TEST_FILENAME, N, M, MULTILINE)

    md_df = multiIndex_generator(pd.read_csv(TEST_FILENAME))
    pd_df = multiIndex_generator(pandas.read_csv(TEST_FILENAME))
    print(
        f"DataFrame shape: ({N}, {M}) ~ {os.stat(TEST_FILENAME).st_size // (1024 * 1024)}MB, {pd.DEFAULT_NPARTITIONS} cores"
    )

    # repr() forces the result to be materialized before the timer stops.
    t1 = timer()
    res = repr(pd_df.unstack())
    print("PD unstack:", "{:.2f}".format(timer() - t1), "s")

    t1 = timer()
    res = repr(md_df.unstack())
    print("MD unstack:", "{:.2f}".format(timer() - t1), "s")
```
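
The script above times each call once, so a single noisy run can skew the numbers. One possible refinement, sketched here as a suggestion rather than as part of the original measurements, is to repeat each call and keep the fastest run. The helper name `best_of` and the stand-in workload are hypothetical.

```python
from timeit import default_timer as timer


def best_of(func, repeats=3):
    """Run func() `repeats` times and return the fastest wall-clock time in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = timer()
        func()
        best = min(best, timer() - start)
    return best


if __name__ == "__main__":
    # Stand-in workload; in the benchmark this could be
    # lambda: repr(md_df.unstack()) or lambda: repr(pd_df.unstack()).
    elapsed = best_of(lambda: sum(range(100_000)))
    print("best of 3:", "{:.4f}".format(elapsed), "s")
```

Taking the minimum rather than the mean reduces the influence of one-off costs such as worker start-up. Whether those costs should be excluded depends on what the benchmark is meant to show.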

prutskov commented 4 years ago

Additional performance information is described here

YarShev commented 4 years ago

modin-project / modin

Increase performance of `unstack` function #1975


2086