Summary
Speed up the FrametoArray serializer for ChunkStore by removing intermediate DataFrame construction.
Most of the serialization time is spent on the following:
Constructing intermediate DataFrames
Setting the index
By removing the intermediate DataFrame construction (working only with numpy arrays and constructing the DataFrame once, at the very end) and constructing the index separately, we can speed up serialization significantly.
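The difference between the two approaches can be sketched roughly as follows. This is a minimal illustration rather than the actual ChunkStore code: the function names `combine_via_frames` and `combine_via_arrays`, and the chunk layout (a dict mapping column name to numpy array), are assumptions made for this example.

```python
import numpy as np
import pandas as pd


def combine_via_frames(chunks, index_name):
    """Slower pattern: one DataFrame per chunk, set its index, then concat."""
    frames = [pd.DataFrame(chunk).set_index(index_name) for chunk in chunks]
    return pd.concat(frames)


def combine_via_arrays(chunks, index_name):
    """Faster pattern: stay in numpy and build the DataFrame only once."""
    columns = [name for name in chunks[0] if name != index_name]
    # Concatenate each column across chunks without creating DataFrames.
    data = {name: np.concatenate([chunk[name] for chunk in chunks])
            for name in columns}
    # Build the index separately, directly from the concatenated array.
    index = pd.Index(np.concatenate([chunk[index_name] for chunk in chunks]),
                     name=index_name)
    # Single DataFrame construction at the very end.
    return pd.DataFrame(data, index=index, columns=columns)


# Hypothetical chunk layout: each chunk maps column name -> numpy array.
chunks = [
    {'A': np.arange(0, 50), 'B': np.arange(0, 50) * 2},
    {'A': np.arange(50, 100), 'B': np.arange(50, 100) * 2},
]

slow = combine_via_frames(chunks, 'A')
fast = combine_via_arrays(chunks, 'A')
```

Both functions should yield the same frame. The second avoids the per-chunk DataFrame and `set_index` overhead, which is the cost this change targets.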
Performance comparisons
No Index - Series
import pandas as pd
from arctic.serialization.numpy_arrays import FrametoArraySerializer
df = pd.Series(range(100))
a = FrametoArraySerializer().serialize(df)
Single Chunk:
Multiple Chunks (data in list):
With Index - Series
df = pd.Series(range(100), index=pd.Index(range(100), name='A'))
a = FrametoArraySerializer().serialize(df)
Single Chunk:
Multiple Chunks (data in list):
No Index - Multiple columns
df = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)),
                  columns=list('ABCD'))
a = FrametoArraySerializer().serialize(df)
Single chunk (shape: (100, 4))
Multiple chunks (data in list):
With Index - Multiple columns
Single chunk (shape: (100, 4))
Multiple chunks (data in list):