man-group / arctic

High performance datastore for time series and tick data
https://arctic.readthedocs.io/en/latest/
GNU Lesser General Public License v2.1

Speedup FrametoArray serializer for ChunkStore #909

Closed BaiBaiHi closed 3 years ago

BaiBaiHi commented 3 years ago

Summary

Speed up the FrametoArray serializer for ChunkStore by removing intermediate DataFrame construction. Most of the serialization time is spent on the following:

  1. Constructing intermediate DataFrames
  2. Setting the index (images omitted)

By removing the intermediate DataFrame construction (so we only use numpy arrays and construct the DataFrame at the very end) and constructing the index separately, we can speed the serialization up significantly.
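The idea can be illustrated with a minimal sketch that uses only pandas and numpy. It is not the code from this PR: the function names `assemble_via_frames` and `assemble_via_arrays`, and the representation of a chunk as a dict of column name to numpy array, are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def assemble_via_frames(chunks, index_col):
    """Baseline: build a DataFrame per chunk, concatenate, then set the index."""
    frames = [pd.DataFrame(chunk) for chunk in chunks]
    return pd.concat(frames, ignore_index=True).set_index(index_col)


def assemble_via_arrays(chunks, index_col):
    """Concatenate the raw arrays per column, build the index on its own,
    and construct a single DataFrame at the very end."""
    columns = list(chunks[0])
    merged = {c: np.concatenate([chunk[c] for chunk in chunks]) for c in columns}
    index = pd.Index(merged.pop(index_col), name=index_col)
    return pd.DataFrame(merged, index=index)


# Hypothetical input: two chunks, each a dict of column name -> numpy array.
chunks = [
    {"A": np.arange(0, 3), "B": np.arange(10, 13)},
    {"A": np.arange(3, 6), "B": np.arange(13, 16)},
]
slow = assemble_via_frames(chunks, "A")
fast = assemble_via_arrays(chunks, "A")
```

Both functions should produce the same frame. The second one skips the per-chunk DataFrame objects and the `set_index` call, which are the two costs listed above.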

Performance comparisons

No Index - Series

    import pandas as pd
    from arctic.serialization.numpy_arrays import FrametoArraySerializer
    df = pd.Series(range(100))
    a = FrametoArraySerializer().serialize(df)

  1. Single chunk: [image]
  2. Multiple chunks (data in list): [image]

With Index - Series

    df = pd.Series(range(100), index=pd.Index(range(100), name='A'))
    a = FrametoArraySerializer().serialize(df)

  1. Single chunk: [image]
  2. Multiple chunks (data in list): [image]

No Index - Multiple columns

    import numpy as np
    df = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)),
                      columns=list('ABCD'))
    a = FrametoArraySerializer().serialize(df)

  1. Single chunk (shape: (100, 4)): [image]
  2. Multiple chunks (data in list): [image]

With Index - Multiple columns

    df = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)),
                      columns=list('ABCD'))
    df = df.set_index(['A'])
    a = FrametoArraySerializer().serialize(df)

  1. Single chunk (shape: (100, 4)): [image]
  2. Multiple chunks (data in list): [image]
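The timings themselves were posted as images. One way to produce comparable numbers is a small `timeit` harness such as the sketch below. The helper `best_time` and the stand-in `workload` are assumptions for illustration, not part of arctic. With arctic installed, the workload could be replaced by a call to `FrametoArraySerializer().serialize(df)`.

```python
import timeit

import numpy as np
import pandas as pd


def best_time(fn, number=20, repeat=3):
    """Return the best per-call time in seconds over `repeat` runs."""
    return min(timeit.repeat(fn, number=number, repeat=repeat)) / number


# Same test data as the indexed multi-column case above.
df = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=list("ABCD"))
df = df.set_index(["A"])


def workload():
    # Stand-in workload (an assumption): extract each column as a numpy array.
    return {c: df[c].to_numpy() for c in df.columns}


elapsed = best_time(workload)
print(f"{elapsed * 1e6:.1f} us per call")
```

Taking the minimum over several repeats reduces noise from other processes, which matters when the frames are as small as the 100-row examples here.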

TomTaylorLondon commented 3 years ago

LGTM