man-group / ArcticDB

ArcticDB is a high performance, serverless DataFrame database built for the Python Data Science ecosystem.
http://arcticdb.io

ArcticDB reads can be slow when reading many columns #1963

Open G-D-Petrov opened 3 weeks ago

G-D-Petrov commented 3 weeks ago

Describe the bug

At the moment, the post-processing step of read scales linearly with the number of columns. This is caused by the calls to make_block: with many columns (150k+), these calls take about half of the total read time.

Steps/Code to Reproduce

import cProfile

import numpy as np
import pandas as pd
from arcticdb import Arctic

# A single-row DataFrame with 175k float columns
N = 175000
df = pd.DataFrame(np.random.randn(1, N), columns=[f"col_{i}" for i in range(N)])

# Write it to a local LMDB-backed library
ac = Arctic("lmdb://test_wide_df?map_size=5GB")
lib = ac.get_library("test_wide_df", create_if_missing=True)
lib.write("test_wide_df", df)

# Profile the read; the make_block calls show up in the output
cProfile.run("lib.read('test_wide_df')")

Expected Results

The post-processing step should not take too much time.

OS, Python Version and ArcticDB Version

Python 3.10

Backend storage used

LMDB

Additional Context

No response