innobi / pantab

Read/Write pandas DataFrames with Tableau Hyper Extracts
BSD 3-Clause "New" or "Revised" License

Question on writing to .hyper - exponentially longer write times? #380

Open skyth540 opened 2 hours ago

skyth540 commented 2 hours ago

Describe the bug
As I append more and more to a hyper, the write time to do so goes up and up. I have a large folder of parquet files, each with roughly the same amount of data/size. I iterated through the folder, and as the hyper file got larger and larger from appending the data, the write times per file went from about a minute to over 20 minutes. Is this expected? I'm curious as to why this happens.

To Reproduce

import glob
import os

import polars as pl
import pantab as pt

folder_path = r'G:\PATH\Parquet'
hyper_path = r'G:\PATH\test.hyper'
params = {"default_database_version": "1"}

# hist_prdc_ref, hist_prd_ref, and hist_mrkt_ref are reference tables loaded elsewhere
parquet_files = glob.glob(f"{folder_path}/*.parquet")

for counter, file_path in enumerate(parquet_files, start=1):
    print(f"({counter} / {len(parquet_files)}) Processing {os.path.basename(file_path)}:")
    hist_fct = (
        pl.scan_parquet(file_path)
        .join(hist_prdc_ref, on='PRODUCT KEY', how='inner')
        .join(hist_prd_ref, on='Period Key', how='inner')
        .join(hist_mrkt_ref, on='Market Key', how='inner')
        .with_columns(pl.col(pl.Float32).cast(pl.Float64))
    )

    pt.frame_to_hyper(hist_fct.collect(), hyper_path, table='table', table_mode='a', process_params=params)

Expected behavior
Writing takes the same amount of time regardless of initial hyper size.

WillAyd commented 2 hours ago

If you were to just continually overwrite the file does the time increase at all?

To ensure that pantab can perform an "atomic" write, the append mode actually copies the existing hyper file to the temp folder, tries to append, and then restores from the temp folder if the append fails. It is possible that the time to copy that backup file is what causes the runtime growth
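
A minimal sketch of that copy/append/restore pattern, for illustration only; the helper name and the do_append callback are placeholders, not pantab's actual implementation (which lives in _writer.py):

import shutil
import tempfile
from pathlib import Path

def append_with_backup(hyper_path, do_append):
    hyper_path = Path(hyper_path)
    with tempfile.TemporaryDirectory() as tmp_dir:
        backup = Path(tmp_dir) / hyper_path.name
        shutil.copy(hyper_path, backup)  # this backup copy grows with the hyper file, so it gets slower each iteration
        try:
            do_append(hyper_path)  # insert the new rows into the existing file
        except Exception:
            shutil.copy(backup, hyper_path)  # restore the pre-append state if the append fails
            raise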

WillAyd commented 2 hours ago

FWIW, instead of looping in Python you could also just write all of your data through pyarrow's RecordBatchReader. Something like:

import pyarrow as pa
import pyarrow.dataset  # the dataset submodule must be imported explicitly
import pantab as pt

ds = pa.dataset.dataset(folder_path)
rdr = pa.RecordBatchReader.from_batches(ds.schema, ds.to_batches())
pt.frame_to_hyper(rdr, "example.hyper", table="test")

The downside to this approach is that you may also need to provide an explicit schema to the dataset call (assuming you can't use the Hyper database version that supports floats), and the joins will have to happen later. But from a size/volume perspective it's an easy way to iterate and avoid memory exhaustion.
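
As a hedged sketch of passing an explicit schema to the dataset call: the column names and types below are assumptions about the parquet files, and whether the scan casts float32 columns to float64 for you depends on your pyarrow version, so verify against your data.

import pyarrow as pa
import pyarrow.dataset

# Hypothetical column names/types; the measure column is declared as float64
# so the resulting batches match what Hyper database version "1" can store.
schema = pa.schema([
    ("PRODUCT KEY", pa.int64()),
    ("Period Key", pa.int64()),
    ("Market Key", pa.int64()),
    ("sales", pa.float64()),
])
ds = pa.dataset.dataset(folder_path, schema=schema)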

skyth540 commented 1 hour ago

with table_mode = 'w', the writing takes the same time for each file.

With recordbatchreader, when would I do the joins? I also need to do some more complex manipulation like pivoting for some projects

WillAyd commented 1 hour ago

> with table_mode = 'w', the writing takes the same time for each file.

Hmm OK that's interesting. I would also be open to adding a keyword that allows for non-atomic appends to avoid copying the file. As far as performance is concerned, you can experiment with that in your installation if you remove anything that has to do with tmp_db here:

https://github.com/innobi/pantab/blob/9a18d6cdded68758a0a72bd27c444de87e3a28b1/src/pantab/_writer.py#L115

Maybe a keyword like atomic=True would be good in the .to_hyper signatures? atomic=False would be faster, but in case of data corruption along the way you would lose your database.
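
Purely as a sketch of that proposal (the atomic keyword does not exist in pantab today):

# Hypothetical call if the proposed keyword were added; atomic=False would skip
# the backup copy and make appends faster at the cost of crash safety.
pt.frame_to_hyper(df, hyper_path, table="table", table_mode="a", atomic=False)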

> With recordbatchreader, when would I do the joins?

I'm not sure the RecordBatchReader itself allows you to join during iteration; if it does, that exceeds my knowledge.

You could write all the tables the way they are and leave it to the Hyper database to do the joins:

pt.frames_to_hyper({"data": rdr, "prdc_ref": hist_prdc_ref, ...}, "example.hyper")
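
One hedged way to do the join afterwards is to query the written tables with SQL via pantab's frame_from_hyper_query; the table and column names here are assumptions carried over from the snippets above.

# Assumes tables named "data" and "prdc_ref" sharing a "PRODUCT KEY" column;
# adjust identifiers to your actual schema.
joined = pt.frame_from_hyper_query(
    'SELECT * FROM "data" d JOIN "prdc_ref" p ON d."PRODUCT KEY" = p."PRODUCT KEY"',
    "example.hyper",
)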