sdv-dev / SDV

Synthetic data generation for tabular data
https://docs.sdv.dev/sdv

Optimize PARSynthesizer's performance #1965

Open srinify opened 2 months ago

srinify commented 2 months ago

Problem Description

A number of SDV users have run into performance issues when using PARSynthesizer with their data. The issues usually manifest as out-of-memory errors (either regular RAM or CUDA memory). Other times, the model simply takes a very long time to train.

I'm creating this thread to collect all of these examples from the community so the SDV core team has the context they need to understand and improve the performance of PARSynthesizer.

For anyone using SDV PARSynthesizer, please add new examples of performance issues to this thread!

srinify commented 2 months ago

Reported Example 1

Out of regular memory error

https://github.com/sdv-dev/SDV/issues/1952 by @prupireddy

RuntimeError: [enforce fail at alloc_cpu.cpp:114] data. DefaultCPUAllocator: not enough memory: you tried to allocate 683656 bytes. 

"I find this particularly surprising given that I am running this on a machine with 128 GM RAM and I just restarted it."

Suggested Workaround

My recommendation would be to sample the data to reduce the footprint. You can either use fewer rows per sequence or fewer sequences overall. Start with a much smaller sample than you think you need (maybe 5% of your data) and then increase it by 5% each time to improve the data generated by the model.
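
For example, here is a minimal subsampling sketch. The subsample_sequences helper and the "id" sequence key column are illustrative assumptions (and df is your original sequential dataframe), not part of SDV:

def subsample_sequences(df, sequence_key, frac_sequences=0.05,
                        max_rows_per_sequence=None, random_state=0):
    """Keep a random fraction of the sequences and optionally truncate each one."""
    all_ids = df[sequence_key].drop_duplicates()
    keep_ids = all_ids.sample(frac=frac_sequences, random_state=random_state)
    out = df[df[sequence_key].isin(keep_ids)]
    if max_rows_per_sequence is not None:
        # keep only the first N rows of each remaining sequence
        out = out.groupby(sequence_key, group_keys=False).head(max_rows_per_sequence)
    return out

# start small (5% of the sequences, at most 50 rows each) and grow from there
small_df = subsample_sequences(df, sequence_key="id",
                               frac_sequences=0.05, max_rows_per_sequence=50)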

srinify commented 2 months ago

Reported Example 2

Out of CUDA memory error

https://sdv-space.slack.com/archives/C01GSDFSQ93/p1713451980542979 by Isaac (Slack)

Use Case: PAR for forecasting time series

Scale of data:

Attempted Workarounds:

Example Code (Srini):

import numpy as np
import pandas as pd

from sdv.metadata import SingleTableMetadata
from sdv.sequential import PARSynthesizer

# ID column: 50,000 sequences with 45 rows each
ids = np.arange(0, 50_000, 1)
ids = np.repeat(ids, 45)

# Sequence index column: ticks 0..44 within every sequence
ticks = np.arange(0, 45, 1)
ticks = np.tile(ticks, 50_000)

# Observations column: one random value per row
obs = np.random.normal(loc=5, scale=1, size=len(ids))

df = pd.DataFrame(
    {
        "id": ids,
        "ticks": ticks,
        "obs": obs,
    }
)

metadata = SingleTableMetadata()
metadata.detect_from_dataframe(df)
metadata.update_column(column_name='id', sdtype='id')
metadata.set_sequence_key(column_name='id')
metadata.set_sequence_index(column_name='ticks')

synthesizer = PARSynthesizer(metadata, verbose=True)
synthesizer.fit(df)

liuup commented 2 months ago

Bro, I recently ran into the problem from Example 1. The way I solved it was to change segment_size from the default to 5, 10, or larger, which reduces the computation time. I don't know if this will help you, but it works on my machine. My PARSynthesizer definition looks something like this:

"""     Step1:    Create the synthesizer    """
synthesizer = PARSynthesizer(
    metadata,
    cuda =  True,
    verbose = True,
    epochs = 512,
    segment_size = 5,
    sample_size = 20,
)

The explanation of segment_size is right here: https://docs.sdv.dev/sdv/sequential-data/modeling/parsynthesizer#:~:text=segment_size,into%20any%20segments. Hope this can help you.
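
As a rough way to check whether segment_size helps on your data, here is a hedged timing sketch. It assumes the df and metadata built in the reproduction script earlier in this thread, and uses a small epoch count only for the comparison:

import time

from sdv.sequential import PARSynthesizer

for segment_size in (None, 5):
    synthesizer = PARSynthesizer(
        metadata,
        epochs=8,                 # small epoch count, only to compare fit times
        segment_size=segment_size,
        verbose=False,
    )
    start = time.time()
    synthesizer.fit(df)
    print(f"segment_size={segment_size}: fit took {time.time() - start:.1f}s")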

JonathanBhimani-Burrows commented 2 months ago

Commenting here to support these suggestions. I run out of memory with 1 million rows / 10 features on a 40 GB GPU. Is there a reason there isn't a batch size parameter (or is there one that I missed)? Obviously you can always subsample the data, but this gets more complicated if it has to be done as part of a pipeline, especially if the data is severely unbalanced.
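
For the pipeline / imbalance concern, one option is to sample the same fraction of sequences from every group so rare groups survive the subsample. A minimal sketch, assuming a hypothetical stratification column "segment" and a sequence key column "id" in your dataframe df (none of these names come from SDV):

def stratified_sequence_sample(df, sequence_key, strata_col, frac=0.05, random_state=0):
    """Sample the same fraction of sequences from each stratum."""
    # one row per sequence, carrying the stratum label of that sequence
    seq_labels = df.drop_duplicates(subset=sequence_key)[[sequence_key, strata_col]]
    keep_ids = (
        seq_labels
        .groupby(strata_col)[sequence_key]
        .apply(lambda s: s.sample(frac=frac, random_state=random_state))
    )
    return df[df[sequence_key].isin(keep_ids)]

# e.g. keep 5% of the sequences from every segment before fitting PARSynthesizer
# small_df = stratified_sequence_sample(df, sequence_key="id", strata_col="segment")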