hitsz-ids / synthetic-data-generator

SDG is a specialized framework designed to generate high-quality structured tabular data.
Apache License 2.0

Performance issues with GaussianCopula training on tabular data #194

Open jalr4ever opened 4 months ago

jalr4ever commented 4 months ago

Problem

When training on tabular data with millions of rows and hundreds of columns, the current GaussianCopulaSynthesizer consumes an excessive amount of memory (approximately 37 GB on macOS with an M3 Max).
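For a sense of scale, the size of one dense float64 copy of such a table can be estimated as sketched below. The dimensions (2 million rows, 300 columns) are hypothetical placeholders, not the shape of the test file in this issue, and `dense_float64_gib` is an illustrative helper, not part of SDG.

```python
def dense_float64_gib(rows: int, cols: int) -> float:
    """Return the size in GiB of one dense float64 array of shape (rows, cols)."""
    bytes_per_value = 8  # a float64 occupies 8 bytes
    return rows * cols * bytes_per_value / 2**30


# Hypothetical dimensions, chosen only for illustration.
one_copy = dense_float64_gib(2_000_000, 300)
print(f"one dense copy: {one_copy:.2f} GiB")
```

If a single copy is on the order of a few GiB, a peak in the tens of GiB would be consistent with several intermediate copies being held at once, though that remains a guess until the fit is profiled.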

Proposed Solution

Reduce resource consumption (for example, to around 4 GB of memory for the data case above) and make it possible to train on larger datasets while maintaining good performance.

Additional context

Reproduction Code & Files:

test file: test.csv

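# Import paths assume the sdgx package layout; adjust them for your installed version.
from sdgx.data_connectors.csv_connector import CsvConnector
from sdgx.data_loader import DataLoader
from sdgx.data_models.metadata import Metadata
from sdgx.models.statistics.single_table.copula import GaussianCopulaSynthesizer

data_connector = CsvConnector(path="./test.csv")
data_loader = DataLoader(data_connector)
loan_metadata = Metadata.from_dataloader(data_loader)
model = GaussianCopulaSynthesizer()
model.fit(loan_metadata, data_loader)
sampled_data = model.sample(10)
sampled_data.to_csv("./aaaa.csv", index=False)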
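To put a number on the memory used by the reproduction above, one option is to wrap the training call with Python's standard `tracemalloc` module. The following is a minimal sketch, not part of SDG: `run_with_peak_memory` and `allocate` are hypothetical helpers, and `allocate` only stands in for the real workload so the snippet runs without SDG installed.

```python
import tracemalloc
from typing import Any, Callable, Tuple


def run_with_peak_memory(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Run fn(*args, **kwargs) and return (result, peak traced memory in MiB)."""
    tracemalloc.start()
    try:
        result = fn(*args, **kwargs)
        # get_traced_memory() returns (current, peak) in bytes since start().
        _, peak_bytes = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak_bytes / 2**20


def allocate(n_bytes: int) -> int:
    """Stand-in workload: allocate a buffer and return its length."""
    buf = bytearray(n_bytes)
    return len(buf)


size, peak_mib = run_with_peak_memory(allocate, 50 * 2**20)
print(f"peak traced memory: {peak_mib:.1f} MiB")
```

For the reproduction, the call would be `run_with_peak_memory(model.fit, loan_metadata, data_loader)`. Note that `tracemalloc` only sees allocations made through Python's memory allocators, so memory allocated directly by native code may be undercounted and the process-level figure reported by the operating system can be higher.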