Performance issues with GaussianCopula training on tabular data

Problem

When training on tabular data with millions of rows and hundreds of columns, the current GaussianCopulaSynthesizer uses an excessive amount of memory (approximately 37 GB on macOS with an M3 Max).
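
For anyone who wants to compare numbers, one possible way to capture a peak-memory figure is sketched below. This is only an assumption about how such a measurement could be taken, not how the 37 GB figure above was obtained. The helper name `peak_memory_mib` is hypothetical, and `tracemalloc` only sees allocations made through Python's memory allocators, so the result can be lower than what the OS reports.

```python
import tracemalloc


def peak_memory_mib(fn, *args, **kwargs):
    """Run fn and return (result, peak traced memory in MiB).

    Hypothetical helper: only allocations routed through Python's
    allocators are counted, so native buffers may be missed.
    """
    tracemalloc.start()
    try:
        result = fn(*args, **kwargs)
        _, peak_bytes = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak_bytes / (1024 * 1024)


# Stand-in workload that allocates roughly 10 MiB.
_, peak = peak_memory_mib(lambda: bytearray(10 * 1024 * 1024))
print(f"peak: {peak:.1f} MiB")
```

In practice the stand-in lambda would be replaced by the call under test, for example a function that wraps the `fit` step of the reproduction.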

Proposed Solution

Reduce resource consumption (for example, to around 4 GB of memory for the data case above) and make it possible to train on larger datasets while maintaining good performance.
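
One direction that might help, offered only as a sketch and not as a description of how GaussianCopulaSynthesizer currently works: the correlation matrix a Gaussian copula needs can be accumulated chunk by chunk, so that only one chunk of rows is in memory at a time. The class name `StreamingCorrelation` is hypothetical, the input values are assumed to be already transformed to numeric (normal-score) columns, and the code uses only the standard library for portability; a real implementation would vectorize this with NumPy.

```python
import math


class StreamingCorrelation:
    """Accumulate column sums and cross-products over row chunks.

    Hypothetical sketch: memory use is O(n_cols^2), independent of the
    number of rows. The sum-of-products formula is simple but less
    numerically stable than a Welford-style update.
    """

    def __init__(self, n_cols):
        self.n_cols = n_cols
        self.n_rows = 0
        self.sums = [0.0] * n_cols
        self.cross = [[0.0] * n_cols for _ in range(n_cols)]

    def update(self, chunk):
        """Fold one chunk (an iterable of numeric rows) into the totals."""
        for row in chunk:
            self.n_rows += 1
            for i, xi in enumerate(row):
                self.sums[i] += xi
                cross_i = self.cross[i]
                for j, xj in enumerate(row):
                    cross_i[j] += xi * xj

    def correlation(self):
        """Return the Pearson correlation matrix as nested lists.

        Constant columns have zero variance and would raise
        ZeroDivisionError; they are not handled in this sketch.
        """
        n, k = self.n_rows, self.n_cols
        means = [s / n for s in self.sums]
        cov = [
            [self.cross[i][j] / n - means[i] * means[j] for j in range(k)]
            for i in range(k)
        ]
        std = [math.sqrt(cov[i][i]) for i in range(k)]
        return [
            [cov[i][j] / (std[i] * std[j]) for j in range(k)]
            for i in range(k)
        ]


# Usage: feed two chunks of a three-column table.
acc = StreamingCorrelation(n_cols=3)
acc.update([(1.0, 2.0, 4.0), (2.0, 4.0, 3.0)])
acc.update([(3.0, 6.0, 2.0), (4.0, 8.0, 1.0)])
corr = acc.correlation()
```

The trade-off is an extra pass structure around the data source: each chunk must be read, transformed, and folded in before the next is loaded, which costs some speed in exchange for a bounded memory footprint.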

Additional context

Reproduction Code & Files:

test file: test.csv

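# Import paths assume the sdgx package layout; adjust for your installed version.
from sdgx.data_connectors.csv_connector import CsvConnector
from sdgx.data_loader import DataLoader
from sdgx.data_models.metadata import Metadata
from sdgx.models.statistics.single_table.copula import GaussianCopulaSynthesizer

data_connector = CsvConnector(path="./test.csv")
data_loader = DataLoader(data_connector)
loan_metadata = Metadata.from_dataloader(data_loader)
model = GaussianCopulaSynthesizer()
model.fit(loan_metadata, data_loader)
sampled_data = model.sample(10)
sampled_data.to_csv("./aaaa.csv", index=False)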

hitsz-ids / synthetic-data-generator

Performance issues with GaussianCopula training on tabular data #194
