hitsz-ids / synthetic-data-generator

SDG is a specialized framework designed to generate high-quality structured tabular data.
Apache License 2.0
3.27k stars 544 forks

[1.0.0] CTGAN Optimization #77

Open MooooCat opened 10 months ago

MooooCat commented 10 months ago

Problem

When a large amount of real data is used to train a CTGAN model, the current implementation does not work well.

Because all the data (a DataFrame) is loaded into memory during training, memory consumption becomes very high, which is not an elegant solution.

Proposed Solution

Fortunately, as part of this refactoring, sdgx provides the new DataLoader and the NDArrayLoader (under development).

We can use these new data-related components to modify the Data transformer, Data sampler, and CTGAN model.

Instead of loading all the data into memory at once, the data will be loaded in chunks (by rows or by columns) as needed and then used to train the model.

This will effectively reduce memory consumption and allow larger datasets to be processed.
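As a rough illustration of the chunked-loading idea, here is a minimal sketch using only the Python standard library. It is not the sdgx DataLoader API: the helper names `iter_chunks` and `streaming_mean` and the demo data are hypothetical. It only shows that a statistic can be computed while holding a single chunk of rows in memory.

```python
import csv
import io
from itertools import islice
from typing import Iterator, List, TextIO


def iter_chunks(handle: TextIO, chunk_size: int) -> Iterator[List[dict]]:
    """Yield lists of at most `chunk_size` rows from a CSV file handle."""
    reader = csv.DictReader(handle)
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk


def streaming_mean(handle: TextIO, column: str, chunk_size: int = 2) -> float:
    """Compute a column mean while keeping only one chunk in memory."""
    total, count = 0.0, 0
    for chunk in iter_chunks(handle, chunk_size):
        total += sum(float(row[column]) for row in chunk)
        count += len(chunk)
    return total / count


# A small in-memory CSV stands in for a large file on disk (hypothetical data).
demo = io.StringIO("age,city\n20,A\n30,B\n40,A\n50,C\n60,B\n")
chunk_sizes = [len(c) for c in iter_chunks(demo, 2)]
demo.seek(0)
mean_age = streaming_mean(demo, "age")
```

The same pattern would apply to fitting a transformer or drawing training batches: each step consumes one chunk and keeps only the aggregated state.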

Additional context

TBD

Wh1isper commented 10 months ago

CTGAN one-hot encodes all discrete columns. If a column contains random strings, the encoding produces a huge matrix during vectorisation, leading to memory overflow.
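To make the scale concrete, a back-of-the-envelope sketch: a dense one-hot matrix has one column per distinct value, so its size is rows × distinct values. The 4 bytes per cell (float32) and the row counts below are assumptions for illustration, not measurements from sdgx.

```python
def one_hot_bytes(n_rows: int, n_categories: int, bytes_per_cell: int = 4) -> int:
    """Size in bytes of a dense one-hot matrix (assumes 4-byte cells)."""
    return n_rows * n_categories * bytes_per_cell


# A low-cardinality column: 100,000 rows, 10 distinct values.
low = one_hot_bytes(100_000, 10)

# A column of random strings: nearly every row holds a distinct value.
high = one_hot_bytes(100_000, 100_000)
```

Under these assumptions `low` is about 4 MB, while `high` is about 40 GB, since the matrix grows quadratically when every row is unique.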

Based on this, we need to identify random discrete columns (for example home addresses, names, etc.) in DataProcessor and Metadata, and process them probabilistically or generate them on the fly using tools like Faker.
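One possible way to flag such columns is a unique-value ratio, sketched below. The function names, the 0.9 threshold, and the sample table are all hypothetical and not part of sdgx; `random_token` is a standard-library stand-in for a generator such as Faker.

```python
import random
import string
from typing import Dict, List, Sequence


def unique_ratio(values: Sequence[str]) -> float:
    """Fraction of values in the column that are distinct."""
    return len(set(values)) / len(values) if values else 0.0


def find_random_discrete(
    columns: Dict[str, List[str]], threshold: float = 0.9
) -> List[str]:
    """Return names of columns whose unique ratio meets the (assumed) threshold."""
    return [
        name for name, values in columns.items()
        if unique_ratio(values) >= threshold
    ]


def random_token(rng: random.Random, length: int = 8) -> str:
    """Stand-in generator; a tool like Faker could be used here instead."""
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))


# Hypothetical table: "city" repeats, "name" is unique per row.
table = {
    "city": ["A", "B", "A", "A", "B", "A", "B", "A", "A", "B"],
    "name": [f"user_{i}" for i in range(10)],
}
flagged = find_random_discrete(table)

# Flagged columns skip one-hot encoding and are generated at sampling time.
rng = random.Random(0)
synthetic_names = [random_token(rng) for _ in range(3)]
```

Here only `name` is flagged, so it would be excluded from the one-hot encoding and filled in by a generator when sampling.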

MooooCat commented 10 months ago

In response to this problem, I will start designing the metadata and data processor, and post updates in this issue or in the discussion section.