Open MooooCat opened 10 months ago
CTGAN one-hot encodes every discrete column; if a column contains effectively random strings, the encoded matrix becomes enormous during vectorisation, leading to memory overflow.
To address this, we need to identify random discrete columns (things like home addresses, names, etc.) in DataProcessor and Metadata, and handle them statistically or generate them on-the-fly with tools like Faker.
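As a rough illustration of how such columns could be flagged (a sketch only; `detect_random_columns`, the threshold, and the column names are hypothetical, not part of sdgx), a discrete column can be treated as "random" when nearly every value is unique, which is exactly the case where one-hot encoding explodes:

```python
import uuid
import pandas as pd

def detect_random_columns(df: pd.DataFrame, uniqueness_threshold: float = 0.9) -> list:
    """Flag discrete columns whose values are almost all unique
    (e.g. names or addresses). One-hot encoding such a column would
    produce roughly one matrix column per row of data."""
    flagged = []
    for col in df.select_dtypes(include="object").columns:
        if df[col].nunique() / len(df) > uniqueness_threshold:
            flagged.append(col)
    return flagged

# "name" simulates a random-string column (in practice these might be
# Faker-style names or addresses); "gender" is a normal low-cardinality column.
df = pd.DataFrame({
    "name": [uuid.uuid4().hex for _ in range(1000)],
    "gender": ["M", "F"] * 500,
})
print(detect_random_columns(df))  # ["name"]
```

Columns flagged this way could then be excluded from one-hot encoding and filled with Faker-generated values at sampling time instead.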
In response to this problem, I will start designing the metadata and data processor, and post updates in this issue or the discussion section.
Problem
The current implementation does not work well when a large amount of real data is used to train a CTGAN model.
Because the entire DataFrame is loaded into memory for training, memory consumption becomes huge, which is not an elegant solution.
Proposed Solution
Fortunately, this refactoring gives sdgx the new DataLoader and the NDArryLoader, which is under development.
We can use these new data-related components to modify the data transformer, data sampler, and the CTGAN model.
Instead of loading all the data into memory at once, the data will be loaded in rows or columns (chunks) as needed, and those chunks will be used to train the model.
This will effectively reduce memory consumption and allow much larger datasets to be processed.
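A minimal sketch of the chunked idea (assumptions: `train_in_chunks` and `train_on_batch` are hypothetical names, and this uses plain pandas chunked CSV reading rather than the actual DataLoader API under development):

```python
import pandas as pd

def train_in_chunks(csv_path: str, train_on_batch, chunk_size: int = 10_000) -> None:
    """Stream the CSV in fixed-size row chunks instead of loading it whole,
    so peak memory is bounded by chunk_size rather than by file size."""
    for chunk in pd.read_csv(csv_path, chunksize=chunk_size):
        train_on_batch(chunk)  # e.g. one transformer/model update per chunk
```

The real DataLoader could additionally support column-wise (per-feature) access, which matters for fitting the data transformer one column at a time.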
Additional context
TBD