gretelai / gretel-synthetics

Synthetic data generators for structured and unstructured text, featuring differentially private learning.
https://gretel.ai/platform/synthetics

Attribute distribution not replicated [FR / BUG] #169

Open evasahlholdt opened 3 months ago

evasahlholdt commented 3 months ago

Hi Gretel team,

I am working with your implementation of DGAN.

Specifically, I am exploring how well DGAN handles generating temporal EEG. My objective is to create synthetic EEG data for a motor imagery task with 64 EEG channels (time-varying features), three conditions (left or right hand imagery, or rest) and 100 participants. I include condition and subject_ID as attributes, and voltage measurements in the 64 channels as features.

I am currently testing DGAN on a reduced dataset to reduce training time, which includes:

- 3 channels specifically relevant for motor imagery (features)
- 5 participants (5 classes)
- 3 conditions (3 classes)

My training feature array contains 360 examples of length 656 with 3 features: (360, 656, 3). My training attribute array contains string variables for subject_ID and condition: (360, 2). With five participants and three conditions, there are 15 possible attribute combinations, and in the reduced training set (and more or less in the full training set) they are perfectly balanced: 360 / 15 = 24 examples of EEG signal for each condition/subject combination.
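For reference, I check this balance with something like the following minimal sketch (np.unique over the rows of the attribute array):

import numpy as np

# Count training examples per unique attribute row (condition/subject pair)
combos, counts = np.unique(train_attributes, axis=0, return_counts=True)
for combo, count in zip(combos, counts):
    print(tuple(combo), count)  # expect 15 rows, each with a count of 24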

However, I have problems recreating the attribute distribution. Considering that my training data is balanced, I would expect DGAN to generate an approximately equal number of samples for each attribute combination. Instead, I see a large imbalance: generating 360 samples should yield approximately 24 samples per combination, but instead gives me e.g. 100 samples for one combination and 5 or fewer for others.

After looking through the documentation and source code, I have not been able to figure out why this happens, and as I understand it, it is not yet possible to specify the attribute distribution explicitly.

Below is the current implementation (note that I have not yet tuned hyperparameters; I am working with the reduced dataset for exactly that purpose):

train_features.shape: (360, 656, 3)

train_attributes.shape: (360, 2)

# Initialize, train, and generate from the DGAN model

from gretel_synthetics.timeseries_dgan.config import DGANConfig, Normalization
from gretel_synthetics.timeseries_dgan.dgan import DGAN

model = DGAN(DGANConfig(
    max_sequence_len=train_features.shape[1], # 656 
    sample_len=16, 
    batch_size=32, 
    apply_feature_scaling=True, 
    use_attribute_discriminator=True, 
    apply_example_scaling=False, # to mitigate issues with unrealistic feature ranges
    normalization=Normalization.MINUSONE_ONE, 
    attribute_loss_coef=1, 
    generator_learning_rate=1e-4,
    discriminator_learning_rate=1e-4,
    attribute_discriminator_learning_rate=1e-4,
    epochs=500,
))

# Train
model.train_numpy(
    attributes=train_attributes, 
    features=train_features,
)

# Generate
syn_attributes, syn_features = model.generate_numpy(360)

I verify that my generated data contains the same attribute values as the real data:

Unique subjects in val_data: ['S100' 'S45' 'S59' 'S82' 'S9']
Unique annotations in val_data: ['left' 'rest' 'right']
Unique subjects in syn_data: ['S100' 'S45' 'S59' 'S82' 'S9']
Unique annotations in syn_data: ['left' 'rest' 'right']

I also validate that my synthetic syn_attributes vector has the correct shape of (360, 2).

However, looking into which attribute combinations are present in syn_attributes, and how many of each, I get:

('left', 'S100'): 3
('left', 'S45'): 1
('left', 'S59'): 150
('left', 'S82'): 54
('left', 'S9'): 8
('rest', 'S100'): 1
('rest', 'S59'): 99
('rest', 'S82'): 31
('rest', 'S9'): 4
('right', 'S59'): 8
('right', 'S82'): 1
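(For reference, these counts come from a simple tally along these lines:)

from collections import Counter

# Tally the generated attribute combinations row by row
combo_counts = Counter(map(tuple, syn_attributes))
for combo, count in sorted(combo_counts.items()):
    print(combo, count)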

Not all combinations are generated (only 11 out of 15), and the counts are completely off. Moreover, which combinations are created, and in what numbers, changes with every new training run. I also see no improvement in the relative balance when generating many samples (e.g. 20,000).

I have tried:

Do you have any experience with a similar issue?

I don't see why the relatively small sample size should affect the attribute distribution when the data is balanced; however, could it be a reason for the poor performance in attribute generation?

And do you have thoughts on a relatively simple way to generate samples with a prespecified attribute distribution? (I saw you suggest something along these lines to someone else, and thought you might have pointers on how it could be approached.)
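The only simple workaround I have come up with so far is to oversample and then filter down to a target distribution. Below is a rough, untested sketch of that idea (the generate_balanced helper and its parameters are just for illustration):

import numpy as np
from collections import defaultdict

def generate_balanced(model, n_per_combo, n_combos=15, max_rounds=50):
    """Untested sketch: repeatedly oversample with generate_numpy and keep
    at most n_per_combo examples per attribute combination. Combinations
    the model never produces cannot be recovered this way."""
    kept = defaultdict(list)
    for _ in range(max_rounds):
        attrs, feats = model.generate_numpy(1000)
        for a, f in zip(attrs, feats):
            combo = tuple(a)
            if len(kept[combo]) < n_per_combo:
                kept[combo].append((a, f))
        if len(kept) == n_combos and all(
            len(v) == n_per_combo for v in kept.values()
        ):
            break
    pairs = [p for v in kept.values() for p in v]
    return np.stack([a for a, _ in pairs]), np.stack([f for _, f in pairs])

This is obviously wasteful, and since some combinations appear once or not at all in 360 generated samples, it could take many rounds, so an explicit conditioning option would still be much preferable.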

Any suggestions are welcome,

Eva

mckornfield commented 1 month ago

I looked through the other issue related to generating synthetic data from biosignals: https://github.com/gretelai/gretel-synthetics/issues/162

A few things stand out:

  1. With a sequence length like that, I would assume something might be going awry. Have you tried dividing your sequences into smaller windows? (See the sketch after this list.)
  2. The sample count Kendrick mentions as necessary in the other issue is in the 10k range, so I don't think the 360 examples you're working with will be sufficient.
  3. Since the hyperparameters are having no effect, larger sample sizes or varying the sequence length might be the way to go.
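For (1), the splitting could look something like this (untested sketch; the choice of 4 windows is arbitrary, it just needs to divide 656):

import numpy as np

# Untested sketch: split each 656-step example into 4 contiguous windows
# of 164 steps, turning (360, 656, 3) into (1440, 164, 3)
n_windows = 4
win_len = 656 // n_windows  # 164
short_features = train_features.reshape(-1, n_windows, win_len, 3).reshape(-1, win_len, 3)
# Repeat each attribute row once per window: (360, 2) -> (1440, 2)
short_attributes = np.repeat(train_attributes, n_windows, axis=0)

You'd then set max_sequence_len=164 and pick a sample_len that divides 164 (e.g. 4), since 164 isn't divisible by 16.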