kathrinse / be_great

A novel approach for synthesizing tabular data using pretrained large language models
MIT License

How to randomly select a column for preconditioning for each epoch? #40

Open BelenGarciaPascual opened 1 year ago

BelenGarciaPascual commented 1 year ago

As explained in the code documentation, when training/fine-tuning with the .fit() function, if no column of the tabular data is specified, the last column is used for preconditioning.

We have used several metrics from the SDMetrics library to compare the real California housing dataset with several synthetic tabular datasets generated by GReaT. From these comparisons, we noticed that this default preconditioning is not ideal: the synthetic values of that last column almost exactly match the real ones, while the remaining columns show much more variability.

We would really like to mitigate this effect by selecting a column at random, with every column having equal probability of being selected, and repeating this selection every time the data is revisited, i.e., at every epoch. That way, all columns would be used for preconditioning during the fitting process.

Is there any argument in GReaT() or in .fit() to specify this random selection of columns, switching every epoch? Or does one have to rewrite the code rather than use the be_great package directly?
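
To make the request concrete, here is a minimal sketch of what we can do today (the model name, file name and variable names are just illustrative): drawing the preconditioning column at random once before calling .fit(). What we are asking for is that this draw be repeated automatically at every epoch.

```python
import random

import pandas as pd
from be_great import GReaT

data = pd.read_csv("california_housing.csv")   # our real training data (path illustrative)

great = GReaT("distilgpt2", batch_size=32, epochs=50)

# Today: the preconditioning column can only be fixed once per .fit() call.
random_col = random.choice(list(data.columns))
great.fit(data, conditional_col=random_col)

# Desired: the random draw above happens internally at the start of every epoch.
```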

Many thanks in advance!

unnir commented 1 year ago

Hi @BelenGarciaPascual!

Unfortunately, our framework does not support this yet, but we plan to incorporate random preconditioning for each epoch in a future release.

If someone wants to contribute, we will be happy to merge a PR.

kontramind commented 1 year ago

Hi @unnir,

I'm working together with @BelenGarciaPascual. We have a workaround for the issue above. Basically, we compute the number of steps from the dataset size so that checkpoint saving is aligned with a single epoch. We then start a new epoch, switching the column we condition on each time. We have already seen improvements (using the distilgpt2 backbone) compared with SDV's GaussianCopula.

We are also planning to make a proper PR.

```python
from pathlib import Path
from shutil import rmtree

import pandas as pd
from be_great import GReaT

# Assumed setup (not shown in the original snippet): `base` is the HuggingFace
# model name, `llm` a short label used in file names, and `data` the real
# training DataFrame (California housing in our case; file name illustrative).
base = llm = "distilgpt2"
data = pd.read_csv("california_housing.csv")

batch_size = 32
steps = len(data) // batch_size                 # optimizer steps per epoch

epochs = [0, 1, 2, 3, 4, 5, 6, 7]
columns = data.columns

for epoch in epochs:
    for idx, column in enumerate(columns):
        print(f'{epoch=} -> {column=}')
        great = GReaT(base,                                        # Name of the large language model used (see HuggingFace for more options)
                      batch_size=batch_size,
                      epochs=epoch * len(data.columns) + idx + 1,  # Cumulative epoch count, so each .fit() call trains exactly one more epoch
                      save_steps=steps,                            # Save model weights every x steps (i.e. once per epoch)
                      logging_steps=steps,                         # Log the loss and learning rate every x steps
                      experiment_dir=f"aleks_{llm}_trainer",       # Directory where all intermediate checkpoints are saved
        )

        if epoch == 0 and idx == 0:
            trainer = great.fit(data, conditional_col=column)
        else:
            trainer = great.fit(data, conditional_col=column, resume_from_checkpoint=True)
            # Remove the checkpoint we just resumed from to keep disk usage bounded
            rmtree(Path(f"aleks_{llm}_trainer") / f"checkpoint-{epoch * len(data.columns) * steps + idx * steps}")

        great.save(f"aleks_california_{llm}")

        for path in Path(f"aleks_{llm}_trainer").iterdir():
            if path.is_dir():
                print(f'{path=}')
```

unnir commented 1 year ago

Cool!

Thank you for the update. I would recommend training the model longer, at least 10+ epochs, to get even better results.

Also, to speed up the training you can pass fp16=True to GReaT. It should make training at least 2 times faster.
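
For example, something along these lines (a sketch only; the model name and epoch count are illustrative, and the extra keyword argument is passed on to the underlying HuggingFace TrainingArguments):

```python
from be_great import GReaT

# fp16=True enables mixed-precision training (requires a CUDA GPU).
great = GReaT("distilgpt2",
              batch_size=32,
              epochs=10,        # 10+ epochs as recommended above
              fp16=True)        # roughly 2x faster training
great.fit(data)                 # `data` is the training DataFrame from above
```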

Divjyot commented 1 year ago

@unnir Just curious, how long does it take for you to fine-tune the model on your hardware? (I assume you are using a GPU.)

nphdang commented 7 months ago

Hi @BelenGarciaPascual and @unnir, I am interested in this discussion. However, as described in the paper, after converting each row to a textual encoding, GReaT permutes the sequence so that the column order is ignored. So I don't understand what you meant by saying the last column was used in training/fine-tuning the model. In my understanding, the last column is only used in the sampling phase if you don't specify a preconditioning column.

kontramind commented 7 months ago

Hi @nphdang,

You are correct: GReaT permutes the sequences on its own during fine-tuning, so our initial suggestion was wrong. However, varying the preconditioning column when sampling can still help the final result. For example, let's say you have 10 columns and you want a dataset of 1000 data points. One can iterate over the columns and use each feature to precondition the generation of 100 samples, as in the sketch below.
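
A minimal sketch of that loop, assuming a fitted model `great`, the real DataFrame `data`, and that GReaT.sample() accepts start_col / start_col_dist (a dict of relative frequencies for a categorical column, a list of observed values for a continuous one); all names here are illustrative:

```python
import pandas as pd

def start_dist(series: pd.Series):
    # Categorical column: dict of value -> relative frequency.
    # Numerical column: list of observed values to draw starting points from.
    if series.dtype == object:
        return series.value_counts(normalize=True).to_dict()
    return series.tolist()

n_total = 1000
columns = list(data.columns)
n_per_column = n_total // len(columns)      # e.g. 100 samples per column for 10 columns

parts = []
for column in columns:
    # Precondition the generation of this batch on the current column
    parts.append(
        great.sample(n_per_column,
                     start_col=column,
                     start_col_dist=start_dist(data[column]))
    )

synthetic = pd.concat(parts, ignore_index=True)
```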

nphdang commented 7 months ago

@kontramind thanks for the clarification. Yes, doing the permutation in the sampling phase is simpler; we just need to iterate over each column and set it as the preconditioning column. I tried this step and it slightly improved the downstream classification task.