kathrinse / be_great

A novel approach for synthesizing tabular data using pretrained large language models
MIT License

Issue with running Great on breast cancer dataset #22

Closed: ChristinaK97 closed this issue 1 year ago

ChristinaK97 commented 1 year ago

Hello, I've encountered an issue while running GReaT on the breast cancer dataset from sklearn. The training process completes without any issues, but when I attempt to generate samples with the trained model, the script runs for more than 15 minutes without returning any samples, even for very small n_samples values (the progress bar also appears to be frozen). The same happens with the Iris dataset. I find this behavior peculiar, since GReaT ran as expected on the California housing dataset. Have you come across this issue before, or do you have any suggestions on how to handle it?

Here is my script:

!pip install be-great
!pip install --upgrade accelerate
!pip uninstall -y transformers accelerate
!pip install transformers accelerate

from be_great import GReaT
import pandas as pd
from sklearn.datasets import load_breast_cancer

realData = load_breast_cancer(as_frame=True).frame
print(realData)
model = GReaT(llm='distilgpt2', batch_size=16, epochs=1, save_steps=400000)
model.fit(realData)
synthetic_data = model.sample(n_samples=10)
synthetic_data.head()
unnir commented 1 year ago

Hi,

Thank you for reporting the issue. I can confirm that the new version of the huggingface package somehow affects the sampling.

I'm currently checking it.

unnir commented 1 year ago

Update:

For the Iris dataset, GReaT did great.

I just increased the number of epochs to 10; the batch_size can also be increased.

model = GReaT(llm='distilgpt2', batch_size=64, epochs=10, save_steps=400000)

Please verify.
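For reference, a minimal end-to-end sketch of that Iris run with the parameters above (the dataset loading and print lines are my own additions, not from this thread):

from be_great import GReaT
from sklearn.datasets import load_iris

# Load the Iris dataset as a pandas DataFrame
iris = load_iris(as_frame=True).frame

# Same configuration as above: more epochs and a larger batch size
model = GReaT(llm='distilgpt2', batch_size=64, epochs=10, save_steps=400000)
model.fit(iris)

synthetic_data = model.sample(n_samples=10)
print(synthetic_data.head())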

unnir commented 1 year ago

Okay, it has nothing to do with the new huggingface version; however, the accelerate issue is quite annoying.

To sample, please increase max_length as well as the number of epochs:

model = GReaT(llm='distilgpt2', batch_size=16, epochs=100, save_steps=400000)
model.fit(realData)
synthetic_data = model.sample(n_samples=10, max_length=2000)

Explanation: Since the breast cancer dataset contains a substantial number of features, the textual representation of a single sample is relatively lengthy, so the LLM needs a larger number of tokens to produce a complete row; with the default setting, sampling can appear to stall. Increasing the max_length parameter helps.
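If it helps, here is a rough way to estimate how many tokens a single row needs, so you can pick a max_length with some headroom. This is not part of the be_great API; it simply tokenizes a "column is value" style encoding of one row (roughly mirroring the textual representation described in the GReaT paper) with the distilgpt2 tokenizer:

from transformers import AutoTokenizer
from sklearn.datasets import load_breast_cancer

df = load_breast_cancer(as_frame=True).frame
tokenizer = AutoTokenizer.from_pretrained('distilgpt2')

# Encode the first row as "column is value" phrases joined by commas
row_text = ', '.join(f'{col} is {val}' for col, val in df.iloc[0].items())

# Count the tokens the model would need for one complete row
n_tokens = len(tokenizer(row_text)['input_ids'])
print(f'~{n_tokens} tokens for one row')  # with 31 columns this is already sizable

Choose max_length comfortably above this estimate, e.g. max_length=2000 as in the snippet above.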

Hope it works for you, please ping if you have any other issues.

ChristinaK97 commented 1 year ago

Thank you for your prompt response and all your help in resolving the issue! Your suggestions worked perfectly, and I was able to run GReaT on both datasets by increasing the number of epochs and the max_length value for the breast cancer dataset. Keep up the great work!

unnir commented 1 year ago

Happy to read that!

If you have any other issues or questions, feel free to open an issue :)