kathrinse / be_great

A novel approach for synthesizing tabular data using pretrained large language models

An error has occurred: Breaking the generation loop! #38

Closed: FayzElRazaz closed this issue 9 months ago

FayzElRazaz commented 10 months ago

Hi guys,

After training a model to create synthetic data, I get the following message:

An error has occurred: Breaking the generation loop! To address this issue, consider fine-tuning the GReaT model for an longer period. This can be achieved by increasing the number of epochs. Alternatively, you might consider increasing the max_length parameter within the sample function. For example: model.sample(n_samples=10, max_length=2000) If the problem persists despite these adjustments, feel free to raise an issue on our GitHub page at: https://github.com/kathrinse/be_great/issues

I increased the max_length parameter and the number of epochs, but still get the issue.

Here is my code:

from be_great import GReaT
from sklearn.model_selection import train_test_split

X_train, X_test = train_test_split(df, train_size=0.01)

model = GReaT(llm='distilgpt2', batch_size=8, epochs=4)
model.fit(X_train)
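The sampling step itself is not shown above; a minimal sketch, following the example given in the error message (the n_samples and max_length values are illustrative, not a guaranteed fix):

# Sampling call, per the example in the error message; max_length caps the
# number of tokens generated per row, so long rows need a larger value.
synthetic_data = model.sample(n_samples=10, max_length=2000)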

I use only part of my initial dataset to reduce training time (the full dataset has 800K rows).

Has anyone else had this issue and found a way to solve it?

Thanks

unnir commented 10 months ago

Would it be possible to provide us with your data or a sample? How many columns (features) are in your dataset?

Regarding the hyperparameters: how long did you train the model, and for how many epochs?

unnir commented 10 months ago

Please try again. I've updated the code; it might fix your issue.

iamamiramine commented 1 month ago

I have the same issue. I am using the "Health" dataset. I've attached a screenshot of the first few rows of the dataset.

I trained the model for 1000 epochs. I have also trained models on larger datasets for fewer epochs, and none of them produced this error. What do you think I should do to fix the issue?

unnir commented 1 month ago

@iamamiramine thank you for attaching the screenshot. If I understood you correctly, you have no issues with other datasets but this one, right?

The problem could be that the rows of your dataset are too long once serialized. Do you have enough GPU memory (VRAM)? Another solution could be to simplify your dataset, for example by removing columns where every row has the same value, or something similar.
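A minimal sketch of both checks, assuming the dataset is a pandas DataFrame named df (the helper name drop_constant_columns is illustrative):

import pandas as pd
import torch

# Check available GPU memory (VRAM) on device 0; generating long rows
# can exhaust it.
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, VRAM: {props.total_memory / 1e9:.1f} GB")

# Drop columns in which every row holds the same value; constant columns
# add tokens to each serialized row without adding any signal.
def drop_constant_columns(df: pd.DataFrame) -> pd.DataFrame:
    constant_cols = [c for c in df.columns if df[c].nunique(dropna=False) <= 1]
    return df.drop(columns=constant_cols)

df = drop_constant_columns(df)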