kathrinse / be_great

A novel approach for synthesizing tabular data using pretrained large language models
MIT License

Unable to sample #36

Closed vinay-k12 closed 12 months ago

vinay-k12 commented 1 year ago

Hi, I trained the model on tabular data with text columns and categorical columns of high cardinality.

The system is failing when creating synthetic data.

[screenshot: error raised during sampling]

Even after increasing `max_length` to 10,000, the same error persists. Any ideas on how to resolve this?
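
For reference, a minimal sketch of the workflow being described, assuming the library's usual `GReaT` fit/sample interface; the file name and hyperparameter values are placeholders, not the reporter's actual setup:

```python
import pandas as pd
from be_great import GReaT

df = pd.read_csv("my_table.csv")  # hypothetical file: ~120 mixed-type columns

# Fine-tune a pretrained LM on the textified table.
model = GReaT(llm="distilgpt2", epochs=50, batch_size=32)
model.fit(df)

# Sampling is where the error above is raised; max_length caps the number
# of tokens generated per synthetic row.
synthetic = model.sample(n_samples=1000, max_length=10_000)
```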

Madnex commented 1 year ago

How many epochs did you train for? This error suggests that the text generated by the model cannot be transformed back into the columnar format: instead of producing the expected text format, the model emits malformed text. I had the same issue with a more complicated dataset, and further fine-tuning fixed it.

vinay-k12 commented 1 year ago

Oh. I thought it was a `max_length` error, since GPT-2 has a token-length limit of 1024 (I think). I read somewhere that this is a limitation when training TableGPT.

Was it a large number of columns in your case as well? I have close to 120, including text, high-cardinality categorical features, and numerical features.
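
To check whether a row of 120+ columns even fits in GPT-2's 1024-token context, one can serialize a single row roughly the way GReaT does ("Column is value, ...", though the exact format may vary by version) and count tokens. A hedged sketch, not the library's own code:

```python
import pandas as pd
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
df = pd.read_csv("my_table.csv")  # hypothetical file

# Serialize one row in the "Column is value" style GReaT is based on.
row = df.iloc[0]
text = ", ".join(f"{col} is {val}" for col, val in row.items())

n_tokens = len(tokenizer(text)["input_ids"])
print(f"{n_tokens} tokens (GPT-2's context window is 1024)")
```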

Madnex commented 1 year ago

Well, it could also be related to the token limit of GPT-2, but in my experience with this library that would already fail during training. If you are able to train the model, you should also be able to sample from it.

In my setup I have only around 20 columns, but one column contains a lot of text, which blows up the token count considerably.

I also tried once with 100+ columns, and it failed during training due to the token-length limit. For debugging it also helps to use the CPU instead of CUDA, because you often get more helpful error messages.
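
A sketch of that debugging tip, assuming a model previously saved with `model.save(...)` and that `sample()` accepts a `device` argument, as in recent be_great versions:

```python
from be_great import GReaT

# Reload a previously trained model ("trained_model" is a placeholder path)
# and sample on the CPU to get clearer stack traces than under CUDA.
model = GReaT.load_from_dir("trained_model")
synthetic = model.sample(n_samples=10, max_length=400, device="cpu")
```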

unnir commented 1 year ago

Could you please provide more details on your hyperparameter choices? Usually training the model longer helps, e.g., with more epochs.
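
Concretely, training longer means raising the `epochs` argument at construction time; a sketch, with the value chosen purely for illustration:

```python
import pandas as pd
from be_great import GReaT

df = pd.read_csv("my_table.csv")  # hypothetical file, as in the earlier snippet

# More epochs give the model more chances to learn the exact
# "Column is value" output format it must reproduce at sampling time.
model = GReaT(llm="distilgpt2", epochs=200)
model.fit(df)
```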