Closed ChristinaK97 closed 1 year ago
Hi,
Thank you for reporting the issue. I can confirm that somehow the new version of the huggingface package influenced the sampling.
I'm currently checking it.
Update:
For the Iris dataset, GReaT did great.
I just increased the number of epochs to 10; the batch_size can also be increased:
model = GReaT(llm='distilgpt2', batch_size=64, epochs=10, save_steps=400000)
Please verify.
Okay, it has nothing to do with the new huggingface version; however, the accelerate issue is quite annoying.
To sample successfully, please increase max_length as well as the number of epochs:
model = GReaT(llm='distilgpt2', batch_size=16, epochs=100, save_steps=400000)
model.fit(realData)
synthetic_data = model.sample(n_samples=10, max_length=2000)
Explanation: Because the breast cancer dataset contains a substantial number of features, the textual representation of a single sample is relatively long, so the LLM needs to generate more tokens per sample. Increasing the max_length parameter accounts for this.
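To get an intuition for why max_length matters, here is a minimal sketch that estimates how long one row becomes once serialized as text. The "column is value" encoding and the feature_i column names are assumptions for illustration (the encoding follows the GReaT paper's description; the library's exact format may differ), and whitespace splitting is only a crude lower bound on the GPT-2 subword token count:

```python
# Rough check of how long one tabular row becomes as text.
# The "<column> is <value>" encoding is an assumption based on the
# GReaT paper; the library's internal format may differ slightly.

def row_to_text(row: dict) -> str:
    """Serialize one tabular row into a textual sentence."""
    return ", ".join(f"{col} is {val}" for col, val in row.items())

# A hypothetical 30-feature row, mimicking breast-cancer-like data.
row = {f"feature_{i}": round(0.1 * i, 2) for i in range(30)}
text = row_to_text(row)

# Whitespace tokens are a crude lower bound on GPT-2 subword tokens;
# real tokenizer counts are usually noticeably higher.
approx_tokens = len(text.split())
print(approx_tokens)  # 3 words per feature x 30 features = 90
```

With 30 features this already yields 90 whitespace tokens, and the real subword count is higher, which is why a small max_length can cut generation off before a full row is produced.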
Hope it works for you, please ping if you have any other issues.
Thank you for your prompt response and all your help in resolving the issue! Your suggestions worked perfectly, and I was able to run GReaT on both datasets by increasing the number of epochs and the max_length value for the breast cancer dataset. Keep up the great work!
Happy to read that!
If you have other issues or questions, feel free to open an issue :)
Hello, I've encountered an issue while running GReaT on the breast cancer dataset from sklearn. Training proceeds smoothly, but when I attempt to generate samples with the trained model, the script runs for more than 15 minutes without returning any samples, even for very small n_samples values (the progress bar also appears to be frozen). The same happens with the Iris dataset. I find this behavior peculiar since GReaT ran as expected on the California housing dataset. Have you come across this particular issue before, or do you have any suggestions on how to handle it?
Here is my script: