That's a pretty large number of examples! Can you take a stratified sample of your dataset, say, $k=30$ examples per class, then try finetuning on that and see how it does? Then you can increase $k$ as needed. If you have a lot of classes, you may want to start from $k=5$ to keep things manageable.
The contrastive approach takes a dataset of size $n$ and creates a new dataset of up to $n(n-1)/2$ pairs, i.e. $O(n^2)$, so if $n$ is large, the contrastive dataset will be huge.
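To get a feel for how fast that grows, here is a quick back-of-the-envelope check (plain Python, independent of any library):

```python
# Number of unique pairs, n * (n - 1) / 2, for a few dataset sizes
for n in (100, 1_000, 100_000):
    print(f"n = {n:>7,} -> {n * (n - 1) // 2:>13,} pairs")
# n =     100 ->         4,950 pairs
# n =   1,000 ->       499,500 pairs
# n = 100,000 -> 4,999,950,000 pairs
```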
Hello!
@kgourgou is exactly right, and I second his recommendation. You can use this:
```python
from datasets import load_dataset
from setfit import sample_dataset

# Load your dataset
dataset = load_dataset(...)

# Simulate the few-shot regime by sampling 8 examples per class
train_dataset = sample_dataset(dataset["train"], label_column="label", num_samples=8)
```
This also automatically gives you an even distribution of classes, should you be interested in that.
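If you then want to grow $k$ gradually as suggested above, a minimal sweep could look like this (a sketch, assuming your dataset has a `"validation"` split and columns matching SetFit's defaults):

```python
from setfit import SetFitModel, SetFitTrainer, sample_dataset

for k in (5, 10, 30):
    # Resample k examples per class and train a fresh model each time
    train_k = sample_dataset(dataset["train"], label_column="label", num_samples=k)
    model = SetFitModel.from_pretrained("sentence-transformers/paraphrase-mpnet-base-v2")
    trainer = SetFitTrainer(
        model=model,
        train_dataset=train_k,
        eval_dataset=dataset["validation"],  # assumes a validation split exists
        metric="accuracy",
    )
    trainer.train()
    print(k, trainer.evaluate())
```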
With sufficient data, finetuning a model using 🤗 Transformers tends to outperform SetFit. See for example this image:

[figure: accuracy of SetFit vs. a model finetuned on the full dataset]

The finetuned model on the full dataset outperformed SetFit here. In other words, you may want to consider that option.
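For completeness, a minimal 🤗 Transformers finetuning setup for text classification might look like the sketch below (the checkpoint, dataset, and hyperparameters are illustrative placeholders, not recommendations):

```python
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

# Illustrative dataset; substitute your own text/label columns
dataset = load_dataset("sst2")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize(batch):
    return tokenizer(batch["sentence"], truncation=True)

tokenized = dataset.map(tokenize, batched=True)

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

args = TrainingArguments(
    output_dir="finetuned-classifier",
    num_train_epochs=3,
    per_device_train_batch_size=32,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    tokenizer=tokenizer,  # enables dynamic padding via the default data collator
)
trainer.train()
```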
Yep, you are both absolutely right! Thank you for the suggestions.
Good luck!
Does anyone have a rule of thumb for what size local GPU can be used to fine-tune with proprietary data? Or is there another way to speed up the training that I'm not aware of? I'm fine-tuning a multiclass text classifier on an NVIDIA GeForce GTX 1660 SUPER 6GB. Obviously not a powerful GPU, but I'm not against upgrading to a more powerful unit. It just takes a while to fine-tune.
My training params are as follows:

```
Num examples = 15365720
Num epochs = 1
Total optimization steps = 320120
Total train batch size = 48
```
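For context on where the 15.3M figure comes from: SetFit generates contrastive pairs from your original examples (assuming the usual scheme of `2 * num_iterations` pairs per example), which matches the logged numbers exactly:

```python
# Back out the original dataset size from the logged figures
# (assumes SetFit's 2 * num_iterations pairs per original example)
num_iterations = 20
pairs_per_example = 2 * num_iterations       # 40
num_pair_examples = 15_365_720               # "Num examples" from the log
batch_size = 48

print(num_pair_examples // pairs_per_example)  # 384143 original examples
print(-(-num_pair_examples // batch_size))     # 320120 optimization steps (ceil)
```

So the most direct lever on training time is `num_iterations`: lowering it shrinks the contrastive dataset, and hence the step count, proportionally.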
Code:

```python
from setfit import SetFitModel, SetFitTrainer
from sentence_transformers.losses import CosineSimilarityLoss

# Load a SetFit model from the Hub
model_id = "sentence-transformers/paraphrase-mpnet-base-v2"
model = SetFitModel.from_pretrained(model_id)

# Create trainer
trainer = SetFitTrainer(
    model=model,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    loss_class=CosineSimilarityLoss,
    metric="accuracy",
    batch_size=48,
    num_iterations=20,  # The number of text pairs to generate for contrastive learning
    num_epochs=1,  # The number of epochs to use for contrastive learning
    column_mapping={"line_text": "text", "label": "label"},
)

# Train
trainer.train()
```