huggingface / setfit

Efficient few-shot learning with Sentence Transformers
https://hf.co/docs/setfit
Apache License 2.0
2.14k stars 217 forks

Batch Inference using SetFit classifier #244

Closed: splevine closed this issue 9 months ago

splevine commented 1 year ago

Has there been any work done on using SetFit to make predictions on large datasets in batch/bulk? Any recommendations on how to run a SetFit classifier on, say, 1M documents?

I'm currently doing it inside a dataframe:

df['probs'] = list(model.predict_proba(df['text']))
df['predicted_id'] = df['probs'].apply(np.argmax)
df['prediction'] = df['predicted_id'].map(id2label)
df['max_prob'] = df['probs'].apply(max)

I have this roundabout way of getting id2label.

Any help would be appreciated.

Thanks!

splevine commented 1 year ago

@lewtun Is there anybody that could help with batch inference?

splevine commented 1 year ago

@tomaarsen Hi Tom, any work/thinking on batch inference?

tomaarsen commented 1 year ago

Hello @splevine,

My apologies for missing this issue earlier. My understanding is that SetFit already implements batch inference. You can pass model.predict a list of strings, and they're all turned into embeddings and then classified, each step in a single call: https://github.com/huggingface/setfit/blob/d9aff37af91f200ed640839d1aebe4c9a96e9563/src/setfit/modeling.py#L428-L436

e.g.

>>> model.predict(["This sentence sucks", "This one is awesome", "This one is great!"])
tensor([0, 1, 1], dtype=torch.int32)

Although perhaps we are misunderstanding each other.

splevine commented 1 year ago

Thanks for following up! I was wondering if there were any optimizations for larger samples, say 100K or 1M+ messages. Also, some sort of id2label property would be beneficial.

tomaarsen commented 1 year ago

It definitely could be. I think one of the reasons that it doesn't exist at the moment is that with a Logistic Regression head (i.e. the default, not using use_differentiable_head=True), you can use string labels.

See an example:

```python
from datasets import load_dataset
from sentence_transformers.losses import CosineSimilarityLoss

from setfit import SetFitModel, SetFitTrainer, sample_dataset

# Load a dataset from the Hugging Face Hub
dataset = load_dataset("sst2")

# Simulate the few-shot regime by sampling 8 examples per class
train_dataset = sample_dataset(dataset["train"], label_column="label", num_samples=8)
eval_dataset = dataset["validation"]

def stringize_labels(sample):
    sample["label"] = ["negative", "positive"][sample["label"]]
    return sample

train_dataset = train_dataset.map(stringize_labels)
eval_dataset = eval_dataset.map(stringize_labels)

# Load a SetFit model from Hub
model = SetFitModel.from_pretrained("sentence-transformers/paraphrase-mpnet-base-v2")

# Create trainer
trainer = SetFitTrainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    loss_class=CosineSimilarityLoss,
    metric="accuracy",
    batch_size=16,
    num_iterations=20,  # The number of text pairs to generate for contrastive learning
    num_epochs=1,  # The number of epochs to use for contrastive learning
    column_mapping={"sentence": "text", "label": "label"},
)

# Train and evaluate
trainer.train()

print(model.predict(["Wow, that was awful.", "I loved it."]))
```

```
Applying column mapping to training dataset
Generating Training Pairs: 100%|██████████| 20/20 [00:00<00:00, 4000.86it/s]
***** Running training *****
  Num examples = 640
  Num epochs = 1
  Total optimization steps = 40
  Total train batch size = 16
Iteration: 100%|██████████| 40/40 [00:04<00:00, 8.63it/s]
Epoch: 100%|██████████| 1/1 [00:04<00:00, 4.64s/it]
['negative' 'positive']
```

However, it's a bit flimsy in other ways... Like the `evaluate` accuracy doesn't work with it.

As for the optimizations, I don't know of any good tips in that regard. I don't tend to work with that much data.

tomaarsen commented 9 months ago

#439 will introduce batch_size to model.predict (which is passed down to SentenceTransformer.encode). It turns out this makes a huge difference:

[Figure: SetFit inference speed per batch size]
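
For illustration, a minimal sketch of how the new batch_size argument would be used (the model repo name here is hypothetical, and the exact default value may differ):

```python
from setfit import SetFitModel

# Hypothetical fine-tuned checkpoint; substitute your own trained SetFit model.
model = SetFitModel.from_pretrained("my-user/my-setfit-model")

# In practice this would be the full document collection, e.g. df["text"].tolist().
texts = ["This sentence sucks", "This one is awesome", "This one is great!"]

# Sentences are embedded in chunks of `batch_size` before the head is applied,
# so larger values keep the GPU busy on large inputs.
preds = model.predict(texts, batch_size=128)
```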

Additionally:

  1. As of v1.0.0, you can load a SetFit model with labels, e.g. SetFitModel.from_pretrained("...", labels=["negative", "positive"]). These will be used in model.predict; see the sketch after this list.
  2. As of v1.0.0, these labels will be stored in the model repo & used when the model is loaded. So, you only need to specify the labels once during training, and they'll be used whenever you load the model again.
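
A minimal sketch of the labels behaviour described above (again, the model repo name is hypothetical):

```python
from setfit import SetFitModel

# Hypothetical fine-tuned checkpoint loaded with string labels.
model = SetFitModel.from_pretrained(
    "my-user/my-setfit-model",
    labels=["negative", "positive"],
)

# predict now returns the string labels directly instead of integer ids,
# removing the need for a manual id2label mapping.
print(model.predict(["This sentence sucks", "This one is great!"]))
# ['negative', 'positive']
```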

Stay tuned, expect the update this week.

tomaarsen commented 9 months ago

Closed via #439