splevine closed this 9 months ago
@lewtun Is there anybody who could help with batch inference?
@tomaarsen Hi Tom, any work/thinking on batch inference?
Hello @splevine,
My apologies for missing this issue earlier. My understanding is that SetFit already implements batch inference: you can pass model.predict a list of strings, and they are all embedded and then classified in a single call each:
https://github.com/huggingface/setfit/blob/d9aff37af91f200ed640839d1aebe4c9a96e9563/src/setfit/modeling.py#L428-L436
e.g.
>>> model.predict(["This sentence sucks", "This one is awesome", "This one is great!"])
tensor([0, 1, 1], dtype=torch.int32)
Although perhaps we are misunderstanding each other.
Thanks for following up! I was wondering if there are any optimizations for larger samples, say 100K or 1M+ messages. Also, some sort of id2label property could be beneficial.
It definitely could be. I think one of the reasons that it doesn't exist at the moment is that with a Logistic Regression head (i.e. the default, not using use_differentiable_head=True), you can use string labels directly.
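To illustrate why no id2label is needed with the default head: scikit-learn's LogisticRegression (the class SetFit uses for its default head) accepts string labels in fit and returns them from predict. A minimal sketch, with toy 2-D features standing in for sentence embeddings:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# toy "embeddings": second dimension high -> positive, first high -> negative
X = np.array([[0.0, 1.0], [1.0, 0.0], [0.9, 0.1], [0.1, 0.9]])
y = ["positive", "negative", "negative", "positive"]

clf = LogisticRegression().fit(X, y)

# predictions come back as the original string labels, no id2label needed
pred = clf.predict([[0.05, 0.95]])
```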
As for the optimizations, I don't know of any good tips in that regard. I don't tend to work with that much data.
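One generic approach for corpora in the 100K–1M range, independent of SetFit internals, is to run predict over the texts in fixed-size chunks so you never build one enormous batch in memory. A minimal sketch — predict_in_chunks and the dummy classifier below are illustrative helpers, not SetFit API:

```python
from typing import Callable, List

def predict_in_chunks(
    predict: Callable[[List[str]], List],
    texts: List[str],
    chunk_size: int = 1000,
) -> List:
    """Run any list-in/list-out predict() over a large corpus in chunks.

    `predict` is assumed to accept a list of strings, as
    SetFitModel.predict does; results are concatenated in order.
    """
    out: List = []
    for start in range(0, len(texts), chunk_size):
        out.extend(predict(texts[start:start + chunk_size]))
    return out

# usage with a dummy classifier standing in for model.predict
def dummy_predict(batch: List[str]) -> List[int]:
    return [len(text) % 2 for text in batch]

preds = predict_in_chunks(dummy_predict, ["a", "bb", "ccc", "dddd", "eeeee"], chunk_size=2)
print(preds)  # [1, 0, 1, 0, 1]
```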
I've looked into this, and two improvements are coming:

1. You can pass batch_size to model.predict (which is passed down to SentenceTransformers.encode). Turns out, it makes a huge difference.
2. You'll be able to provide labels, e.g. SetFitModel.from_pretrained("...", labels=["negative", "positive"]). These will be used in model.predict. The labels data will be stored in the model repo & used when loaded. So, you only need to specify the labels during training once, and then it'll always be used when you load the model again.

Stay tuned, expect the update this week.
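Until that release lands, mapping integer predictions to names by hand works as a stopgap. A minimal sketch — the id2label dict here is something you maintain yourself, not a SetFit attribute:

```python
# hypothetical mapping you keep alongside your model; the default
# LogisticRegression head returns the raw ids it was trained with
id2label = {0: "negative", 1: "positive"}

# e.g. the result of model.predict(...).tolist()
preds = [0, 1, 1]

labels = [id2label[int(i)] for i in preds]
print(labels)  # ['negative', 'positive', 'positive']
```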
Closed via #439
Has there been any work done on using SetFit to make predictions on large datasets in batch/bulk? Any recommendations on how to run a SetFit classifier on, say, 1m documents? I'm currently doing it inside a dataframe, and I have a roundabout way of getting id2label. Any help would be appreciated.
Thanks!