brianmvk commented 1 year ago

Hi all, I'm trying to train SBERT with SetFit to classify whether two sentences are duplicates. How do I make "column_mapping" accept two sentences instead of one?

Below is the code I tried.

    # Create trainer
    trainer = SetFitTrainer(
        model=model,  # SBERT model
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        loss_class=CosineSimilarityLoss,
        metric="accuracy",
        batch_size=32,  # 2X num_samples
        num_iterations=60,
        num_epochs=3,
        column_mapping={
            "sentence1Title": "text1",
            "sentence2Title": "text2",
            "duplicate": "label",
            "text": "text",
        },
    )

This is the error I get: ValueError: The column mapping expected the columns ['duplicate', 'sentence1Title', 'sentence2Title', 'text'] in the dataset, but the dataset had the columns ['Unnamed: 0', 'duplicate', 'sentence1Body', 'sentence1Title', 'sentence2Body', 'sentence2Title'].

Thank you in advance!
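For readers hitting the same ValueError: the message indicates that every key in the column mapping has to name a column that exists in the dataset, and "text" does not. Below is a minimal plain-Python sketch of that kind of check, using the column names from the error above. It is not SetFit's actual implementation, and `missing_columns` is a hypothetical helper introduced only for illustration.

```python
def missing_columns(column_mapping, dataset_columns):
    """Return the mapping keys that are not columns of the dataset, sorted."""
    return sorted(set(column_mapping) - set(dataset_columns))


# Column names copied from the error message in the question.
dataset_columns = [
    "Unnamed: 0",
    "duplicate",
    "sentence1Body",
    "sentence1Title",
    "sentence2Body",
    "sentence2Title",
]

# The mapping passed to the trainer in the question.
column_mapping = {
    "sentence1Title": "text1",
    "sentence2Title": "text2",
    "duplicate": "label",
    "text": "text",
}

missing = missing_columns(column_mapping, dataset_columns)
print(missing)  # ['text']
```

Removing the "text" key would get past this particular check, but as the replies note, the trainer still has no way to consume two separate text columns.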

kgourgou commented 1 year ago

Happy to learn something new here, but I think SetFit at this point doesn't support passing two sentences to the model during training, except by combining them into a single input string.
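One way to read the "same input string" option is to merge the two title columns into a single "text" column before training. The sketch below does this in plain Python; `merge_pair`, `SEPARATOR`, and the two example rows are made up for illustration and are not part of SetFit. Whether this gives good results for duplicate detection is not established in this thread.

```python
# Assumption: an arbitrary delimiter. Whether the tokenizer treats it as a
# special token depends on the underlying model.
SEPARATOR = " [SEP] "


def merge_pair(example):
    """Return a copy of the row with both titles joined into one 'text' field."""
    merged = dict(example)
    merged["text"] = example["sentence1Title"] + SEPARATOR + example["sentence2Title"]
    return merged


# Made-up rows that mimic the columns named in the question.
rows = [
    {"sentence1Title": "How do I sort a list?",
     "sentence2Title": "Sorting a list in Python",
     "duplicate": 1},
    {"sentence1Title": "How do I sort a list?",
     "sentence2Title": "What is a decorator?",
     "duplicate": 0},
]

merged_rows = [merge_pair(row) for row in rows]

# With a single text column, the mapping only needs two entries:
# dataset column name -> the name the trainer expects.
column_mapping = {"text": "text", "duplicate": "label"}

print(merged_rows[0]["text"])
```

If the data lives in a Hugging Face `datasets.Dataset`, the same function could be applied with `train_dataset.map(merge_pair)`, since `map` accepts a function that takes a row as a dict and returns a dict.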

Would a cross encoder be a better solution for this task?

tomaarsen commented 1 year ago

This is indeed currently not supported. This task has previously been called sentence pair classification, although it's usually used for Natural Language Inference. #91 is a related issue.

Tom Aarsen

Brianmvk11 commented 1 year ago

I eventually saw that this was not yet possible. Thank you!

huggingface / setfit

Train SBERT using 2 sentences as input for detecting if 2 sentences are duplicates on one another. Using setfit. #382
