huggingface / setfit

Efficient few-shot learning with Sentence Transformers
https://hf.co/docs/setfit
Apache License 2.0

ValueError: Multioutput target data is not supported with label binarization #499

Open isaldiviagonzatti opened 8 months ago

isaldiviagonzatti commented 8 months ago

After running the trainer for >5 hours, I get ValueError: Multioutput target data is not supported with label binarization

My train_dataset and eval_dataset have one text column, one labels column (a list of 0/1 values), and one binary column per label. So it's the same layout as in the text-classification_multilabel.ipynb example.
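For concreteness, a single row looks roughly like this (the label names here are made up for illustration):

# Illustrative row; the real label columns come from the dataset's features.
{
    "text": "Example abstract ...",
    "labels": [1, 0, 1],  # one 0/1 entry per label, in a fixed order
    "label_a": 1, "label_b": 0, "label_c": 1,
}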

Any idea what could be going on? Thanks

amina8annane commented 3 months ago

Hello, I have the same problem. Have you found a solution? Thanks

isaldiviagonzatti commented 3 months ago

@amina8annane Sorry, honestly I don't remember if or how I solved it. I know I did get results with SetFit, but they were quite poor for my use case, so I didn't pursue it further. See if either of these resources helps: https://www.reddit.com/r/learnmachinelearning/comments/r7ki6k/how_to_fix_multioutput_target_data_is_not/ and https://stackoverflow.com/questions/58171410/multioutput-target-data-is-not-supported-with-label-binarization
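Both links boil down to the same scikit-learn behaviour, so here is a minimal sketch of the failure mode (plain scikit-learn, nothing SetFit-specific; where exactly SetFit hits this code path may depend on the version):

import numpy as np
from sklearn.preprocessing import LabelBinarizer

# A strict 0/1 indicator matrix is fine: it is typed "multilabel-indicator",
# which label binarization supports.
LabelBinarizer().fit(np.array([[0, 1], [1, 0]]))

# Any other 2-D target (here a 2 sneaks in) is typed "multiclass-multioutput"
# and raises:
# ValueError: Multioutput target data is not supported with label binarization
LabelBinarizer().fit(np.array([[0, 2], [1, 0]]))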

In case you want to compare to my code, I checked and this is what I have:

# `features` is the list of label column names and `samples` the indices of
# the few-shot training examples; both are defined earlier in the notebook.
import numpy as np

def encode_labels(record):
    # Gather the per-label 0/1 columns into a single "labels" vector.
    return {"labels": [record[feature] for feature in features]}

dataset = ds["train"].map(encode_labels)

train_dataset = dataset.select(samples)
eval_dataset = dataset.select(
    np.setdiff1d(np.arange(len(dataset)), samples)  # everything not sampled
)
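One sanity check worth running at this point (my assumption being that the error comes from the label format): the "labels" column should form a strict 2-D 0/1 indicator matrix, since anything else (floats, None, values other than 0/1) makes scikit-learn classify the target as "multioutput" and refuse to binarize it.

y = np.asarray(train_dataset["labels"])
print(y.shape, y.dtype, np.unique(y))  # expect (n_samples, n_labels) and [0 1]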

from setfit import SetFitModel

model_id = "sentence-transformers/paraphrase-mpnet-base-v2"
model = SetFitModel.from_pretrained(model_id, multi_target_strategy="multi-output")
model.model_head  # inspect the classification head
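With multi_target_strategy="multi-output", the head printed above should be scikit-learn's MultiOutputClassifier wrapping the default LogisticRegression (my understanding of SetFit's sklearn heads; worth confirming on your installed version):

from sklearn.multioutput import MultiOutputClassifier

# If this fails, the multi-target strategy was not applied to the head.
assert isinstance(model.model_head, MultiOutputClassifier), type(model.model_head)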

from setfit import Trainer, TrainingArguments

args = TrainingArguments(
    head_learning_rate=0.0006155918397454662,
    batch_size=1,
    num_epochs=1,
    # max_steps=2350,  # overrides num_epochs
    # eval_max_steps=10,
    # num_iterations=20,
    max_length=1000,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    metric="accuracy",
    column_mapping={"abstract": "text", "labels": "label"},
)
trainer.train()
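One more thing that may be worth trying: the string metric "accuracy" is loaded via the evaluate library, which expects 1-D integer targets and can choke on multilabel indicator matrices. A hedged alternative, assuming your SetFit version accepts a callable metric taking (y_pred, y_test) as recent releases document, is to compute multilabel scores with scikit-learn directly:

from sklearn.metrics import accuracy_score, f1_score

def multilabel_metrics(y_pred, y_test):
    return {
        "subset_accuracy": accuracy_score(y_test, y_pred),  # exact-match ratio
        "f1_micro": f1_score(y_test, y_pred, average="micro"),
    }

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    metric=multilabel_metrics,
    column_mapping={"abstract": "text", "labels": "label"},
)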