huggingface / setfit

Efficient few-shot learning with Sentence Transformers
https://hf.co/docs/setfit
Apache License 2.0

Multiclass Training Error #531

Closed BIGdeadLock closed 5 months ago

BIGdeadLock commented 5 months ago

Model Loading:

from setfit import SetFitModel

setfit = SetFitModel.from_pretrained(
    MODEL_NAME, # e.g. "BAAI/bge-small-en-v1.5"
    multi_target_strategy="one-vs-rest",
    use_differentiable_head=False,
)

Dataset creation:

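import pandas as pd
from datasets import Dataset
from sklearn.preprocessing import LabelEncoder
from tqdm import tqdm

# preprocess() is the poster's own helper and is not shown in the issue.
def create_setfit_dataset(documents):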
    urls = [d.metadata['url'].split("/")[-1] for d in documents]
    urls_docs = {}

    X_train, X_test = [], []
    label_encoder = LabelEncoder()
    label_encoder.fit(urls)

    for d1 in tqdm(documents):
        p = d1.metadata['url'].split("/")[-1]
        label = label_encoder.transform([p])[0]
        urls_docs.setdefault(label, []).append(preprocess(d1))

    for label, docs in urls_docs.items():
        if len(docs) < 10:
            X_train += [{"text": d, "label": label} for d in docs]
            X_test += [{"text": docs[0], "label": label}]
        else:
            num_of_test = max(1, int(len(docs) * 0.01))  # hold out about 1%, at least one document
            X_train += [{"text": d, "label": label} for d in docs[:-num_of_test]]
            X_test += [{"text": d, "label": label} for d in docs[-num_of_test:]]

    train_dataset = Dataset.from_pandas(pd.DataFrame(X_train), split="train")
    test_dataset = Dataset.from_pandas(pd.DataFrame(X_test), split="test")
    return train_dataset, test_dataset

Training:

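from sentence_transformers.losses import CosineSimilarityLoss
from setfit import Trainer, TrainingArguments

args = TrainingArguments(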
    # Required parameter:
    output_dir=f"models/setfit-{MODEL_NAME}",
    # Optional training parameters:
    body_learning_rate=1.8859376752033417e-05,
    num_epochs=1,
    batch_size=8,
    warmup_proportion=0.1,
    sampling_strategy="oversampling",
    loss=CosineSimilarityLoss,
    # Optional tracking/debugging parameters:
    logging_strategy="steps",
    logging_steps=1000,
    evaluation_strategy="steps",
    logging_first_step=True,
    eval_steps=1000,
    save_strategy="steps",
    save_steps=1000,
    run_name="finetune-setfit",
    load_best_model_at_end=True
)

trainer = Trainer(
    model=setfit,
    train_dataset=train_dataset,
    eval_dataset=X_test,
    args=args,
    column_mapping={"text": "text", "label": "label"},  # Map dataset columns to text/label expected by trainer
)
trainer.train()

I am trying to train the model on a multiclass problem and keep getting this error:

TypeError: 'numpy.bool_' object is not iterable
File <command-119883175952931>, line 19
      7 train_dataset = sample_dataset(X_train, label_column="label", num_samples=10)
      9 trainer = Trainer(
     10 model=setfit,
     11 train_dataset=train_dataset,
   (...)
     17 column_mapping={"text": "text", "label": "label"} # Map dataset columns to text/label expected by trainer
     18 )
---> 19 trainer.train()

<>

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/setfit/sampler.py:100, in ContrastiveDataset.generate_multilabel_pairs(self)
     98 def generate_multilabel_pairs(self) -> None:
     99     for (_text, _label), (text, label) in shuffle_combinations(self.sentence_labels):
--> 100         if any(np.logical_and(_label, label)):
    101             # logical_and checks if labels are both set for each class
    102             self.pos_pairs.append(InputExample(texts=[_text, text], label=1.0))
    103         else:

A training example:

{'text': <text>, 'label': 10}

BIGdeadLock commented 5 months ago

Never mind, the problem was that I was not using a one-hot vector for the label.
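For readers hitting the same traceback: `np.logical_and` applied to two scalar labels returns a single `numpy.bool_`, which `any()` cannot iterate over, whereas two per-class vectors give an iterable result. The following is a minimal sketch of the one-hot conversion described above, using plain Python lists. The helper names `one_hot` and `to_multilabel_records` and the toy records are hypothetical and not part of SetFit.

```python
def one_hot(label, num_classes):
    """Return a list of length num_classes with a 1 at index `label`."""
    if not 0 <= label < num_classes:
        raise ValueError(f"label {label} is outside 0..{num_classes - 1}")
    return [1 if i == label else 0 for i in range(num_classes)]


def to_multilabel_records(records, num_classes):
    """Copy each {'text', 'label'} record, replacing the integer label with a one-hot list."""
    return [
        {"text": r["text"], "label": one_hot(r["label"], num_classes)}
        for r in records
    ]


# Hypothetical toy data; the integer labels mimic LabelEncoder output.
records = [
    {"text": "first document", "label": 0},
    {"text": "second document", "label": 2},
]
encoded = to_multilabel_records(records, num_classes=3)
print(encoded[1]["label"])  # [0, 0, 1]
```

The encoded records should then be usable with `pd.DataFrame(...)` and `Dataset.from_pandas(...)` in the same way as the integer-labelled ones, since a column of plain lists can be converted to an Arrow list type.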

ozefreitas commented 1 month ago

Hello, I just came across this exact problem. If I understood correctly, to fix the issue you one-hot encoded the labels, converted to pandas and then to a Dataset, and it just worked? In my case, after converting to one-hot, the SparseVector type is not recognized:

Problem: 6 nominal classes

Steps:

  1. Convert to numerical with pyspark StringIndexer
  2. Convert to one-hot with pyspark OneHotEncoder
  3. Convert to pandas
  4. When converting to the Dataset type for SetFit, get a new error: ArrowInvalid: ('Could not convert SparseVector(6, {2: 1.0}) with type SparseVector: did not recognize Python value type when inferring an Arrow data type', 'Conversion failed for column label_vector with type object')

Packages:

Both keeping the labels nominal and converting them to numerical indices trigger the error you mentioned, TypeError: 'numpy.bool_' object is not iterable. Binary text classification works fine.

Could you post the solution and a data sample of the working dataset?

BIGdeadLock commented 1 month ago

@ozefreitas

I did not use pyspark for the one-hot encoding; I did it with numpy. It seems pyspark converts the labels to a sparse vector, which may be causing the problem. Try using something else, like numpy, for the encoding.
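The ArrowInvalid message suggests that Arrow cannot infer a column type from SparseVector objects, so one option is to expand each vector into a plain list before building the Dataset. The sketch below is only an illustration: `sparse_to_dense` is a hypothetical helper that takes the same size and index-to-value mapping shown in the error message, and it does not import pyspark. With a real pyspark SparseVector, calling `toArray()` and then `tolist()` on the result should give the same kind of dense list.

```python
def sparse_to_dense(size, active):
    """Expand a (size, {index: value}) sparse encoding into a dense list of floats."""
    dense = [0.0] * size
    for index, value in active.items():
        if not 0 <= index < size:
            raise IndexError(f"index {index} is outside 0..{size - 1}")
        dense[index] = float(value)
    return dense


# Stand-in for SparseVector(6, {2: 1.0}) from the error message above.
label_vector = sparse_to_dense(6, {2: 1.0})
print(label_vector)  # [0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
```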

ctandrewtran commented 1 month ago

train_dataset

I am confused about this. I thought we simply provide {'text': <text>, 'label': 10}
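A possible reading of the thread: the traceback goes through `generate_multilabel_pairs`, which is presumably selected because the model was loaded with `multi_target_strategy="one-vs-rest"`, and that path expects one vector per example. The sketch below contrasts the two ways a contrastive pair can be judged positive. It assumes the single-label path compares integer labels for equality; both function names are hypothetical and this is not SetFit's code.

```python
def is_positive_single_label(label_a, label_b):
    """Single-label case: a pair is positive when both integer labels match."""
    return label_a == label_b


def is_positive_multilabel(vector_a, vector_b):
    """Multilabel case: a pair is positive when the vectors share an active class."""
    return any(a and b for a, b in zip(vector_a, vector_b))


print(is_positive_single_label(10, 10))              # True
print(is_positive_multilabel([0, 1, 0], [0, 1, 1]))  # True
print(is_positive_multilabel([1, 0, 0], [0, 1, 0]))  # False
```

Under that assumption, integer labels such as `{'text': <text>, 'label': 10}` fit the single-label check, while the multilabel check needs the one-hot form discussed earlier in the thread.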