huggingface / setfit

Efficient few-shot learning with Sentence Transformers
https://hf.co/docs/setfit
Apache License 2.0

ValueError "invalid literal for int() with base 10" in trainer.evaluate (dataset created from pandas) #228

Closed. fpservant closed this issue 11 months ago.

fpservant commented 1 year ago

ValueError in trainer.evaluate under the following conditions: Dataset created from a pandas DataFrame, with the label column containing strings.

Here is what I do:

# pip install setfit --upgrade

from sentence_transformers.losses import CosineSimilarityLoss
from setfit import SetFitModel, SetFitTrainer
from datasets import Dataset 
import pandas as pd

df = pd.DataFrame([
    ['text 1','LABEL1'], ['text 10','LABEL1'], ['text 11','LABEL1'], ['text 12','LABEL1'],
    ['foo 2','LABEL2'], ['foo 20','LABEL2'], ['foo 21','LABEL2'], ['foo 22','LABEL2'],
    ['bar 3','LABEL3'], ['bar 30','LABEL3'], ['bar 31','LABEL3'], ['bar 32','LABEL3'],
       ])
df.columns = ['text','label']

train_ds = Dataset.from_pandas(df)
eval_ds = train_ds # yes, OK, normally not what we would do

model = SetFitModel.from_pretrained("sentence-transformers/paraphrase-mpnet-base-v2")

trainer = SetFitTrainer(
    model=model,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    loss_class=CosineSimilarityLoss,
    metric="accuracy",
    batch_size=4,
    num_iterations=3, # The number of text pairs to generate for contrastive learning
    num_epochs=1, # The number of epochs to use for contrastive learning
)

trainer.train()
trainer.evaluate()

Here's the output of the last line:

***** Running evaluation *****
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/var/folders/vl/96b85fpj5_s8ylvy5ncrfrnr0000gn/T/ipykernel_23249/2732109216.py in <module>
----> 1 trainer.evaluate()

~/_fps/DeveloperTools/virtualenvs/fps_env/lib/python3.9/site-packages/setfit/trainer.py in evaluate(self)
    409             metric_fn = evaluate.load(self.metric, config_name=metric_config)
    410 
--> 411             return metric_fn.compute(predictions=y_pred, references=y_test)
    412 
    413         elif callable(self.metric):

~/_fps/DeveloperTools/virtualenvs/fps_env/lib/python3.9/site-packages/evaluate/module.py in compute(self, predictions, references, **kwargs)
    430 
    431         if any(v is not None for v in inputs.values()):
--> 432             self.add_batch(**inputs)
    433         self._finalize()
    434 

~/_fps/DeveloperTools/virtualenvs/fps_env/lib/python3.9/site-packages/evaluate/module.py in add_batch(self, predictions, references, **kwargs)
    484                 if len(column) > 0:
    485                     self._enforce_nested_string_type(self.current_features[key], column[0])
--> 486             batch = self.current_features.encode_batch(batch)
    487             self.writer.write_batch(batch)
    488         except (pa.ArrowInvalid, TypeError):

~/_fps/DeveloperTools/virtualenvs/fps_env/lib/python3.9/site-packages/datasets/features/features.py in encode_batch(self, batch)
   1594         for key, column in batch.items():
   1595             column = cast_to_python_objects(column)
-> 1596             encoded_batch[key] = [encode_nested_example(self[key], obj) for obj in column]
   1597         return encoded_batch
   1598 

~/_fps/DeveloperTools/virtualenvs/fps_env/lib/python3.9/site-packages/datasets/features/features.py in <listcomp>(.0)
   1594         for key, column in batch.items():
   1595             column = cast_to_python_objects(column)
-> 1596             encoded_batch[key] = [encode_nested_example(self[key], obj) for obj in column]
   1597         return encoded_batch
   1598 

~/_fps/DeveloperTools/virtualenvs/fps_env/lib/python3.9/site-packages/datasets/features/features.py in encode_nested_example(schema, obj, level)
   1201     # ClassLabel will convert from string to int, TranslationVariableLanguages does some checks
   1202     elif isinstance(schema, (Audio, Image, ClassLabel, TranslationVariableLanguages, Value, _ArrayXD)):
-> 1203         return schema.encode_example(obj) if obj is not None else None
   1204     # Other object should be directly convertible to a native Arrow type (like Translation and Translation)
   1205     return obj

~/_fps/DeveloperTools/virtualenvs/fps_env/lib/python3.9/site-packages/datasets/features/features.py in encode_example(self, value)
    463             return bool(value)
    464         elif pa.types.is_integer(self.pa_type):
--> 465             return int(value)
    466         elif pa.types.is_floating(self.pa_type):
    467             return float(value)

ValueError: invalid literal for int() with base 10: 'LABEL1'

Best Regards, fps

tomaarsen commented 1 year ago

Hello @fpservant,

I believe this is because accuracy from Hugging Face's evaluate package expects the ground truth to be integer labels rather than strings:

>>> import evaluate
>>> accuracy_metric = evaluate.load("accuracy")
>>> accuracy_metric.compute(references=[0, 1], predictions=[1, 1])
{'accuracy': 0.5}
>>> accuracy_metric.compute(references=["a", "b"], predictions=["b", "b"])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "[sic]\evaluate\module.py", line 432, in compute
    self.add_batch(**inputs)
  File "[sic]\evaluate\module.py", line 486, in add_batch
    batch = self.selected_feature_format.encode_batch(batch)
  File "[sic]\datasets\features\features.py", line 1596, in encode_batch
    encoded_batch[key] = [encode_nested_example(self[key], obj) for obj in column]
  File "[sic]\datasets\features\features.py", line 1596, in <listcomp>
    encoded_batch[key] = [encode_nested_example(self[key], obj) for obj in column]
  File "[sic]\datasets\features\features.py", line 1203, in encode_nested_example
    return schema.encode_example(obj) if obj is not None else None
  File "[sic]\datasets\features\features.py", line 465, in encode_example
    return int(value)
ValueError: invalid literal for int() with base 10: 'b'

This can be resolved by encoding your labels, e.g.:

# Apply encoding
label_to_int = {label: idx for idx, label in enumerate(df["label"].unique())}
# {'LABEL1': 0, 'LABEL2': 1, 'LABEL3': 2}
df["label"] = df["label"].map(label_to_int)
(See full working script ready to copy-paste)

```python
from sentence_transformers.losses import CosineSimilarityLoss
from setfit import SetFitModel, SetFitTrainer
from datasets import Dataset
import pandas as pd

df = pd.DataFrame([
    ['text 1','LABEL1'], ['text 10','LABEL1'], ['text 11','LABEL1'], ['text 12','LABEL1'],
    ['foo 2','LABEL2'], ['foo 20','LABEL2'], ['foo 21','LABEL2'], ['foo 22','LABEL2'],
    ['bar 3','LABEL3'], ['bar 30','LABEL3'], ['bar 31','LABEL3'], ['bar 32','LABEL3'],
])
df.columns = ['text','label']

# Apply encoding
label_to_int = {label: idx for idx, label in enumerate(df["label"].unique())}
# {'LABEL1': 0, 'LABEL2': 1, 'LABEL3': 2}
df["label"] = df["label"].map(label_to_int)

train_ds = Dataset.from_pandas(df)
eval_ds = train_ds # yes, OK, normally not what we would do

model = SetFitModel.from_pretrained("sentence-transformers/paraphrase-mpnet-base-v2")

trainer = SetFitTrainer(
    model=model,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    loss_class=CosineSimilarityLoss,
    metric="accuracy",
    batch_size=4,
    num_iterations=3, # The number of text pairs to generate for contrastive learning
    num_epochs=1, # The number of epochs to use for contrastive learning
)

trainer.train()
print(trainer.evaluate())
```

This now outputs:

model_head.pkl not found on HuggingFace Hub, initialising classification head with random weights. You should TRAIN this model on a downstream task to use it for predictions and inference.
***** Running training *****
  Num examples = 72
  Num epochs = 1
  Total optimization steps = 18
  Total train batch size = 4
Iteration: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 18/18 [00:06<00:00,  2.99it/s]
Epoch: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:06<00:00,  6.02s/it]
***** Running evaluation *****
{'accuracy': 1.0}

I don't believe this is mentioned in the README, and we don't currently have any docs set up. Apologies for this. When docs are set up, this should certainly be included. Alternatively, we could implement an encoding behind the scenes as a pre-processing step before training when we encounter string labels.
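For illustration, here is a minimal sketch of what such a pre-processing step could look like on the dataset side. The helper name `encode_string_labels` is made up for this example and is not part of the SetFit or datasets API:

```python
from datasets import ClassLabel, Dataset

def encode_string_labels(ds: Dataset, column: str = "label") -> Dataset:
    # Hypothetical helper: leave the dataset untouched if labels are already encoded
    if isinstance(ds.features[column], ClassLabel) or not isinstance(ds[column][0], str):
        return ds
    # Build a deterministic string -> int mapping and apply it to the label column
    label_to_int = {name: idx for idx, name in enumerate(sorted(set(ds[column])))}
    return ds.map(lambda example: {column: label_to_int[example[column]]})

# Usage (reusing train_ds / eval_ds from the snippet above):
# train_ds = encode_string_labels(train_ds)
# eval_ds = encode_string_labels(eval_ds)
```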

Hope this helps somewhat.

fpservant commented 1 year ago

Hi @tomaarsen, thank you for your answer. The behavior is a bit surprising, as trainer.train seems to work perfectly with the string labels (and the SetFitTrainer knows how to convert from a string label to an index, as shown by the results of trainer.model.predict and trainer.model.predict_proba). So it seems to me that there is already some form of "encoding behind the scenes" at the trainer level, and it would be more user friendly if it also worked for evaluation. But anyway, thank you very much; your answer was fast, informative and helpful. Best Regards, fps

tomaarsen commented 1 year ago

I agree, it is odd: (I believe) only the evaluation breaks, while everything else works correctly. It should be possible to counteract this by encoding labels only if we use a Hugging Face evaluate metric, and only in the trainer.evaluate call. However, there is also something to be said for treating this simply as a feature request for Hugging Face's evaluate instead.

In case you're looking for alternatives, another option that I just thought of is that you can also provide a callable as the metric: https://github.com/huggingface/setfit/blob/eee595ede962b9a9bbe62d4d919f5629d2fc2868/src/setfit/trainer.py#L427-L428

That way, you can provide your own functions for accuracy or other metrics which do allow string labels.
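For example, here is a rough sketch of such a callable metric that accepts string labels directly. The (y_pred, y_test) signature follows the trainer code linked above, and the model/dataset variables are the ones from the earlier snippets in this thread:

```python
def string_label_accuracy(y_pred, y_test):
    # Works for string labels as well as integers: simple element-wise comparison
    correct = sum(p == t for p, t in zip(y_pred, y_test))
    return {"accuracy": correct / len(y_test)}

trainer = SetFitTrainer(
    model=model,                  # model / datasets as defined earlier in this thread
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    metric=string_label_accuracy, # a callable instead of the "accuracy" string
)
```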

AppleMax1992 commented 1 year ago

I met the same problem today. I found that instead of using load_metric from the datasets package, you can just use accuracy_score from sklearn to build the compute_metrics function as follows:

from sklearn.metrics import accuracy_score

def compute_metrics(y_pred, y_test):
    accuracy = accuracy_score(y_test, y_pred)
    return {"accuracy": accuracy}

This solves the problem.

anth0nyhak1m commented 1 year ago

This is actually a really awful inconvenience that still exists today. Please fix.

tomaarsen commented 11 months ago

#439 should resolve this problem; evaluating with string labels should be supported once that gets merged and released this week.

tomaarsen commented 11 months ago

Closed via #439

vbabashov commented 10 months ago

I still get that error with the Seq2Seq trainer, and I am using the latest version of transformers.

tomaarsen commented 10 months ago

SetFit doesn't offer a Seq2Seq trainer; that is specific to transformers: https://huggingface.co/docs/transformers/en/main_classes/trainer#transformers.Seq2SeqTrainer

vbabashov commented 9 months ago

I meant to refer to the behaviour of the Hugging Face accuracy metric (i.e., only working with integer labels).