huggingface / setfit

Efficient few-shot learning with Sentence Transformers
https://hf.co/docs/setfit
Apache License 2.0

[question]: creating a custom dataset class like `sst` to fit into `setfit`, throws `Cannot index by location index with a non-integer key` #289

Closed maifeeulasad closed 1 year ago

maifeeulasad commented 1 year ago

I'm trying to experiment with some PyTorch model; the dataset they were using for the experiment is sst.

But I'm also learning PyTorch, so I thought it would be better to play with the Dataset class and create my own dataset.

So this was my approach:

class CustomDataset(Dataset):
    def __init__(self, dataframe):
        self.dataframe = dataframe
        self.column_names = ['text', 'label']

    def __getitem__(self, index):
        print('index: ', index)
        row = self.dataframe.iloc[index].to_numpy()
        features = row[:-1]  # the 'text' column
        label = row[-1]      # the 'label' column
        return features, label

    def __len__(self):
        return len(self.dataframe)

df = pd.DataFrame(np.array([
    ["hello", 0],
    ["sex", 1],
    ["beshi kore sex", 1],
]), columns=['text', 'label'])

dataset = CustomDataset(dataframe=df)

Instead of creating separate validation/test/train splits, I'm just trying to create one custom Dataset class at first.

And it keeps giving me `Cannot index by location index with a non-integer key`. While working this out, I tried df.iloc[0].to_numpy() on its own, and it works absolutely fine. But __getitem__ is being passed index: text for some reason. I even tried adding an 'id' column.

But I'm sure there must be some other way to achieve this. How can I resolve this issue? The same setup worked fine with sst, but it no longer works with my custom class, so I'm pretty sure this is not a one-to-one replacement.
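For reference, this exact message comes from pandas itself whenever .iloc is given a non-integer key; a tiny standalone repro (my own sketch, independent of SetFit, just to show where the message originates):

import pandas as pd

df = pd.DataFrame({"text": ["hello"], "label": [0]})

df.iloc[0]        # fine: integer positional index
df.iloc["text"]   # TypeError: Cannot index by location index with a non-integer key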

Complete code:

#!pip install sentence_transformers -q
#!pip install setfit -q

from sentence_transformers.losses import CosineSimilarityLoss
from torch.utils.data import Dataset
import pandas as pd
import numpy as np
from setfit import SetFitModel, SetFitTrainer, sample_dataset

class CustomDataset(Dataset):
    def __init__(self, dataframe):
        self.dataframe = dataframe
        self.column_names = ['id', 'text', 'label']

    def __getitem__(self, index):
        print('index: ', index)
        row = self.dataframe.iloc[index].to_numpy()
        features = row[1:-1]  # the 'text' column (skip 'id' and 'label')
        label = row[-1]       # the 'label' column
        return features, label

    def __len__(self):
        return len(self.dataframe)

df = pd.DataFrame(np.array([
    [1, "hello", 0],
    [2, "sex", 1],
    [3, "beshi kore sex", 1],
]), columns=['id', 'text', 'label'])
# df.head()

dataset = CustomDataset(dataframe=df)

# Load a dataset from the Hugging Face Hub
# dataset = load_dataset("sst2") # HERE, previously I was simply using sst/sst2

# Use the same toy dataset for both training and evaluation
# (no few-shot sampling here, so sample_dataset goes unused)
train_dataset = dataset
eval_dataset = dataset

# Load a SetFit model from Hub
model = SetFitModel.from_pretrained("sentence-transformers/paraphrase-mpnet-base-v2")

# Create trainer
trainer = SetFitTrainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    loss_class=CosineSimilarityLoss,
    metric="accuracy",
    batch_size=16,
    num_iterations=1, # The number of text pairs to generate for contrastive learning
    num_epochs=1, # The number of epochs to use for contrastive learning
)

# Train and evaluate
trainer.train()

tomaarsen commented 1 year ago

Hello!

SetFit actually uses Hugging Face `datasets` `Dataset` instances, rather than plain torch `Dataset`s. There is some documentation on Hugging Face `datasets`, but there does not seem to be a very convenient way to convert a torch `Dataset` into a Hugging Face `datasets` `Dataset` (although it does seem possible in a hacky way).
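In case it's useful, here is a minimal sketch of that conversion for the toy DataFrame above, using `datasets.Dataset.from_pandas` (the column names text/label follow the original example):

from datasets import Dataset
import pandas as pd

df = pd.DataFrame({
    "text": ["hello", "sex", "beshi kore sex"],
    "label": [0, 1, 1],
})

# Build a Hugging Face datasets.Dataset directly from the DataFrame;
# SetFitTrainer expects this type rather than a torch Dataset.
hf_dataset = Dataset.from_pandas(df)

train_dataset = hf_dataset
eval_dataset = hf_dataset

From there, the rest of the script above (SetFitModel.from_pretrained, SetFitTrainer, trainer.train()) should work unchanged.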

Hope that helps.

maifeeulasad commented 1 year ago

@tomaarsen Thanks for this. I'm closing it now. But if I have any questions, I will send them your way.

Thanks a lot 🍸