huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

How to use DistilBERT with different datasets? #22817

Closed sauravtii closed 1 year ago

sauravtii commented 1 year ago

System Info

Who can help?

No response

Information

Tasks

Reproduction

I recently read this and was wondering how to use DistilBERT (which is pre-trained on the IMDB dataset) with a different dataset (e.g. this dataset)?

Expected behavior

DistilBERT should work with different datasets.
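
For context, the general pattern I have in mind is roughly the following sketch (the dataset name, column names, and training arguments are placeholders for whichever dataset gets swapped in, not anything from the linked tutorial):

from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
)

CHECKPOINT = "distilbert-base-uncased"
DATASET = "ag_news"  # placeholder: any text-classification dataset on the Hub

raw_datasets = load_dataset(DATASET)
tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)

def tokenize_function(examples):
    # "text" is the text column in this placeholder dataset; adjust per dataset
    return tokenizer(examples["text"], truncation=True)

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)

# infer the number of classes from the dataset's label feature
num_labels = raw_datasets["train"].features["label"].num_classes
model = AutoModelForSequenceClassification.from_pretrained(CHECKPOINT, num_labels=num_labels)

training_args = TrainingArguments(
    output_dir="out",
    num_train_epochs=3,
    per_device_train_batch_size=16,
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
    tokenizer=tokenizer,
)
trainer.train()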

amyeroberts commented 1 year ago

Hi, @sauravtii. Thanks for raising an issue!

In general, this is a question best placed in our forums. We try to reserve the GitHub issues for feature requests and bug reports.

I recommend looking at the NLP course which will take you through using and training tokenizers, datasets, and models.

sauravtii commented 1 year ago

@amyeroberts Thanks for your response. I was able to use DistilBERT with different datasets.

Now I am trying out this tutorial, which trains DistilBERT on the IMDB dataset (very similar to this tutorial). But I don't know why my accuracy isn't increasing, even after training for a significant amount of time and using the entire dataset. I have attached my client.py file below:

client.py:

from collections import OrderedDict
import warnings

import flwr as fl
import torch
import numpy as np

import random
from torch.utils.data import DataLoader

from datasets import load_dataset, load_metric

from transformers import AutoTokenizer, DataCollatorWithPadding
from transformers import AutoModelForSequenceClassification
from transformers import AdamW

warnings.filterwarnings("ignore", category=UserWarning)

DEVICE = "cuda:1"

CHECKPOINT = "distilbert-base-uncased"  # transformer model checkpoint

def load_data():
    """Load IMDB data (training and eval)"""
    raw_datasets = load_dataset("imdb")
    raw_datasets = raw_datasets.shuffle(seed=42)

    # remove unnecessary data split
    del raw_datasets["unsupervised"]

    tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)

    def tokenize_function(examples):
        return tokenizer(examples["text"], truncation=True)

    tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)

    tokenized_datasets = tokenized_datasets.remove_columns("text")
    tokenized_datasets = tokenized_datasets.rename_column("label", "labels")

    data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
    trainloader = DataLoader(
        tokenized_datasets["train"],
        shuffle=True,
        batch_size=32,
        collate_fn=data_collator,
    )

    testloader = DataLoader(
        tokenized_datasets["test"], batch_size=32, collate_fn=data_collator
    )

    return trainloader, testloader

def train(net, trainloader, epochs):
    # Standard fine-tuning loop over the client's local data
    optimizer = AdamW(net.parameters(), lr=5e-5)
    net.train()
    for i in range(epochs):
        print("Epoch: ", i+1)
        j = 1
        print("####################### The length of the trainloader is: ", len(trainloader))        
        for batch in trainloader:
            print("####################### The batch number is: ", j)
            batch = {k: v.to(DEVICE) for k, v in batch.items()}
            outputs = net(**batch)
            loss = outputs.loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
            j += 1

def test(net, testloader):
    # Evaluate on the held-out split, accumulating loss and accuracy per batch
    metric = load_metric("accuracy")
    loss = 0
    net.eval()
    for batch in testloader:
        batch = {k: v.to(DEVICE) for k, v in batch.items()}
        with torch.no_grad():
            outputs = net(**batch)
        logits = outputs.logits
        loss += outputs.loss.item()
        predictions = torch.argmax(logits, dim=-1)
        metric.add_batch(predictions=predictions, references=batch["labels"])
    loss /= len(testloader.dataset)
    accuracy = metric.compute()["accuracy"]
    return loss, accuracy

def main():
    net = AutoModelForSequenceClassification.from_pretrained(
        CHECKPOINT, num_labels=2
    ).to(DEVICE)

    trainloader, testloader = load_data()

    # Flower client
    class IMDBClient(fl.client.NumPyClient):
        def get_parameters(self, config):
            return [val.cpu().numpy() for _, val in net.state_dict().items()]

        def set_parameters(self, parameters):
            params_dict = zip(net.state_dict().keys(), parameters)
            state_dict = OrderedDict({k: torch.Tensor(v) for k, v in params_dict})
            net.load_state_dict(state_dict, strict=True)

        def fit(self, parameters, config):
            self.set_parameters(parameters)
            print("Training Started...")
            train(net, trainloader, epochs=1)
            print("Training Finished.")
            return self.get_parameters(config={}), len(trainloader), {}

        def evaluate(self, parameters, config):
            self.set_parameters(parameters)
            loss, accuracy = test(net, testloader)
            print({"loss": float(loss), "accuracy": float(accuracy)})
            return float(loss), len(testloader), {"loss": float(loss), "accuracy": float(accuracy)}

    # Start client
    fl.client.start_numpy_client(server_address="localhost:5040", client=IMDBClient())

if __name__ == "__main__":
    main()
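
For completeness, client.py expects a Flower server already listening on localhost:5040. A minimal server sketch along these lines works for testing (the number of rounds and the FedAvg strategy are just illustrative defaults, not necessarily what I run):

import flwr as fl

# Aggregate client updates with plain FedAvg for a few rounds
strategy = fl.server.strategy.FedAvg()

fl.server.start_server(
    server_address="localhost:5040",
    config=fl.server.ServerConfig(num_rounds=3),
    strategy=strategy,
)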

Can I get any help, please?

amyeroberts commented 1 year ago

Hi @sauravtii, glad to hear you were able to use a different dataset :)

As mentioned above, this is really a question best placed in our forums. We try to reserve the GitHub issues for feature requests and bug reports.

As a side note, training time and performance are relative. To help people help you in the forum, it's best to give as much information as possible, e.g. how long the model was training for, logs of the accuracy observed, and the behaviour you expect. In the shared script, it looks like the model is only training for a single epoch - I would start by increasing this first.
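
For example, a rough sketch of per-epoch logging, reusing DEVICE, AdamW, and the test() function from the client.py you shared (the epoch count and print format here are arbitrary):

def train_and_log(net, trainloader, testloader, epochs=5):
    # Same training loop as train() above, but evaluate and log after every epoch
    optimizer = AdamW(net.parameters(), lr=5e-5)
    for epoch in range(epochs):
        net.train()
        running_loss = 0.0
        for batch in trainloader:
            batch = {k: v.to(DEVICE) for k, v in batch.items()}
            outputs = net(**batch)
            outputs.loss.backward()
            optimizer.step()
            optimizer.zero_grad()
            running_loss += outputs.loss.item()
        eval_loss, eval_accuracy = test(net, testloader)
        print(
            f"epoch {epoch + 1}: "
            f"train_loss={running_loss / len(trainloader):.4f} "
            f"eval_loss={eval_loss:.4f} "
            f"eval_accuracy={eval_accuracy:.4f}"
        )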

sauravtii commented 1 year ago

@amyeroberts Thanks for your response. I tried searching for an answer to my question in the forums but wasn't able to find one, so I would really appreciate it if you could provide a link to the answer (if you find one in the forums).

Also, I have trained the model for a large number of epochs (ranging from 500 to 1000); the single epoch in the script is just for the sake of an example :)

amyeroberts commented 1 year ago

@sauravtii I don't know if there's an answer in the forums. What I'm suggesting is you post in the forums with your question and people in the community will be able to discuss with you there.

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.