Closed sauravtii closed 1 year ago
Hi, @sauravtii. Thanks for raising an issue!
In general, this is a question best placed in our forums. We try to reserve the GitHub issues for feature requests and bug reports.
I recommend looking at the NLP course which will take you through using and training tokenizers, datasets, and models.
@amyeroberts Thanks for your response. I was able to use DistilBERT with different datasets.
Now I am trying out this tutorial, which trains DistilBERT on the IMDB dataset (very similar to this tutorial). However, my accuracy isn't increasing, even after training for a significant amount of time and using the entire dataset. I have attached my client.py file below:
from collections import OrderedDict
import warnings

import flwr as fl
import torch
import numpy as np
import random
from torch.utils.data import DataLoader
from datasets import load_dataset, load_metric
from transformers import AutoTokenizer, DataCollatorWithPadding
from transformers import AutoModelForSequenceClassification
from transformers import AdamW

warnings.filterwarnings("ignore", category=UserWarning)

DEVICE = "cuda:1"
CHECKPOINT = "distilbert-base-uncased"  # transformer model checkpoint


def load_data():
    """Load IMDB data (training and eval)"""
    raw_datasets = load_dataset("imdb")
    raw_datasets = raw_datasets.shuffle(seed=42)

    # remove unnecessary data split
    del raw_datasets["unsupervised"]

    tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)

    def tokenize_function(examples):
        return tokenizer(examples["text"], truncation=True)

    tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
    tokenized_datasets = tokenized_datasets.remove_columns("text")
    tokenized_datasets = tokenized_datasets.rename_column("label", "labels")

    data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
    trainloader = DataLoader(
        tokenized_datasets["train"],
        shuffle=True,
        batch_size=32,
        collate_fn=data_collator,
    )
    testloader = DataLoader(
        tokenized_datasets["test"], batch_size=32, collate_fn=data_collator
    )
    return trainloader, testloader


def train(net, trainloader, epochs):
    optimizer = AdamW(net.parameters(), lr=5e-5)
    net.train()
    for i in range(epochs):
        print("Epoch: ", i+1)
        j = 1
        print("####################### The length of the trainloader is: ", len(trainloader))
        for batch in trainloader:
            print("####################### The batch number is: ", j)
            batch = {k: v.to(DEVICE) for k, v in batch.items()}
            outputs = net(**batch)
            loss = outputs.loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
            j += 1


def test(net, testloader):
    metric = load_metric("accuracy")
    loss = 0
    net.eval()
    for batch in testloader:
        batch = {k: v.to(DEVICE) for k, v in batch.items()}
        with torch.no_grad():
            outputs = net(**batch)
        logits = outputs.logits
        loss += outputs.loss.item()
        predictions = torch.argmax(logits, dim=-1)
        metric.add_batch(predictions=predictions, references=batch["labels"])
    loss /= len(testloader.dataset)
    accuracy = metric.compute()["accuracy"]
    return loss, accuracy


def main():
    net = AutoModelForSequenceClassification.from_pretrained(
        CHECKPOINT, num_labels=2
    ).to(DEVICE)
    trainloader, testloader = load_data()

    # Flower client
    class IMDBClient(fl.client.NumPyClient):
        def get_parameters(self, config):
            return [val.cpu().numpy() for _, val in net.state_dict().items()]

        def set_parameters(self, parameters):
            params_dict = zip(net.state_dict().keys(), parameters)
            state_dict = OrderedDict({k: torch.Tensor(v) for k, v in params_dict})
            net.load_state_dict(state_dict, strict=True)

        def fit(self, parameters, config):
            self.set_parameters(parameters)
            print("Training Started...")
            train(net, trainloader, epochs=1)
            print("Training Finished.")
            return self.get_parameters(config={}), len(trainloader), {}

        def evaluate(self, parameters, config):
            self.set_parameters(parameters)
            loss, accuracy = test(net, testloader)
            print({"loss": float(loss), "accuracy": float(accuracy)})
            return float(loss), len(testloader), {"loss": float(loss), "accuracy": float(accuracy)}

    # Start client
    fl.client.start_numpy_client(server_address="localhost:5040", client=IMDBClient())


if __name__ == "__main__":
    main()
Can I get any help, please?
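One detail in test() is worth checking, offered as a side observation rather than a diagnosis of the flat accuracy. For this model, outputs.loss is a mean over the batch (the default cross-entropy reduction), so summing it across batches and dividing by len(testloader.dataset) reports a loss roughly batch_size times too small. The accuracy computation is not affected. The sketch below uses plain Python and made-up numbers; both helper names are hypothetical.

```python
def mean_loss_per_batch(batch_mean_losses):
    """Average of per-batch mean losses: divide by the number of batches.

    Exact only when all batches have the same size; the last batch is
    often smaller, which makes this a close approximation in practice.
    """
    return sum(batch_mean_losses) / len(batch_mean_losses)


def mean_loss_as_in_script(batch_mean_losses, num_examples):
    """What the posted script computes: sum of batch means / dataset size."""
    return sum(batch_mean_losses) / num_examples


# Made-up numbers for illustration: 4 batches of 32 examples each,
# every batch reporting a mean loss of 0.69.
losses = [0.69, 0.69, 0.69, 0.69]
per_batch = mean_loss_per_batch(losses)
as_in_script = mean_loss_as_in_script(losses, num_examples=4 * 32)
print(per_batch, as_in_script)
```

In the script, the equivalent change would be dividing by len(testloader) instead of len(testloader.dataset).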
Hi @sauravtii, glad to hear you were able to use a different dataset :)
As mentioned above, this is really a question best placed in our forums. We try to reserve the GitHub issues for feature requests and bug reports.
As a side note, training time and performance are all relative. To help people help you in the forum, it's best to give as much information as possible, e.g. how long the model was training for, logs of the accuracy observed, and the behaviour you expect. In the shared script, it looks like the model is only training for a single epoch; I would start by increasing this.
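As a rough illustration of the kind of log that makes a forum post easier to answer, here is a minimal sketch in plain Python. It assumes you already have predictions and labels for each epoch or round; the helper names and all values are placeholders, not measurements from this thread. Since the IMDB test split is balanced across two classes, an accuracy that stays close to 0.5 suggests the model is predicting at chance level.

```python
def accuracy(predictions, references):
    """Fraction of predictions that match the reference labels."""
    correct = sum(1 for p, r in zip(predictions, references) if p == r)
    return correct / len(references)


def is_near_chance(acc, num_classes, tolerance=0.02):
    """True if accuracy is within `tolerance` of uniform random guessing.

    Only meaningful for a class-balanced evaluation set.
    """
    return abs(acc - 1.0 / num_classes) <= tolerance


def format_log(epoch, loss, acc):
    """One compact line per epoch that can be pasted into a forum post."""
    return f"epoch={epoch} loss={loss:.4f} accuracy={acc:.4f}"


# Placeholder values for illustration only.
example_predictions = [1, 0, 1, 1]
example_references = [1, 0, 0, 1]
acc = accuracy(example_predictions, example_references)
line = format_log(epoch=1, loss=0.6931, acc=acc)
print(line)
print("near chance:", is_near_chance(acc, num_classes=2))
```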
@amyeroberts Thanks for your response. I tried searching for the answer to my question in the forums but wasn't able to find one, so I would really appreciate it if you could provide me with the link to the answer (if you find one in the forums).
Also, I have trained the model for a large number of epochs (ranging from 500 to 1000); the single epoch in the script is just for the sake of an example :)
@sauravtii I don't know if there's an answer in the forums. What I'm suggesting is you post in the forums with your question and people in the community will be able to discuss with you there.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
System Info
transformers version: 4.11.3
Who can help?
No response
Information
Tasks
examples folder (such as GLUE/SQuAD, ...)
Reproduction
I recently read this and was wondering how to use DistilBERT (which is pre-trained with the IMDB dataset) with a different dataset (e.g. this dataset).
Expected behavior
DistilBERT should work with different datasets.
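For readers with the same question, the sketch below shows the parts that typically change when switching datasets: the name of the text column, the name of the label column, and the number of labels passed to the classification head. It is a sketch under assumptions, not a definitive recipe: the checkpoint name is the one used in the script posted in this thread, the functions make_tokenize_fn and prepare_dataset are hypothetical helpers, the dataset is assumed to have a "train" split, and the dataset and column names in the usage comment are placeholders to replace with your own.

```python
def make_tokenize_fn(tokenizer, text_column):
    """Return a function suitable for Dataset.map(..., batched=True)."""
    def tokenize_function(examples):
        return tokenizer(examples[text_column], truncation=True)
    return tokenize_function


def prepare_dataset(dataset_name, text_column, label_column,
                    checkpoint="distilbert-base-uncased"):
    """Tokenize a text-classification dataset and build a matching model.

    Assumes the dataset has a "train" split and integer class labels.
    """
    # Imported lazily so make_tokenize_fn can be used without these packages.
    from datasets import load_dataset
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    raw = load_dataset(dataset_name)
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)

    tokenized = raw.map(make_tokenize_fn(tokenizer, text_column), batched=True)
    tokenized = tokenized.remove_columns(text_column)
    if label_column != "labels":
        # The model's forward pass expects the target under the key "labels".
        tokenized = tokenized.rename_column(label_column, "labels")

    # Size the classification head from the labels seen in the train split.
    num_labels = len(set(raw["train"][label_column]))
    model = AutoModelForSequenceClassification.from_pretrained(
        checkpoint, num_labels=num_labels
    )
    return tokenized, tokenizer, model


# Usage (placeholder names; requires network access and the libraries above):
# tokenized, tokenizer, model = prepare_dataset(
#     "your_dataset_name", text_column="text", label_column="label"
# )
```

If the dataset has extra non-numeric columns besides the text column, they would also need to be removed before batching with DataCollatorWithPadding.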