patil-suraj / exploring-T5

A repo to explore different NLP tasks which can be solved using T5

Adding ByT5 notebook #13

Open mapmeld opened 3 years ago

mapmeld commented 3 years ago

Hi! I used your notebook as a starting point for fine-tuning a T5-based model (ByT5) with the latest versions of PyTorch Lightning, Transformers, etc. I also use the Datasets library instead of downloading the data from Stanford, so it's a little more adaptable. Feel free to update it, or let me know if this can be added as a new example notebook.

https://colab.research.google.com/drive/1syXmhEQ5s7C59zU8RtHVru0wAvMXTSQ8

janyfe commented 3 years ago

Hi @mapmeld. I've run your notebook. The fine-tuned byt5-small model always generates the 'negative' target, which leads to a test accuracy of 0.5. When I switch to t5-base, the fine-tuned model's behaviour and metrics become reasonable (test accuracy is around 0.8). Do you have any idea what is wrong with the ByT5 fine-tuning?

By the way, I have one suggestion: instead of slicing the decoded outputs, you can use tokenizer.decode(ids, skip_special_tokens=True).
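The decoding suggestion can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the `generated` list is a hand-built stand-in for what `model.generate` might return (a leading pad token as the decoder start, then the answer, then end-of-sequence), and it assumes the `transformers` package is installed.

```python
from transformers import ByT5Tokenizer

# ByT5 operates on raw bytes, so the tokenizer can be built
# without downloading a vocabulary file.
tokenizer = ByT5Tokenizer()

# Assumption: a stand-in for one generated sequence, i.e. pad token
# (decoder start), the bytes of the answer, then the end-of-sequence token.
generated = [tokenizer.pad_token_id] + tokenizer("negative").input_ids

# Plain decoding keeps the special tokens in the string, which is why
# the output would otherwise need to be sliced by hand.
with_specials = tokenizer.decode(generated)

# skip_special_tokens=True drops pad and end-of-sequence tokens.
without_specials = tokenizer.decode(generated, skip_special_tokens=True)

print(repr(with_specials))
print(repr(without_specials))
```

Letting the tokenizer strip the special tokens avoids hard-coding slice offsets, which can differ between tokenizers such as t5-base and byt5-small.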

jijo7 commented 1 year ago

Hi @janyfe, I would appreciate it if you could let me know how I can use this code for my IMDB dataset, which is in the following format:

# train data
import pandas as pd

# Columns are: index, sentiment, review. Reviews may contain commas,
# so everything after the second comma is re-joined into one field.
with open("train.csv", "r") as f:
    lines = [line.strip().split(",") for line in f]
lines = [[line[0], line[1], ",".join(line[2:])] for line in lines]
train = pd.DataFrame(lines[1:])  # skip the header row
train = train.drop(train.columns[0], axis=1)  # drop first column
print("\ntrain set size:", train.shape)
print("\nNumber of positives: ", train[1].astype(int).sum())
train = train.rename(columns={1: 'sentiment', 2: 'review'})
imdb_reviews = train["review"]
sentiments = train["sentiment"].astype(int).tolist()  # labels as a list of ints (0 or 1)

# test data
# Columns are: id, review. There is no sentiment column.
with open("test.csv", "r") as f:
    lines = [line.strip().split(",") for line in f]
lines = [[line[0], ",".join(line[1:])] for line in lines]
test = pd.DataFrame(lines[1:])  # skip the header row
id_test = test[0]
print("\ntest set:", test.shape)
test = pd.DataFrame(test[1])
print("Number of test sentences: {:,}\n".format(test.shape[0]))
test = test.rename(columns={1: 'review'})

The sentiments are 0 or 1. Also, my test set does not include the associated sentiments, i.e., it has no labels.

Best,
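
One possible way to adapt 0/1 labels and an unlabeled test set to a text-to-text setup is sketched below. It is only an illustration under stated assumptions: the helper names are hypothetical, and the target words "negative" and "positive" are assumed to match what the notebook trains the model to generate. The training set becomes (review, target word) pairs, the test set provides inputs only, and generated words are mapped back to 0/1 afterwards.

```python
# Hypothetical helpers; the names and the label words are assumptions.
LABEL_TO_TEXT = {0: "negative", 1: "positive"}
TEXT_TO_LABEL = {text: label for label, text in LABEL_TO_TEXT.items()}


def make_training_pairs(reviews, sentiments):
    """Pair each review with the target word the model should generate."""
    return [
        {"source": review, "target": LABEL_TO_TEXT[int(label)]}
        for review, label in zip(reviews, sentiments)
    ]


def make_inference_inputs(reviews):
    """The unlabeled test set only needs source texts, no targets."""
    return [{"source": review} for review in reviews]


def predictions_to_labels(generated_texts):
    """Map generated words back to 0/1; unrecognized outputs become None."""
    return [TEXT_TO_LABEL.get(text.strip().lower()) for text in generated_texts]


pairs = make_training_pairs(["Great film, loved it", "Dull and far too long"], [1, 0])
print(pairs[0]["target"], pairs[1]["target"])
print(predictions_to_labels(["positive", " Negative", "neg"]))
```

With the variables from the snippet above, `make_training_pairs(imdb_reviews, sentiments)` would give the training examples and `make_inference_inputs(test["review"])` the test inputs. Since the test set has no labels, accuracy can only be measured on a validation split held out from the training data.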