google-research / text-to-text-transfer-transformer

Code for the paper "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer"
https://arxiv.org/abs/1910.10683
Apache License 2.0

Fill-in-the-Blank Text Generation #133

Closed: nkrnrnk closed this issue 4 years ago

nkrnrnk commented 4 years ago

Hi,

Would it be possible to share the pre-trained models and/or scripts for "Fill-in-the-Blank Text Generation" mentioned in this blog post? https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html

adarob commented 4 years ago

I don't believe it will be possible to share the pretrained models. However, I can probably add the preprocessor I used. You would need to rebuild the C4 dataset or use something similar to get the best results though.
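
In the meantime, the sentinel-token format that T5 is pretrained on (span corruption) already behaves like fill-in-the-blank: each blank is marked with a sentinel token (<extra_id_0>, <extra_id_1>, ...) and the model generates the replacement text for each sentinel in order. Below is a rough sketch using the public Hugging Face T5 checkpoint; this is not the preprocessor or task from the blog post, and output from the plain pretrained model will be noisier than a fine-tuned one.

from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained('t5-base')
model = T5ForConditionalGeneration.from_pretrained('t5-base')

# Mark the blank with a sentinel token; T5 generates text of the form
# "<extra_id_0> ... <extra_id_1>" containing the fill for each sentinel.
sentence = "Tom has fully <extra_id_0> illness."
input_ids = tokenizer(sentence, return_tensors='pt').input_ids

outputs = model.generate(input_ids, max_length=20, num_beams=4)
print(tokenizer.decode(outputs[0], skip_special_tokens=False))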

ramsrigouthamg commented 4 years ago

@franz101 The closest thing I could come up with:

from transformers import RobertaTokenizer, RobertaForMaskedLM
import torch

tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
model = RobertaForMaskedLM.from_pretrained('roberta-base')

sentence = "Tom has fully <mask> <mask> <mask> illness."

# Encode the sentence and show how it gets tokenized.
token_ids = tokenizer.encode(sentence, return_tensors='pt')
tokens = tokenizer.tokenize(sentence)
print(tokens)

# Positions of the <mask> tokens in the encoded sequence.
masked_position = (token_ids.squeeze() == tokenizer.mask_token_id).nonzero()
masked_pos = [mask.item() for mask in masked_position]
print(masked_pos)

with torch.no_grad():
    output = model(token_ids)

# Logits over the vocabulary for every position in the sequence.
logits = output[0].squeeze()

print("\n\nsentence : ", sentence, "\n")

# For each masked position, take the 5 most likely tokens.
list_of_list = []
for mask_index in masked_pos:
    mask_logits = logits[mask_index]
    idx = torch.topk(mask_logits, k=5, dim=0)[1]
    words = [tokenizer.decode(i.item()).strip() for i in idx]
    list_of_list.append(words)
    print(words)

# Naive guess: take the top prediction for each mask independently.
best_guess = ""
for j in list_of_list:
    best_guess = best_guess + " " + j[0]

print("\nBest guess for fill in the blank :::", best_guess)

The output is:

['Tom', 'Ġhas', 'Ġfully', '<mask>', '<mask>', '<mask>', 'Ġillness', '.']
[4, 5, 6]

sentence :  Tom has fully <mask> <mask> <mask> illness.

['recovered', 'returned', 'recover', 'healed', 'cleared']
['from', 'his', 'with', 'to', 'the']
['his', 'the', 'her', 'mental', 'this']

Best guess for fill in the blank ::: recovered from his

sohmukherjee commented 4 years ago

Is the fill-in-the-blanks task mentioned in the blog post not covered by the pre-trained models that have already been released?

adarob commented 4 years ago

The task was added in https://github.com/google-research/text-to-text-transfer-transformer/commit/1253c0644f1f122646df391ffe64711f9dc63358. We have not released the pretrained model at this point, although I can look into it.

RibhuRoy commented 4 years ago

It would be really helpful if you could release the pretrained model for the fill-in-the-blanks task so that we can run inference and quickly check it against our use case. Please look into it.

iliemihai commented 4 years ago

Any news on releasing the pretrained model? :D

naveenjafer commented 4 years ago

@ramsrigouthamg Thank you for the code, but I have a concern. When predicting multiple masks, especially ones in close proximity, the model fills in all of the masks simultaneously, without any knowledge of what the final predictions for the other masks will be, right? In the example you gave it comes up with a sensible completion, but in some cases I tried it ends up repeating words or producing a completely illogical sequence. I was wondering whether this is expected or whether I was doing something wrong.
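
One possible workaround is to fill the masks one at a time, left to right, re-running the model after each substitution so that later predictions can condition on the earlier choices. A rough, untested sketch (greedy decoding, and just one possible fill order), reusing the same RoBERTa setup as above:

from transformers import RobertaTokenizer, RobertaForMaskedLM
import torch

tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
model = RobertaForMaskedLM.from_pretrained('roberta-base')

sentence = "Tom has fully <mask> <mask> <mask> illness."
token_ids = tokenizer.encode(sentence, return_tensors='pt')

# Fill one mask per pass, so each prediction sees the previously filled tokens.
while (token_ids == tokenizer.mask_token_id).any():
    with torch.no_grad():
        logits = model(token_ids)[0]
    # Position of the first remaining <mask> token.
    mask_pos = (token_ids.squeeze() == tokenizer.mask_token_id).nonzero()[0].item()
    # Greedily take the most likely token for that position.
    predicted_id = logits[0, mask_pos].argmax().item()
    token_ids[0, mask_pos] = predicted_id

print(tokenizer.decode(token_ids.squeeze(), skip_special_tokens=True))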

chrisrytting commented 4 years ago

@adarob Sorry to pester you once again about the weights for the fill-in-the-blank task, but we'd love to use them. How likely is it that they will be released? It sounded open-ended the last time you commented on this issue. If it's at all likely, do you have a sense of when? Thanks!

Kostis-S-Z commented 3 years ago

@adarob @craffel any update on releasing these weights? :)