neuralmind-ai / portuguese-bert

Portuguese pre-trained BERT models

AttributeError: You tried to generate sequences with a model that does not have a LM Head. #6

Closed paulogaspar closed 4 years ago

paulogaspar commented 4 years ago

I'm getting an error when I use your model with the code suggested in the wiki:

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-portuguese-cased_pytorch_checkpoint')
model = AutoModel.from_pretrained('bert-base-portuguese-cased_pytorch_checkpoint')

input_ids = torch.tensor(tokenizer.encode('O que eu mais quero é ')).unsqueeze(0)
outputs = model.generate(input_ids=input_ids, do_sample=True, num_beams=5, num_return_sequences=1, temperature=1.5)

The error is:

Traceback (most recent call last):
  File "bertpt.py", line 20, in <module>
    outputs = model.generate(input_ids=input_ids, do_sample=True, num_beams=5, num_return_sequences=1, temperature=1.5)
  File "/home/paulo/.local/lib/python3.6/site-packages/torch/autograd/grad_mode.py", line 49, in decorate_no_grad
    return func(*args, **kwargs)
  File "/home/paulo/.local/lib/python3.6/site-packages/transformers/modeling_utils.py", line 682, in generate
    "You tried to generate sequences with a model that does not have a LM Head."
AttributeError: You tried to generate sequences with a model that does not have a LM Head. Please use another model class (e.g. OpenAIGPTLMHeadModel, XLNetLMHeadModel, GPT2LMHeadModel, CTRLLMHeadModel, T5WithLMHeadModel, TransfoXLLMHeadModel)

paulogaspar commented 4 years ago

If I use this instead:

model = AutoModelWithLMHead.from_pretrained('bert-base-portuguese-cased_pytorch_checkpoint')

The decoded output is this:

Generated 0: o que eu mais quero e no no no no no no no no no no no
Generated 1: o que eu mais quero e no no no no no no no no no no no
Generated 2: o que eu mais quero e no no no no no no no no no no no

fabiocapsouza commented 4 years ago

Hi Paulo,

I believe the checkpoints were uploaded without the LM Head weights by mistake. I'll confirm whether that is really the case and update the links. Thanks for reporting this issue.

paulogaspar commented 4 years ago

Hey! Thanks for replying. Did you update it after all?

fabiocapsouza commented 4 years ago

Hi Paulo,

I inspected the uploaded models and they are fine; they are not missing the weights of any layer. The issue is that BERT models are not compatible with the generate method you tried to use. BERT is an encoder-only model, so it does not have the LM decoder needed to generate sequences from left to right. BERT can generate text only through the masked language modeling scheme it was trained on: mask some tokens of the input and ask BERT to predict them back. Here is an example:

import torch
from transformers import BertForMaskedLM, BertTokenizer

model = BertForMaskedLM.from_pretrained('neuralmind/bert-base-portuguese-cased')
tokenizer = BertTokenizer.from_pretrained('neuralmind/bert-base-portuguese-cased')

def predict_masked_text(model, tokenizer, masked_text):
    # Encode the text (the tokenizer adds [CLS] and [SEP]) and run it through the MLM head
    input_ids = tokenizer.encode(masked_text, return_tensors='pt')
    device = next(model.parameters()).device
    token_logits = model(input_ids.to(device))[0]
    # Take the most likely token at every position, dropping [CLS] and [SEP] added by the tokenizer
    pred_tokens = token_logits.argmax(dim=-1)[0, 1:-1]

    return tokenizer.decode(pred_tokens)

# Masking the sentence "Apesar de eu já ter almoçado, eu ainda estava com fome."
masked_text = '[MASK] de eu já ter almoçado, eu ainda estava com [MASK].'

predict_masked_text(model, tokenizer, masked_text)
# 'Apesar de eu já ter almoçado, eu ainda estava com fome'

It is also possible to get the top-k predictions:

masked_text = '[MASK] de eu já ter almoçado, eu ainda estava com [MASK].'
input_ids = tokenizer.encode(masked_text, return_tensors='pt')
# Positions of the [MASK] tokens in the encoded input
masked_ixs = torch.nonzero(input_ids[0] == tokenizer.mask_token_id).squeeze()
device = next(model.parameters()).device
token_logits = model(input_ids.to(device))[0][0]

# Top-10 token ids for each masked position
topk_tokens = torch.topk(token_logits[masked_ixs], 10).indices

for topk_ids in topk_tokens:
    print(tokenizer.decode(topk_ids))

# Apesar Depois Antes Além apesar depois antes além Mesmo Independente
# fome sono ela medo vontade sede ele frio problemas eles
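
For quick experiments, the fill-mask pipeline from transformers wraps the same masked LM and returns the top candidates with scores. A minimal sketch, assuming a recent transformers version (older releases may accept only a single [MASK] per input and the result keys can vary slightly):

from transformers import pipeline

# Sketch: the fill-mask pipeline runs the masked LM and returns the top
# candidate completions for a [MASK] token, each with a score.
fill_mask = pipeline(
    'fill-mask',
    model='neuralmind/bert-base-portuguese-cased',
    tokenizer='neuralmind/bert-base-portuguese-cased',
)

# Single-mask input to stay compatible with older pipeline versions
for prediction in fill_mask('Apesar de eu já ter almoçado, eu ainda estava com [MASK].'):
    print(prediction['sequence'], prediction['score'])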