stefan-it / turkish-bert

Turkish BERT/DistilBERT, ELECTRA and ConvBERT models

fill-mask #15

Closed savasy closed 4 years ago

savasy commented 4 years ago

When I apply fill-mask with bert-base-turkish as follows:

from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline
model = AutoModelForSequenceClassification.from_pretrained("dbmdz/bert-base-turkish-cased")
tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-turkish-cased")
fm=pipeline("fill-mask", model=model, tokenizer=tokenizer)
fm("merhaba ben <mask> iyiyim")

I get the following error:

ValueError                                Traceback (most recent call last)
in ()
----> 1 fm("merhaba ben <mask> iyiyim")

/usr/local/lib/python3.6/dist-packages/transformers/pipelines.py in __call__(self, *args, **kwargs)
    795                 values, predictions = topk.values.numpy(), topk.indices.numpy()
    796             else:
--> 797                 masked_index = (input_ids == self.tokenizer.mask_token_id).nonzero().item()
    798                 logits = outputs[i, masked_index, :]
    799                 probs = logits.softmax(dim=0)

ValueError: only one element tensors can be converted to Python scalars
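For context on why this particular ValueError appears: since `<mask>` is not BERT's mask token, the tokenizer splits it into ordinary subwords, `mask_token_id` never occurs in the input ids, `nonzero()` returns an empty tensor, and `.item()` refuses to convert it to a scalar. A minimal sketch of that failure mode (assuming PyTorch; the ids and the mask id `4` are made-up stand-ins):

```python
import torch

# Stand-in input ids; assume the mask token id is 4 and, as in the issue,
# it never appears because "<mask>" was split into ordinary subwords.
input_ids = torch.tensor([2, 17, 256, 3])
mask_positions = (input_ids == 4).nonzero()

print(mask_positions.numel())  # 0 -- no mask position found
try:
    mask_positions.item()      # .item() requires exactly one element
except (ValueError, RuntimeError) as err:
    print("conversion failed:", err)
```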
stefan-it commented 4 years ago

Hi @savasy ,

for mask filling you need a special head (a language modeling head) on top of the BERT model:

https://huggingface.co/transformers/model_doc/bert.html#bertformaskedlm

When using the Auto* classes, this is the AutoModelWithLMHead class. Here's a full working example:

from transformers import AutoModelWithLMHead, AutoTokenizer, pipeline

model_name = "dbmdz/bert-base-turkish-cased"

model = AutoModelWithLMHead.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

nlp = pipeline("fill-mask", model=model, tokenizer=tokenizer)
nlp("merhaba ben [MASK] iyiyim")

The masking token for BERT is [MASK]; this can be checked with tokenizer.mask_token :)
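Because the mask token differs between model families (BERT uses [MASK], RoBERTa-style models use <mask>), one robust pattern is to build the input text from tokenizer.mask_token rather than hard-coding it. A small sketch of that idea (build_masked_text is a hypothetical helper, not part of transformers; in practice you would pass tokenizer.mask_token as the second argument):

```python
# Hypothetical helper: fill in the tokenizer's own mask token instead of
# hard-coding "[MASK]" or "<mask>", which differ between model families.
def build_masked_text(template: str, mask_token: str) -> str:
    return template.format(mask=mask_token)

# With a BERT tokenizer, tokenizer.mask_token would be "[MASK]".
print(build_masked_text("merhaba ben {mask} iyiyim", "[MASK]"))
# merhaba ben [MASK] iyiyim
```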

Then the example should work:

[{"sequence": "[CLS] merhaba ben çok iyiyim [SEP]",
  "score": 0.4122845530509949,
  "token": 2140},
 {"sequence": "[CLS] merhaba ben daha iyiyim [SEP]",
  "score": 0.13173197209835052,
  "token": 2171},
 {"sequence": "[CLS] merhaba ben gayet iyiyim [SEP]",
  "score": 0.12043964117765427,
  "token": 7982},
 {"sequence": "[CLS] merhaba ben oldukça iyiyim [SEP]",
  "score": 0.03267306089401245,
  "token": 3523},
 {"sequence": "[CLS] merhaba ben gerçekten iyiyim [SEP]",
  "score": 0.03199344128370285,
  "token": 4036}]
savasy commented 4 years ago

Ahh sorry, I used the wrong model class and the wrong mask token, shame on me. Thank you @stefan-it, I appreciate it!