huggingface / transformers

đŸ¤— Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

handle_impossible_answer not working in the question answering pipeline for RoBERTa models #10810

Closed - mmaslankowska-neurosys closed this issue 3 years ago

mmaslankowska-neurosys commented 3 years ago

Environment info

The issue

I'm using pipeline("question-answering") with QA models downloaded from the community. I'm evaluating the models on the SQuAD 2.0 dataset, which doesn't always have an answer to the given question - that's what the handle_impossible_answer flag in the pipeline is for.

I noticed that RoBERTa models (any RoBERTa, not just a specific checkpoint) in version 4 of transformers always produce an answer despite the handle_impossible_answer flag - even when the same model on the same example did return no answer (i.e. returned "" as the answer) under version 3 of the library.

from transformers import AutoModelForQuestionAnswering, AutoTokenizer, pipeline

bert_model_name = 'deepset/bert-base-cased-squad2'
roberta_model_name = 'deepset/roberta-base-squad2'

bert_tokenizer = AutoTokenizer.from_pretrained(bert_model_name)
bert_model = AutoModelForQuestionAnswering.from_pretrained(bert_model_name, return_dict=True)
roberta_tokenizer = AutoTokenizer.from_pretrained(roberta_model_name)
roberta_model = AutoModelForQuestionAnswering.from_pretrained(roberta_model_name, return_dict=True)

bert_qa = pipeline('question-answering', tokenizer=bert_tokenizer, model=bert_model)
roberta_qa = pipeline('question-answering', tokenizer=roberta_tokenizer, model=roberta_model)

# Random SQuAD 2.0 example which doesn't have an answer to the question
question = 'What was the name of the only ship operating in the Indian Ocean?'
context = 'In September 1695, Captain Henry Every, an English pirate on board the Fancy, reached the Straits of Bab-el-Mandeb, where he teamed up with five other pirate captains to make an attack on the Indian fleet making the annual voyage to Mocha. The Mughal convoy included the treasure-laden Ganj-i-Sawai, reported to be the greatest in the Mughal fleet and the largest ship operational in the Indian Ocean, and its escort, the Fateh Muhammed. They were spotted passing the straits en route to Surat. The pirates gave chase and caught up with Fateh Muhammed some days later, and meeting little resistance, took some £50,000 to £60,000 worth of treasure.'

print(bert_qa(question=question, context=context, handle_impossible_answer=True))
# transformers 3.5.0: {'score': 0.999398410320282, 'start': 0, 'end': 0, 'answer': ''}
# transformers 4.3.2: {'score': 0.999398410320282, 'start': 0, 'end': 0, 'answer': ''}

print(roberta_qa(question=question, context=context, handle_impossible_answer=True))
# transformers 3.5.0: {'score': 0.979897797107697, 'start': 0, 'end': 0, 'answer': ''}
# transformers 4.3.2: {'score': 0.222181886434555, 'start': 422, 'end': 436, 'answer': 'Fateh Muhammed'}

Probable cause of the issue

I've found that in the question_answering.py pipeline file in version 4 of transformers there is a condition that prevents RoBERTa models from adjusting the p_mask for this task. It looks simply like this: if self.tokenizer.cls_token_id. Since RoBERTa's cls_token_id = 0, the condition isn't met and the p_mask isn't changed for the cls_token. This results in the token being excluded while answering the question (it behaves as if the token were, e.g., part of the question). BERT's cls_token_id = 101, for example, so for BERT the condition is met.
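
For reference, a minimal sketch of the truthiness problem (using the same community checkpoints as above; the token ids are the ones mentioned in this issue and can be confirmed by loading the tokenizers):

from transformers import AutoTokenizer

roberta_tokenizer = AutoTokenizer.from_pretrained('deepset/roberta-base-squad2')
bert_tokenizer = AutoTokenizer.from_pretrained('deepset/bert-base-cased-squad2')

print(roberta_tokenizer.cls_token_id)  # 0   -> `if 0:` is falsy, so the CLS position is never unmasked
print(bert_tokenizer.cls_token_id)     # 101 -> `if 101:` is truthy, so the CLS position is unmasked

# With the CLS position left at p_mask = 1, the null ("impossible answer") span
# can never be selected, so the pipeline always returns a non-empty answer.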

Plausible solution

A possibly easy solution is to expand the condition to if self.tokenizer.cls_token_id is not None. However, there was no such condition in version 3 at all, so maybe it performs some crucial function in its current form that I'm not aware of...

# originally the condition here was more general and looked like this
# if self.tokenizer.cls_token_id:

if self.tokenizer.cls_token_id is not None:  
    cls_index = np.nonzero(encoded_inputs["input_ids"] == self.tokenizer.cls_token_id)
    p_mask[cls_index] = 0
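
To illustrate what the corrected condition changes, here is a small self-contained sketch with a made-up input_ids array (not a real tokenization); it only mimics the p_mask handling shown above:

import numpy as np

cls_token_id = 0                                  # RoBERTa-style CLS id
input_ids = np.array([0, 713, 16, 10, 864, 2])    # hypothetical sequence starting with CLS
p_mask = np.ones_like(input_ids)                  # 1 = token cannot be part of the answer

if cls_token_id is not None:                      # passes even though the id is 0
    cls_index = np.nonzero(input_ids == cls_token_id)
    p_mask[cls_index] = 0                         # the CLS position becomes a legal (null) answer span

print(p_mask)  # [0 1 1 1 1 1] -> the impossible answer can now be predicted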
LysandreJik commented 3 years ago

Hi! I believe this was an oversight on our part. Your change looks reasonable to me - do you want to open a PR with your proposed fix?

And thank you for opening such a detailed and well-structured issue!