google-research / text-to-text-transfer-transformer

Code for the paper "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer"
https://arxiv.org/abs/1910.10683
Apache License 2.0

NER results don't make sense when I fine-tune the model #366

Closed · vkaul11 closed this 4 years ago

vkaul11 commented 4 years ago

I got bad results for NER with the small model after 10,000 fine-tuning steps. Is there a way to ensure that the prediction contains only text from the input, as in standard NER (see the sketch after the examples)? Otherwise the results seem to make no sense.

Here are some randomly sampled results:

Input: ner: how old is musician paul mayer Target: paul mayer; Prediction: impeller; Counted as Correct? False

Input: ner: rectifier stack Target: rectifier stack; Prediction: rabbits; Counted as Correct? False

Input: ner: the call of the simpsons Target: call of the simpsons; Prediction: tax shelters; Counted as Correct? False

Input: ner: three rivers casino ranking Target: three rivers casino; Prediction: tom reed; Counted as Correct? False

Input: ner: social welfare department Target: social welfare department; Prediction: bmw 135i; Counted as Correct? False

Input: ner: stationary front Target: stationary front; Prediction: the tudors; Counted as Correct? False

Input: ner: aquilaria Target: aquilaria; Prediction: czechslovakia; Counted as Correct? False

Input: ner: what is a point guard Target: point guard; Prediction: joe sugg; Counted as Correct? False

Input: ner: how do second cousins work Target: second cousins; Prediction: greenfield central; Counted as Correct? False

Input: ner: the w las vegas Target: w las vegas; Prediction: wayland university; Counted as Correct? False
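
To make the question concrete, a minimal post-hoc guard (a hypothetical helper, not something in this repo) could reject any prediction that is not a verbatim substring of the input:

```python
def filter_ner_prediction(input_text, prediction):
    """Hypothetical post-hoc guard: keep a decoded span only if it
    occurs verbatim in the (prefix-stripped) input, since T5 decodes
    free-form text and can hallucinate unrelated strings."""
    prefix = "ner: "
    source = input_text[len(prefix):] if input_text.startswith(prefix) else input_text
    return prediction if prediction in source else ""

# On the examples above:
filter_ner_prediction("ner: rectifier stack", "rabbits")        # -> ""
filter_ner_prediction("ner: the w las vegas", "w las vegas")    # -> "w las vegas"
```

This would only suppress hallucinated spans rather than fix them; ideally the decoding itself would be constrained.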

craffel commented 4 years ago

No, constraining the output is not currently supported in this codebase.
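
Outside this repo, the HuggingFace port of T5 can approximate this with `generate`'s `prefix_allowed_tokens_fn`, which restricts each decoding step to a given set of token ids. A rough sketch, restricting decoding to tokens that appear in the input (an approximation only: token-level constraints do not guarantee the output is a contiguous span of the input):

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

text = "ner: how old is musician paul mayer"
inputs = tokenizer(text, return_tensors="pt")

# Allow only tokens that appear in the input, plus EOS so decoding can stop.
allowed = sorted(set(inputs.input_ids[0].tolist()) | {tokenizer.eos_token_id})

def allow_input_tokens(batch_id, generated_ids):
    # Called at each decoding step: return the token ids the model may emit next.
    return allowed

outputs = model.generate(
    **inputs,
    num_beams=4,
    max_length=16,
    prefix_allowed_tokens_fn=allow_input_tokens,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```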