Open tairen99 opened 1 year ago
Hi all and @liminghao1630,
Thank you again for your wonderful work!
I don't know whether we can get word-level inference confidence from the source code. For example, in "pic_inference.py" we can get the recognition results for words, but if some handwritten words are cursive, the recognition may be wrong.
Is there some way to tell whether a recognition result is good or not, such as a confidence score?
Thank you in advance!
Hi @tairen99 did you find a way to get the confidence score?
Hi @tairen99 @acharjee07,
Is there any option of getting that confidence score??
Thanks all
@tairen99 @acharjee07 @mifuegon I would like to know how to get confidence score of TrOCR. Did you guys find a solution?
also interested
So, this answer is for getting confidence scores with HuggingFace, but people here may find it useful as well, so posting anyways.
When calling model.generate() on the TrOCR VisionEncoderDecoder model, you need to set both output_scores=True and return_dict_in_generate=True in order to obtain scores:
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-handwritten").to(device)
...
return_dict = model.generate(pixel_values, output_scores=True, return_dict_in_generate=True)
However, TrOCR uses greedy search by default, which does not provide a single score but one tensor of scores per generation step. See the HF docs on GreedySearchEncoderDecoderOutput: in the returned dictionary, 'sequences' is the generated sequence, and 'scores' is a tuple holding the pre-softmax logits for each generated token. It is not immediately obvious how to combine these into an overall score for the whole sequence.
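One way to combine the per-step greedy scores into a single number is to softmax each step's logits, pick out the probability of the token that was actually generated, and take the geometric mean. This is a minimal sketch, not part of TrOCR itself; it assumes the sequences start with exactly one decoder start token and ignores any padding after EOS:

```python
import torch
import torch.nn.functional as F

def greedy_sequence_confidence(sequences, scores, start_len=1):
    """Collapse greedy-search step scores into one confidence per sequence.

    `sequences` is the (batch, seq_len) tensor from generate();
    `scores` is the tuple of (batch, vocab) logit tensors, one per step.
    Returns the geometric mean of the chosen tokens' probabilities.
    """
    # (steps, batch, vocab): per-step probability distributions
    probs = F.softmax(torch.stack(scores, dim=0), dim=-1)
    # tokens generated at each step, skipping the decoder start token
    tokens = sequences[:, start_len:].transpose(0, 1)  # (steps, batch)
    token_probs = probs.gather(-1, tokens.unsqueeze(-1)).squeeze(-1)
    # geometric mean = exp(mean(log p)), i.e. length-normalized confidence
    return token_probs.log().mean(dim=0).exp()
```

The geometric mean keeps the result comparable across sequences of different lengths; a plain product would penalize longer words.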
Instead, you can switch the model from greedy search to beam search by changing the model config as such:
model.config.max_length = 10
model.config.early_stopping = True
model.config.no_repeat_ngram_size = 3
model.config.length_penalty = 2.0
model.config.num_beams = 4
Beam search also provides an additional 'sequences_scores' key in the returned dictionary that represents the confidence for the whole sequence. See the docs for BeamSearchEncoderDecoderOutput for more info. These sequence scores are (length-penalized) log-probabilities, so exponentiating them gives you probabilities/confidences.
In short, you can get the confidence scores as follows (the config adjustments are from Niels Rogge's TrOCR fine-tuning tutorial):
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-handwritten")
# set special tokens used for creating the decoder_input_ids from the labels
model.config.decoder_start_token_id = processor.tokenizer.cls_token_id
model.config.pad_token_id = processor.tokenizer.pad_token_id
# make sure vocab size is set correctly
model.config.vocab_size = model.config.decoder.vocab_size
# set beam search parameters
model.config.eos_token_id = processor.tokenizer.sep_token_id
model.config.max_length = 10
model.config.early_stopping = True
model.config.no_repeat_ngram_size = 3
model.config.length_penalty = 2.0
model.config.num_beams = 4
pixel_values = processor(im, return_tensors="pt").pixel_values
return_dict = model.generate(pixel_values, output_scores=True, return_dict_in_generate=True)
ids, scores = return_dict['sequences'], return_dict['sequences_scores']
generated_text = processor.batch_decode(ids, skip_special_tokens=True)[0]
I'm fairly new to HuggingFace and VisionEncoderDecoder models as well, so I hope others improve on this answer, but hope this helps.
@semihcanturk
Even though my model is configured with the settings you provided:
trocr_model.config.eos_token_id = trocr_processor.tokenizer.sep_token_id
trocr_model.config.max_length = 10
trocr_model.config.early_stopping = True
trocr_model.config.no_repeat_ngram_size = 3
trocr_model.config.length_penalty = 2.0
trocr_model.config.num_beams = 4
still, for a batched input (128 images at once), I am getting a GreedySearchEncoderDecoderOutput and thus no scores. For a single image it works fine, though. Also, I'm using a GPU for the computation, for your information.
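One thing worth trying when the config settings are not picked up at generation time is passing the beam-search arguments directly to generate(); the argument names below are from the standard HF generate() API, and this is a sketch rather than a confirmed fix for the batched case:

```python
# Force beam search per call instead of relying on model.config:
outputs = model.generate(
    pixel_values,              # (batch, 3, H, W) preprocessed image batch
    num_beams=4,
    early_stopping=True,
    no_repeat_ngram_size=3,
    length_penalty=2.0,
    max_length=10,
    output_scores=True,
    return_dict_in_generate=True,
)
scores = outputs.sequences_scores  # only present when beam search actually ran
```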
I got a tensor value, but when I try to convert it with a softmax function I always get tensor([1.]); maybe that's because of the greedy search that TrOCR does, since the tensor value is very low.
I did it with
generated_ids = OCR_model.generate(pixel_values, output_logits=True, return_dict_in_generate=True)
ids, scores = generated_ids['sequences'], generated_ids['logits']
and then
concat = torch.cat(scores, dim=0)
confianza = (torch.mean(torch.max(F.softmax(concat, dim=1), dim=1).values)).item()
Hi @wolfshow and all,
Thank you for your excellent work on text recognition.
I am trying to get word-level confidence with your source code, without using the Hugging Face Transformers version, because it has some issues with the current git repo.
Can you please point me to some way that I can do in your "pic_inference.py"?
Thank you in advance!