Open tairen99 opened 1 year ago
Hi all and @liminghao1630,
Thank you again for your wonderful work!
I don't know whether we can get word-level inference confidence from the source code. For example, in "pic_inference.py" we can get the recognition results for words, but if some handwritten words are cursive, the recognition may be wrong.
Is there some way to tell whether a recognition result is good or not, such as a confidence score?
Thank you in advance!
Hi @tairen99 did you find a way to get the confidence score?
Hi @tairen99 @acharjee07,
Is there any option of getting that confidence score??
Thanks all
@tairen99 @acharjee07 @mifuegon I would like to know how to get confidence score of TrOCR. Did you guys find a solution?
also interested
So, this answer is for getting confidence scores with HuggingFace, but people here may find it useful as well, so posting anyways.
When calling model.generate() on the TrOCR VisionEncoderDecoder model, you need to set both output_scores=True and return_dict_in_generate=True in order to obtain scores:
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-handwritten").to(device)
...
return_dict = model.generate(pixel_values, output_scores=True, return_dict_in_generate=True)
However, TrOCR uses greedy search by default, which does not provide a single score but one tensor of scores per generation step. See the HF docs on GreedySearchEncoderDecoderOutput: in the returned dictionary, 'sequences' is the generated sequence, and 'scores' is a tuple holding the pre-softmax logits for each generated token. It is not immediately obvious how to combine these into an overall score for the whole sequence.
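One way to combine the per-step greedy scores into a single number is to softmax each step's logits, pick out the probability of the token that was actually generated, and take the geometric mean. This is a minimal sketch, not part of TrOCR itself; it assumes the sequences start with exactly one decoder start token and ignores any padding after EOS:

```python
import torch
import torch.nn.functional as F

def greedy_sequence_confidence(sequences, scores, start_len=1):
    """Collapse greedy-search step scores into one confidence per sequence.

    `sequences` is the (batch, seq_len) tensor from generate();
    `scores` is the tuple of (batch, vocab) logit tensors, one per step.
    Returns the geometric mean of the chosen tokens' probabilities.
    """
    # (steps, batch, vocab): per-step probability distributions
    probs = F.softmax(torch.stack(scores, dim=0), dim=-1)
    # tokens generated at each step, skipping the decoder start token
    tokens = sequences[:, start_len:].transpose(0, 1)  # (steps, batch)
    token_probs = probs.gather(-1, tokens.unsqueeze(-1)).squeeze(-1)
    # geometric mean = exp(mean(log p)), i.e. length-normalized confidence
    return token_probs.log().mean(dim=0).exp()
```

The geometric mean keeps the result comparable across sequences of different lengths; a plain product would penalize longer words.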
Instead, you can switch the model from greedy search to beam search by changing the model config as such:
model.config.max_length = 10
model.config.early_stopping = True
model.config.no_repeat_ngram_size = 3
model.config.length_penalty = 2.0
model.config.num_beams = 4
Beam search also provides an additional 'sequences_scores' key in the returned dictionary that represents the confidence for the whole sequence. See the docs for BeamSearchEncoderDecoderOutput for more info. These sequence scores are (length-penalized) log-probabilities, so exponentiating them gives you probabilities/confidences.
In short, you can get the confidence scores as follows (the config adjustments are from Niels Rogge's TrOCR fine-tuning tutorial):
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-handwritten")
# set special tokens used for creating the decoder_input_ids from the labels
model.config.decoder_start_token_id = processor.tokenizer.cls_token_id
model.config.pad_token_id = processor.tokenizer.pad_token_id
# make sure vocab size is set correctly
model.config.vocab_size = model.config.decoder.vocab_size
# set beam search parameters
model.config.eos_token_id = processor.tokenizer.sep_token_id
model.config.max_length = 10
model.config.early_stopping = True
model.config.no_repeat_ngram_size = 3
model.config.length_penalty = 2.0
model.config.num_beams = 4
pixel_values = processor(im, return_tensors="pt").pixel_values
return_dict = model.generate(pixel_values, output_scores=True, return_dict_in_generate=True)
ids, scores = return_dict['sequences'], return_dict['sequences_scores']
generated_text = processor.batch_decode(ids, skip_special_tokens=True)[0]
I'm fairly new to HuggingFace and VisionEncoderDecoder models as well, so I hope others improve on this answer, but hope this helps.
@semihcanturk
Even though my model is configured with the settings you provided:
trocr_model.config.eos_token_id = trocr_processor.tokenizer.sep_token_id
trocr_model.config.max_length = 10
trocr_model.config.early_stopping = True
trocr_model.config.no_repeat_ngram_size = 3
trocr_model.config.length_penalty = 2.0
trocr_model.config.num_beams = 4
still, for a batched input (128 images at once), I am getting a GreedySearchEncoderDecoderOutput and thus no scores. For a single image it works fine, though. Also, I'm using a GPU for the computation, for your information.
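One thing worth trying when the config settings are not picked up at generation time is passing the beam-search arguments directly to generate(); the argument names below are from the standard HF generate() API, and this is a sketch rather than a confirmed fix for the batched case:

```python
# Force beam search per call instead of relying on model.config:
outputs = model.generate(
    pixel_values,              # (batch, 3, H, W) preprocessed image batch
    num_beams=4,
    early_stopping=True,
    no_repeat_ngram_size=3,
    length_penalty=2.0,
    max_length=10,
    output_scores=True,
    return_dict_in_generate=True,
)
scores = outputs.sequences_scores  # only present when beam search actually ran
```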
I got a tensor value, but when I try to convert it with a softmax function I always get tensor([1.]); maybe that's because of the greedy search that TrOCR does, since the tensor value is very low.
I did it with
generated_ids = OCR_model.generate(pixel_values, output_logits=True, return_dict_in_generate=True)
ids, scores = generated_ids['sequences'], generated_ids['logits']
and then
concat = torch.cat(scores, dim=0)
confianza = (torch.mean(torch.max(F.softmax(concat, dim=1), dim=1).values)).item()
Hi @wolfshow and all,
Thank you for your excellent work on text recognition.
I am trying to get word-level confidence with your source code, without using the Hugging Face Transformers version, because it has some issues with the current git repo.
Can you please point me to some way that I can do in your "pic_inference.py"?
Thank you in advance!