microsoft / OCR-Form-Tools

A set of tools to use in Microsoft Azure Form Recognizer and OCR services.
MIT License
selectionMark not consistently recognized bug report #609

Open smf723 opened 3 years ago

smf723 commented 3 years ago

Describe the bug

I'm attempting to use the new selectionMark type with checkbox fields on the forms (ACORD 28 insurance) that I'm trying to parse. On some of the form samples I can define a field as a selectionMark type, and on others the tool will not accept the type. I get a popup message in the top right corner that states "Tag type is not compatible with this feature. If you want to change type of this tag, please remove or reassign all labels using this tag in your project.". If I use a different sample form, then I can change the type to selectionMark.

I think a different approach is needed for labeling selectionMark fields. I think the labeling tool should allow the selection of any rectangle (e.g. box) on the form as a selectionMark field. Selection of the field would be indicated by any sort of text or image within the rectangle.

Because my sample forms are inconsistent about whether they let me define the same type for the same field (e.g. selectionMark vs. string), I had to revert to using V2.0. Any field should be definable as a string without being forced to be labeled as a selectionMark.

To Reproduce

Steps to reproduce the behavior: I'm training with ACORD 28 forms, many samples of which you can find with a Bing search. The specific example checkbox I was using was the "Special" field under the "Coverage" section. Some samples will recognize the field as a selectionMark, and others will simply treat it as a string. If you are not able to get two sample forms to replicate this issue, please let me know and I will send you some samples.

Expected behavior

No field should ever be "forced" to be a selectionMark type. I should always have the option to make it a string type. I should be able to label any rectangle as a selectionMark type.

RichWK commented 3 years ago

I'm coming across a similar issue. I'm running a (custom) form through the FOTT OCR using text-based PDFs. There are tons of checkboxes on the page, probably around 100 or so. And while the majority of them are recognised just fine, there are about 5 or so which persistently go unrecognised.

The only way I've found to get them recognised is by using image-based PDFs (of the same form). Then they're correctly identified as selectionMarks.

But the trouble is... after composing a model and uploading a new, unseen, text-based PDF version of the form... those unrecognised checkboxes still aren't picked up.

I'm not sure what the best answer to this is — perhaps with the technology being a preview, the recognition still needs refining? It just seems very odd that they're not being picked up, because as far as I can tell they appear identical to the others.

This will really be a game-changing technology for us once it's working right, so I'd love for this to be resolved!

(Quick example of the original text-based PDF and the recognition as displayed within the 'Analyze' pane. You can see the two checkboxes in the bottom-right not having been identified.)

[Image 1: Original text-based PDF]

[Image 2: 'Analyze' results]

xinase commented 3 years ago

Thank you for your feedback. We're working on an improved version of our algorithm, which will address this and some other issues. We will update this comment when the new version is ready for testing.

smf723 commented 3 years ago

This is why I think they need a different approach for checkbox (selectionMark) fields. I should be able to label a selectionMark on a form whether it's selected or not. They need the tool to recognize any rectangle on the page and allow you to label it as a selectionMark. This would probably require a different "mode" while editing (e.g. rectangle selection mode), and then every rectangle on the page would become "selectable" for labeling. Obviously this is a significant change from just relying on the underlying OCR engine. Another potential advantage of recognizing rectangles is that it could be used to identify other field types (e.g. string, number, etc.). This would make it easier to label large text fields (e.g. a comments section) without needing a bunch of sample forms.

BTW, Rich, is that form for property assessments? After we get done with our ACORD forms we're going to look at building inspection forms. The sky is the limit when you start thinking about all of the possible form types.

RichWK commented 3 years ago

> Thank you for your feedback. We're working on an improved version of our algorithm, which will address this and some other issues. We will update this comment when the new version is ready for testing.

Thanks for the reply @xinase! Looking forward to checking it out. Is that improved version targeted for the next release or is it longer term work?

RichWK commented 3 years ago

> BTW, Rich, is that form for property assessments? After we get done with our ACORD forms we're going to look at building inspection forms. The sky is the limit when you start thinking about all of the possible form types.

I probably shouldn't comment on the exact nature of the form, but you're in the right ballpark. We're also in the same position of having many other forms that we could apply this technology to now that checkbox support has arrived. I'm excited for it to be working fully!

stew-ro commented 3 years ago

To clarify, our current logic is to only allow selection marks recognized by OCR (identified by pink/red bounding boxes on the document) to be assigned to a selection mark tag (set by clicking the tag drop-down icon) or to an unlabeled tag. This way, a tag will only have selection mark labels, which is required for training selection marks.

If you see any inconsistencies with this logic, that's a bug, so please let us know.

Like @xinase mentioned, we're working on improving this issue. We'll follow up with a release timeline and more details when available.
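For anyone following along, the rule described above can be sketched roughly as below. This is only an illustration of the stated logic, not the tool's actual code: the `Tag` shape, the `RegionKind` values, and the `canTagRegion` function are hypothetical names, and the handling of text regions is an assumption inferred from the statement that a selection mark tag should only hold selection mark labels.

```typescript
// Hypothetical model of the tagging rule; none of these names come from the FOTT source.
type FieldType = "string" | "number" | "date" | "selectionMark";
type RegionKind = "text" | "selectionMark";

interface Tag {
  name: string;
  type: FieldType;
  labelCount: number; // how many labels in the project already use this tag
}

// Returns true when a region of the given kind may be labeled with the tag.
function canTagRegion(tag: Tag, region: RegionKind): boolean {
  if (region === "selectionMark") {
    // An OCR-detected selection mark fits a selectionMark tag, or a tag that
    // has no labels yet (and so can still be switched to selectionMark).
    return tag.type === "selectionMark" || tag.labelCount === 0;
  }
  // Assumption: text regions are kept out of selectionMark tags so that such
  // a tag only ever holds selection mark labels.
  return tag.type !== "selectionMark";
}
```

Under this reading, a checkbox that OCR reports as plain text can never be assigned to a selectionMark tag, which would explain why the same field is taggable on one sample and not on another.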

xinase commented 3 years ago

> Thanks for the reply @xinase! Looking forward to checking it out. Is that improved version targeted for the next release or is it longer term work?

There is always a next version, even after the immediate next version. :) One thing we realize is that there are many different kinds of selectionMarks. We will improve our model with more data to increase our coverage over time while maintaining a good level of accuracy.

As a workaround today, during training you can pick files whose selectionMarks are in the "selected" state, since those are more likely to be detected and can therefore be labeled.
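One way to apply this workaround is to rank candidate training files by how many selected marks OCR found in each. The sketch below assumes the OCR output follows the v2.1 Layout response shape, with `analyzeResult.readResults[].selectionMarks[]` entries carrying a `state` of "selected" or "unselected". Treat those field names, and the `countSelected` and `rankForTraining` helpers, as assumptions to check against your own OCR JSON.

```typescript
// Assumed subset of the Layout OCR output; verify the field names against your own files.
interface SelectionMark {
  state: "selected" | "unselected";
  confidence: number;
}

interface ReadResult {
  page: number;
  selectionMarks?: SelectionMark[]; // may be absent when no marks were detected
}

interface OcrResult {
  analyzeResult: { readResults: ReadResult[] };
}

// Counts the detected marks whose state is "selected" across all pages.
function countSelected(ocr: OcrResult): number {
  let count = 0;
  for (const page of ocr.analyzeResult.readResults) {
    for (const mark of page.selectionMarks || []) {
      if (mark.state === "selected") {
        count++;
      }
    }
  }
  return count;
}

// Returns file names ordered from most to fewest selected marks.
function rankForTraining(files: { name: string; ocr: OcrResult }[]): string[] {
  return [...files]
    .sort((a, b) => countSelected(b.ocr) - countSelected(a.ocr))
    .map((f) => f.name);
}
```

Files near the top of the ranking are the ones where the most checkboxes were detected as selected, so they are reasonable first picks for labeling.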

RichWK commented 3 years ago

> As a workaround today, during training you can pick files whose selectionMarks are in the "selected" state, since those are more likely to be detected and can therefore be labeled.

Ooh yes, great idea, I hadn't thought of that! Thanks, I'll give it a try.

smf723 commented 3 years ago

Attached is a sample ACORD 28 insurance form with all of the possible checkbox fields selected. When I attempt to train with this form, not all of the checkboxes are available as selectionMark. In fact, if you click on some of the ones in the center of the form, under the Yes, No, N/A titles, you will see that the entire column of checkboxes is selected. Yet in other cases an individual checkbox can be selected as a selectionMark.

ACORD 28 (2014_01) All.pdf

mronda commented 3 years ago

I am coming up to this bug as well. What would be a work-around if the checkboxes are inside a cell of a table? After I run OCR on my PDF it only detects the checkboxes inside the cell as text and it does not let me tag the selection boxes as selectionMark.

xinase commented 3 years ago

> I am coming up to this bug as well. What would be a work-around if the checkboxes are inside a cell of a table? After I run OCR on my PDF it only detects the checkboxes inside the cell as text and it does not let me tag the selection boxes as selectionMark.

We're working on this issue and will update you when we have a new release, hopefully soon.

mronda commented 3 years ago

@xinase Can you update here when the fix is added? I appreciate it!

xinase commented 3 years ago

@mronda @RichWK @smf723 Thank you for your feedback. As I mentioned, we are working on improvements in this area. In the meantime, you could also open a support ticket via the Azure portal to get direct support; that way, you could safely share testing data with the MSFT product team.

smf723 commented 3 years ago

> @mronda @RichWK @smf723 Thank you for your feedback. As I mentioned, we are working on improvements in this area. In the meantime, you could also open a support ticket via the Azure portal to get direct support; that way, you could safely share testing data with the MSFT product team.

I already uploaded a sample PDF. Are you looking for more than that? If I open a ticket, will it be forwarded to your team? Should I reference this GitHub issue?

xinase commented 3 years ago

> I already uploaded a sample PDF. Are you looking for more than that? If I open a ticket, will it be forwarded to your team? Should I reference this GitHub issue?

If you open a ticket, we will be looped in, and yes, you can reference this issue. But if you don't have more issues or test data to share, you don't need to open a ticket; we have collected the info from this thread and logged it in our system.

Thanks

smf723 commented 3 years ago

So to open a support ticket you have to have a support plan, which at the lowest level is $29/month. You can find many sample ACORD PDF forms on the Internet by simply searching for "ACORD 28". Some of the forms are "fillable" PDFs where you can experiment with different checkboxes being selected. Also note that although the form itself is static, the various software packages (e.g. Property & Casualty agency software systems) that generate the form content are different. For instance, some simply use a lower or upper case letter X to indicate that a checkbox is selected. Others use a solid square or circle, while still others use a literal checkmark graphic. Again, if you search the Internet for the form type you will see many variations of the form content. Another ACORD form we are using is "ACORD 25", and again, if you search you will find many samples, including editable versions. If you need specific links to examples, just let me know.

RichWK commented 3 years ago

Hi again @xinase — I noticed the November release included this line in its release notes:

"Quality improvements - Extraction improvements including single digit extraction improvements."

Does that include improvements to checkbox recognition as discussed here? Or are the improvements for that still to be included in a future release?