timpaul / form-extractor-prototype

A prototype of a tool that generates web forms from document forms
MIT License

Tool sometimes misinterprets multiple-choice questions #15

Open timpaul opened 2 months ago

timpaul commented 2 months ago

Accurately interpreting multiple-choice questions (beyond simple yes/no) is a challenge. Let's capture examples of the tool doing this successfully and unsuccessfully, to work out how we might improve its performance.

timpaul commented 2 months ago

Here's a partially successful example for this image:

image

The different options for question 20 in the doc have been correctly parsed, but the hint text was not.

timpaul commented 2 months ago

Another mostly successful example from the same form as above:

image

The options for question 23 in the form were correctly determined, as was the fact that only one response is allowed.

The conditional date fields were not picked up, but this isn't surprising as the multiple-choice component doesn't support them.

This is a good example of where you might choose to structure this question differently in the web version anyway, using multiple pages and routing.

timpaul commented 2 months ago

Here's an example of it getting it wrong, from the same form:

image

It made 2 errors:

  1. It treated the hint text as the first option
  2. It assumed only one response was allowed
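To track these two failure modes across examples, a small check along these lines might help. This is only a sketch: the `Question` shape, the `find_errors` helper and the sample question text are all invented for illustration and are not the tool's real output format or taken from the form.

```python
from dataclasses import dataclass, field


@dataclass
class Question:
    # Hypothetical shape for one extracted question (not the tool's real format).
    title: str
    hint: str = ""
    options: list[str] = field(default_factory=list)
    allow_multiple: bool = False


def find_errors(expected: Question, actual: Question) -> list[str]:
    """Compare a hand-written expected question with the extracted one."""
    errors = []
    # Error 1: the hint text shows up as the first option.
    if (
        expected.hint
        and actual.options
        and actual.options[0].strip() == expected.hint.strip()
    ):
        errors.append("hint_as_first_option")
    # Error 2: the single/multiple response setting differs.
    if expected.allow_multiple != actual.allow_multiple:
        errors.append("wrong_selection_mode")
    return errors


# Placeholder data, made up for this sketch.
expected = Question("Which apply?", "Tick all that apply", ["A", "B"], allow_multiple=True)
actual = Question("Which apply?", "", ["Tick all that apply", "A", "B"], allow_multiple=False)
print(find_errors(expected, actual))
```

Running the same check over every captured example would show which error type is most common.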

What's interesting (and frustrating) is that the question is nearly identical to this one, which was successfully parsed.

It does occasionally get it right:

image

timpaul commented 2 months ago

Here's another example of a mostly successful extraction, from question 42 of this image:

image

The hint text isn't carried over as hint text; instead, it's added to the question title.

timpaul commented 2 months ago

It's now getting an isolated version of this example right:

image

timpaul commented 2 months ago

Another fail, from this image:

image

It chose checkboxes instead of radios. I wonder if I can get it to understand the difference based on the hint text?...

timpaul commented 2 months ago

Yes, I can!

image

This was fixed in this commit by adding the following to the description text for the answer_type object in the schema:

If any part of the question contains text like 'Tick the boxes...' it's a multiple_choice question.

I'd tried a few other variants before finding one that worked, which is interesting. I think what made it work was the confidence of the statement: saying "if any part of the question", and that it "is" a multiple_choice question (rather than "probably is"). Expressing it as a standalone sentence, rather than appending it as a clause to another sentence, also seemed to help.

Notice that the question in the example doesn't contain the exact text that I cite in the schema, but it still matches.
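For anyone who wants to see roughly where that sentence sits, here is a minimal sketch of a schema fragment. Only the `answer_type` name, the `multiple_choice` value and the quoted sentence come from this thread; the other enum values, the `string` type and the base description are assumptions made for illustration, not the repository's actual schema.

```python
import json

# The sentence quoted in the comment above, verbatim.
TICK_BOXES_RULE = (
    "If any part of the question contains text like 'Tick the boxes...' "
    "it's a multiple_choice question."
)

# Hypothetical schema fragment. The enum values other than `multiple_choice`
# and the first sentence of the description are placeholders.
answer_type = {
    "type": "string",
    "enum": ["text", "single_choice", "multiple_choice"],
    # The rule is appended as its own sentence, not as a clause of another one.
    "description": "The kind of answer the question expects. " + TICK_BOXES_RULE,
}

print(json.dumps({"answer_type": answer_type}, indent=2))
```

Keeping the rule in a named constant makes it easier to swap in other wordings and compare how each one performs.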