Unable to find and highlight all requested sentences

lauragug commented 2 years ago

Hello, I was looking for an automated way to highlight specific sentences in a PDF and was very happy to find your work. I identify sentences by their affinity to topics using an NLP tool and produce the dictionary to pass to highlighter.highlight. However, highlight finds only a subset of the sentences, and I cannot work out what the issue might be.

For example, for the attached document, P011117.pdf, I use the dictionary:

    [('#255', 'Regression algo(.|\n)+utcomes\).'),
     ('#390', 'Likewise, the a(.|\n)+k models.'),
     ('#105', 'Firms usually h(.|\n)+tivities.'),
     ('#397', '2 Model risk ma(.|\n)+visible.'),
     ('#53', 'research highli(.|\n)+be used.'),
     ('#255', 'In order to sel(.|\n)+election.'),
     ('#397', 'In order to sup(.|\n)+e model\).'),
     ('#105', 'For example, cu(.|\n)+nd where?'),
     ('#394', 'Some supervisor(.|\n)+s launch.'),
     ('#43', 'Such liability(.|\n)+damages.')]

The output of highlight is:

    [('#255', (0.914, 0.118, 0.388), 9, 70.91999999999985, 614.436, 527.52, 641.436),
     ('#105', (0.129, 0.588, 0.953), 17, 70.91999999999996, 665.436, 527.5188, 692.436),
     ('#255', (1.0, 0.757, 0.027), 26, 70.92000000000002, 695.436, 527.5200000000001, 722.436),
     ('#397', (0.298, 0.686, 0.314), 31, 70.91999999999996, 290.436, 527.5211999999999, 362.436),
     ('#394', (0.404, 0.227, 0.718), 37, 70.9199999999999, 368.436, 527.52, 410.436)]

That is, only 5 of the 10 sentences were found. As you can see, I am using a regex to match across lines, because I found it more reliable than the basic multiline match.

Any suggestions on how to improve the reliability of sentence matching would be greatly appreciated. I would really love to be able to use txtmarker! Thank you!
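To see which labels went unmatched, the requested labels can be compared with the labels in the returned tuples. The following is a minimal sketch using only the Python standard library. The label lists are copied by hand from the post above, and `unmatched_labels` is a hypothetical helper name, not part of txtmarker. Because some labels appear twice, the comparison is done as a multiset difference rather than a set difference.

```python
from collections import Counter

# Labels of the ten queries passed to highlight, in order (copied from the post above)
requested = ["#255", "#390", "#105", "#397", "#53",
             "#255", "#397", "#105", "#394", "#43"]

# Labels of the five result tuples returned by highlight (first field of each tuple)
found = ["#255", "#105", "#255", "#397", "#394"]


def unmatched_labels(requested, found):
    """Hypothetical helper: count, per label, the queries with no matching result."""
    # Counter subtraction keeps only positive counts, i.e. a multiset difference
    return Counter(requested) - Counter(found)


missing = unmatched_labels(requested, found)
for label, count in missing.items():
    print(label, count)
```

Since labels repeat, this only shows how many queries per label are missing. It cannot tell which of the two '#105' or '#397' patterns failed; that would need the page number or coordinates in each result.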

davidmezzetti commented 2 years ago

Thank you for the report and for including a sample file. I've often found that additional spaces between characters, or other control characters, trip things up. There may now be a better way to identify text coordinates, and adopting it would be part of the plan for the next iteration of txtmarker.

I'll keep this issue open for reference but unfortunately, I don't have a quick fix for this particular issue.
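One possible workaround for the stray spaces and control characters mentioned above is to build each search pattern so that it tolerates them. The sketch below uses only Python's standard `re` module. `tolerant_pattern` is a hypothetical helper, not a txtmarker function, and the `extracted` string is made up for illustration rather than taken from P011117.pdf. Whether patterns built this way recover the missing sentences in that file is untested.

```python
import re


def tolerant_pattern(sentence):
    """Hypothetical helper: build a regex that matches `sentence` even if the
    extracted text has extra whitespace or control characters between characters."""
    # Optional run of whitespace or ASCII control characters between any two characters
    gap = r"[\s\x00-\x1f]*"
    # Escape every non-whitespace character so regex metacharacters match literally
    chars = [re.escape(ch) for ch in sentence if not ch.isspace()]
    return gap.join(chars)


# Made-up extracted text: a split word, a double space, a line break and a form feed
extracted = "Firms usu ally  have\nmany\x0cactivities."
sentence = "Firms usually have many activities."

# The literal sentence does not match the extracted text...
literal_match = re.search(re.escape(sentence), extracted)

# ...but the tolerant pattern does
match = re.search(tolerant_pattern(sentence), extracted)
```

The trade-off is that such a pattern is more permissive: it also matches when words run together with no space at all, so it may over-match on short sentences.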

lauragug commented 2 years ago

Thank you for your comment!

neuml / txtmarker

Unable to find and highlight all requested sentences #10