Incorrect highlighted text using search feature

stephanrauh / ngx-extended-pdf-viewer

A full-blown PDF viewer for Angular 16, 17, and beyond

https://pdfviewer.net

Apache License 2.0


Incorrect highlighted text using search feature #1720

Closed dashakureru closed 1 year ago

dashakureru commented 1 year ago

Describe the bug When using the search feature from the toolbar, the highlighted area is incorrect. (The same issue occurs with the PDFs in the documentation.)

Version info

Version of ngx-extended-pdf-viewer: 16.2.5

Desktop (please complete the following information):

Browser Edge and Chrome

Smartphone (please complete the following information):

Device: Windows PC

To Reproduce

Go to https://pdfviewer.net/extended-pdf-viewer/simple
Click on the Search tool in the toolbar and search for a word (for example, "Author"). Result will be:

Here's another example from this document: https://pdfviewer.net/extended-pdf-viewer/pages-loaded

Screenshots See above :)

Additional context None :)

Thank you! I know from experience how much work filling a form like this is. It's always tedious and annoying. But it helps me to focus on the important points and to speed up development. So thank you very much for your understanding and your patience!

korydondzila commented 1 year ago

Quite literally the highlight is not wrapping the right part of the string, right number of characters though. So at least we know it's not just offset visually in a weird way. My guess is that some indexing change happened? Though this might be a pdf.js issue?

[Screenshot: Screen Shot 2023-03-21 at 2 05 52 PM]

stephanrauh commented 1 year ago

Oh no! I'm already familiar with this kind of bug... it always amounts to a difficult debugging session. Maybe you can give me a hint. Do you happen to know which version introduced the bug?

Just in case you want to dig into the code yourself: a likely candidate is the normalize method. It has changed recently to accommodate several non-Latin languages. It's possible that these changes have side effects I didn't consider when merging.
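To illustrate how a normalization step can shift a highlight while keeping its length, here is a minimal sketch. It is not the library's or pdf.js's actual normalize code: `normalizeWithPositions` and `findMatch` are hypothetical helpers, and the use of NFKD plus lower-casing is an assumption chosen only because it changes the string length (a ligature such as "ﬁ" expands to "fi"). The point is that an index found in the normalized text has to be mapped back to the original text before highlighting.

```typescript
// Hypothetical helpers for illustration only; not the real normalize method.
interface Normalized {
  text: string;
  // positions[i] is the index in the original string of the character
  // that produced normalized character i.
  positions: number[];
}

function normalizeWithPositions(original: string): Normalized {
  let text = "";
  const positions: number[] = [];
  for (let i = 0; i < original.length; i++) {
    // Assumption: NFKD + lower-casing stands in for the real normalization.
    // One input character may expand into several output characters.
    const expanded = original.charAt(i).normalize("NFKD").toLowerCase();
    for (let j = 0; j < expanded.length; j++) {
      text += expanded.charAt(j);
      positions.push(i);
    }
  }
  return { text, positions };
}

function findMatch(
  original: string,
  query: string
): { start: number; end: number } | null {
  const haystack = normalizeWithPositions(original);
  const needle = normalizeWithPositions(query).text;
  if (needle.length === 0) return null;
  const idx = haystack.text.indexOf(needle);
  if (idx === -1) return null;
  // Map the normalized offsets back to offsets in the original string.
  const start = haystack.positions[idx];
  const end = haystack.positions[idx + needle.length - 1] + 1;
  return { start, end };
}

// "\uFB01" and "\uFB02" are the ligatures "fi" and "fl".
const sample = "\uFB01nancial \uFB02ow of Author";
const naiveIndex = normalizeWithPositions(sample).text.indexOf("author");
console.log(sample.slice(naiveIndex, naiveIndex + 6)); // "thor" (two characters off)
const match = findMatch(sample, "author");
console.log(match ? sample.slice(match.start, match.end) : null); // "Author"
```

Using the normalized index directly on the original text skips one character too many for every expanded character that precedes the match. That would fit the observation that some searches look correct while others are several characters off, but whether this is the actual cause here is only a guess.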

dashakureru commented 1 year ago

Hi! Sorry it took a while to get back. I'm not sure yet which version introduced the bug, but I used 16.2.5. I'll try to find the last working version.

dashakureru commented 1 year ago

Hi again! The last working version is 16.1.0.

stephanrauh commented 1 year ago

So it's the update from pdf.js 3.3 to 3.4. Thanks! I'm sure that'll help me.

stephanrauh commented 1 year ago

@korydondzila Kory, thanks for adding your insight in this and several other issues. I appreciate this a lot!

stephanrauh commented 1 year ago

Interesting - looking for "James Boyle" shows the correct result, but "enclosing the commons" is four characters off.

stephanrauh commented 1 year ago

Your bugfix has landed with version 16.2.16.

Enjoy! Stephan

avinashgazula commented 1 year ago

Looks like this issue wasn't completely fixed. Works for a few words, but the highlight is still off for certain matches

stephanrauh commented 1 year ago

What? Oh, wait - I didn't update the showcase. Maybe that's the problem? On my machine, it looks better:

@avinashgazula

avinashgazula commented 1 year ago

I tested the latest version in my application and it seems like it works for some words and is off by a few characters for some. I've searched for "financial" here

stephanrauh commented 1 year ago

The PDF viewer renders the text as an image, and adds an invisible text layer above it. The text layer is used for highlighting search results and for marking text. The problem is that both layers are rendered independently. For technical reasons, they usually aren't identical. For example, your computer probably doesn't have the font the PDF file needs. So pdf.js is using a lot of heuristics and guesswork to provide a good match. Most of the time, this works, but sometimes, it doesn't. Being off half a character or even more isn't unusual.

This means that I only consider it an error if you can show me in the HTML code that the wrong text is marked. Kory demonstrated the idea nicely in her comment above. Open the developer tools, find the highlighted text in the DOM, and check whether the <span> responsible for highlighting covers the correct text or not.
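The check described above can be sketched roughly as follows. The helper `selectedHighlightCoversQuery` is hypothetical, and the CSS selector in the comment is an assumption about the markup pdf.js generates (highlight spans carrying the classes `highlight` and `selected` inside a `.textLayer` element); verify it in the developer tools before relying on it.

```typescript
// Hypothetical helper: compares the text inside the highlight spans of the
// currently selected match with the search term, ignoring case and
// differences in whitespace.
function selectedHighlightCoversQuery(
  fragments: string[],
  query: string
): boolean {
  const clean = (s: string): string =>
    s.replace(/\s+/g, " ").trim().toLowerCase();
  // A match may be split across several spans, so join them first.
  return clean(fragments.join("")) === clean(query);
}

// In the browser console, the fragments could be collected like this
// (assumed selector, see the note above):
//
//   const fragments = Array.from(
//     document.querySelectorAll(".textLayer .highlight.selected"),
//     (el) => el.textContent ?? ""
//   );
//   selectedHighlightCoversQuery(fragments, "enclosing the commons");
```

If the function returns true while the highlight still looks shifted on screen, the span covers the right characters and the mismatch comes from how the text layer is positioned over the rendered image. If it returns false, the wrong text is marked in the DOM.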

Thanks in advance Stephan

avinashgazula commented 1 year ago

Looks like the span is applied to the right characters. It's just off by a few characters on my pdfs

stephanrauh commented 1 year ago

OK, I'm closing the issue again. I suppose you can reproduce the remaining issue on https://mozilla.github.io/pdf.js/web/viewer.html. If so, please file an issue at https://github.com/mozilla/pdf.js/issues.

When you do so, please be aware that the pdf.js team receives an incredible number of issues every day, so they're triaging strictly. Fill their bug report form meticulously. And keep in mind that pdf.js is the PDF viewer of Firefox. I'm using it to display PDF files in Angular, and they tolerate that, but only to a certain point. If you mention ngx-extended-pdf-viewer, they'll close the ticket without even looking at it. Report a Firefox bug, and you'll probably be fine.