sumatrapdfreader / sumatrapdf

SumatraPDF reader
http://www.sumatrapdfreader.org
GNU General Public License v3.0

Search: Match whole word Button #4295

Open mrx23dot opened 3 weeks ago

mrx23dot commented 3 weeks ago

The Match Case button is nice, but when searching in datasheets it would be useful to have a Match Whole Word button next to it, like the search option in VS Code.

It could be implemented as the pattern \bXXX\b. Full regexp search would be overkill, but this would be a nice addition.

Or, for example, this is the Firefox search: [image]

GitHubRulesOK commented 3 weeks ago

There is a problem with PDF: it does not store text as human words or tokens, only as numeric character codes. Under the surface, PDF is in many ways much like OCR output (single characters, numerically distinct, that happen to be recognisable). So what defines a word in a PDF in one language is not the same in any other human language.

What SumatraPDF does is treat the letters between two apparent spaces as conceptually one selectable word. So a double-click on one letter looks for the space before and after it, which is not 100% reliable but works for European and some Asian languages. This is related to https://github.com/sumatrapdfreader/sumatrapdf/issues/410

Here it works: it cannot find "search" because it is not a whole word. [image]

Here we find s, then e, then a, then r ... so without that INITIAL defining space all S are equal. [image]

So the CODE does use a whole-word concept of beginning and ending spaces (it simply fails some of the time with punctuation). There are several open issues. So Match Word in a PDF is not a useful feature. In many cases Match Case is more reliable and thus more important, and it is the only essential icon, as there is no easy hotkey for toggling the Shift key other than an extra icon alongside the find box.

mrx23dot commented 3 weeks ago

If we know how to search for simple text, then we only need to filter the results a bit more: exclude the ones where the match is extended by a letter on either side, as in [a-z]TEXT[a-z]. This would be a close enough approximation of a \b boundary.

It is useful, e.g., in every engineering or academic field where you need to search for short expressions like ADC.

The button could be a multi-select dropdown if you don't want visual clutter (or a button the user can enable from the options).
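A minimal sketch of that second filtering pass, in Python for brevity. It assumes the page text has already been extracted into one plain string, which is exactly the step debated in this thread. The function name find_whole_words and the sample sentence are made up for illustration and are not SumatraPDF code.

```python
def find_whole_words(text: str, needle: str) -> list[int]:
    """Return start offsets of `needle` that are not extended by a letter."""
    hits = []
    start = text.find(needle)  # round one: plain substring search
    while start != -1:
        end = start + len(needle)
        before = text[start - 1] if start > 0 else ""
        after = text[end] if end < len(text) else ""
        # Round two: drop the hit if a letter touches either side.
        # "".isalpha() is False, so the edges of the text count as boundaries.
        if not (before.isalpha() or after.isalpha()):
            hits.append(start)
        start = text.find(needle, start + 1)
    return hits


page = "The ADC samples; ADCs and SADC do not count, but (ADC) does."
print(find_whole_words(page, "ADC"))  # [4, 50]
```

Only the substring hits are re-checked, so the extra cost is two character lookups per hit.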

GitHubRulesOK commented 3 weeks ago

There is no "easy" way with document formats like PDF. There is no simple "searchable" text block as there is in a text file.

Here is ONE PDF text line

[image]

And here is just the first word, "This", from the second line (PDF is a highly inefficient way to store text or images). It comes from an exceptionally "word"-oriented PDF file (it is rare to see words spaced correctly except after OCR). The whole-word "search" is highlighted in blue. [image]

Feel free to add the code needed to unclutter the PDF structure. All PRs are welcome, but not for icons alone.

mrx23dot commented 3 weeks ago

So here you searched for "search"

GitHubRulesOK commented 3 weeks ago

I used an ANSI-encoded example because it makes it easier to see that the characters in a PDF are numeric codes, not the ABC SHAPES on the surface. As I said, that is actually a very, very rare simple case where OCR has nicely used a "word dictionary" to keep the letters in order. That "word" search could just as easily have been written by the PDF writer so that it needs finding with "hcraes" or even "aches", as letters don't have to be in order or adjoining.

I have a test file showing that a search for "A a a a b b c d r r" is needed to find "Abracadabra", since the gaps between letters and the numeric encodings used are far more extensive than you might think.

Decompress any PDF into its WinANSI rendering stream and you will rarely find two that use a similar means to write text. Some may appear as (s)-2(e)-4(a)1(r)2(c)-2(h); others may, on text extraction, appear with a virtual line feed between each letter.

The majority of PDFs work well simply with " search " (with the surrounding spaces), as the whitespace matches gaps or non a-Z characters once the numeric codes have been turned into letter strings and joined into a single line.

The more complex the search, the slower it becomes. Many have asked for an "ignore accents" option, but that requires checking every global variation of accented vowels and other shapes to "guess" a match.
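To make the (s)-2(e)-4(a)1(r)2(c)-2(h) form concrete, here is a rough Python sketch that joins such a fragment back into a string. It relies on the PDF convention that the numbers between the strings are position adjustments in thousandths of a text-space unit, subtracted from the current position, so a large negative value opens a gap. The name join_tj and the 200-unit threshold are assumptions for illustration only. The sketch ignores fonts, encodings, escape decoding and nested parentheses, and it cannot help with letters written out of order.

```python
import re

# A token is either a parenthesised literal string or a kerning number.
TOKEN = re.compile(r"\((?P<text>(?:\\.|[^\\()])*)\)|(?P<kern>-?\d+(?:\.\d+)?)")


def join_tj(fragment: str, gap_threshold: float = 200.0) -> str:
    """Join the strings of one array, guessing a space at large gaps."""
    out = []
    for m in TOKEN.finditer(fragment):
        if m.group("text") is not None:
            out.append(m.group("text"))
        elif -float(m.group("kern")) > gap_threshold:
            # Adjustments are subtracted, so a big negative number is a gap.
            out.append(" ")
    return "".join(out)


print(join_tj("(s)-2(e)-4(a)1(r)2(c)-2(h)"))  # search
print(join_tj("(ADC)-300(input)"))            # ADC input
```

Real extractors work from glyph positions and font metrics rather than a fixed threshold, which is part of why word boundaries come out unreliable.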
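As a point of comparison, on text that has already been extracted to Unicode, one common approach is to fold accents once with Unicode normalization instead of enumerating variants per search. The Python sketch below uses the standard unicodedata module; fold_accents is a hypothetical helper name. It only covers characters that decompose into a base letter plus combining marks, so letters such as "ø" or "ß" pass through unchanged, and it does nothing for PDFs whose character codes do not map cleanly to Unicode in the first place.

```python
import unicodedata


def fold_accents(s: str) -> str:
    """Strip combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(fold_accents("résumé"))                           # resume
print(fold_accents("Café") == fold_accents("Cafe"))     # True
```

Both the query and the page text would need the same folding before comparison.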

mrx23dot commented 3 weeks ago

I don't understand the reasoning. If the current search can find "word" (even though the characters are hidden internally or encoded differently), then all we need to do is check one character before and after to see whether it is part of a bigger word.

If the current search cannot find the phrase "word", then the search is already unusable and there is nothing we can do (which is rare).

Searching over many pages already takes seconds, but we would only filter the simple results further, so it wouldn't affect the round-one findings. And even then, round two would be optional. It would still be faster than human evaluation.

We could check for delimiter characters instead, e.g. [.,(/!>?], which would be more universal.
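A sketch of that delimiter variant in Python, written as an optional second pass over hits the existing search has already produced. It assumes plain extracted text. DELIMITERS copies the example set above verbatim, and is_boundary and filter_hits are hypothetical names, not SumatraPDF functions.

```python
DELIMITERS = set(".,(/!>?")  # the example set from the comment above


def is_boundary(ch: str) -> bool:
    """True at the edge of the text, at whitespace, or at a listed delimiter."""
    return ch == "" or ch.isspace() or ch in DELIMITERS


def filter_hits(text: str, hits: list[int], length: int) -> list[int]:
    """Keep only the round-one hits that have a boundary on both sides."""
    kept = []
    for start in hits:
        end = start + length
        before = text[start - 1] if start > 0 else ""
        after = text[end] if end < len(text) else ""
        if is_boundary(before) and is_boundary(after):
            kept.append(start)
    return kept


text = "Use ADC, not ADC2 or /ADC?"
print(filter_hits(text, [4, 13, 22], len("ADC")))  # [4, 22]
```

Unlike a letters-only check, this also rejects ADC2, but an allow-list has to be complete: the example set has no ")", so "(ADC)" would be filtered out until it is added.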

GitHubRulesOK commented 3 weeks ago

I am not the developer, so I cannot say how a simple viewer could be enhanced to provide a full whole-word search function. But (for PDF especially) it normally needs a more dedicated code base, of the kind often found in larger editors (which SumatraPDF is not, but the Adobe/Foxit/PDFium/Edge/Tracker readers are). Chromium Edge can fully edit a PDF as it is based on Skia editor code, but its ability is limited to the simpler ones. SumatraPDF can currently only append annotations as incremental writes, although it sits on the MuPDF rendering and editing code base. Expanding it to alter PDF internals would be a drastic uplift in functionality. Here I used Tracker Editor to produce words from a non-searchable Edge printout!

[image]