Add option to search only text files - speed up all files search

mvorisek commented 3 years ago

This is a feature request.

Searching all files is slow and very inefficient when the searched directory contains binary files.

This is a feature request to allow searching in text files only, i.e. files that contain no NUL (0x00) bytes*.

* This is how git checks whether a file is binary: it scans the file and stops treating it as a text file if a null byte is found.
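A minimal sketch of the NUL-byte check described above, not taken from git or Notepad++. The function names and the 8000-byte probe size are assumptions made for illustration; only a prefix of the file is read, so that large binaries can be rejected cheaply.

```python
def has_nul_byte(data):
    """Return True if the byte string contains at least one 0x00 byte."""
    return b"\x00" in data


def looks_binary(path, probe_size=8000):
    """Read only the first probe_size bytes and apply the NUL-byte test.

    probe_size is a hypothetical default; a larger value catches NUL bytes
    that appear later in the file at the cost of more I/O.
    """
    with open(path, "rb") as handle:
        return has_nul_byte(handle.read(probe_size))
```

A file that passes this test can still be binary (no NUL in the probed prefix), so this is a heuristic rather than a guarantee.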

sasumner commented 3 years ago

Of course you can mitigate this somewhat yourself by excluding known binary types, e.g. exe, obj, o, etc. using the ! operator in the Filters box.

I don't know that NUL bytes are a good detector of files to exclude.
NULs are supported well by Scintilla (Notepad++'s text editing component) and support for NULs is getting better in Notepad++ itself.

mvorisek commented 3 years ago

I mean adding an extra checkbox. Standard text files do not normally contain a NUL byte, and binary files, on the other hand, very often do. This feature can reduce search times 10-100x for typical computer program folders.

mere-human commented 3 years ago

Do you want something similar to what grep does for binary files?

--binary-files=type

If a file’s data or metadata indicate that the file contains binary data, assume that the file is of type type. Non-text bytes indicate binary data; these are either output bytes that are improperly encoded for the current locale (see Environment Variables), or null input bytes when the -z (--null-data) option is not given (see Other Options). By default, type is ‘binary’, and grep suppresses output after null input binary data is discovered, and suppresses output lines that contain improperly encoded data.

https://www.gnu.org/software/grep/manual/grep.html#File-and-Directory-Selection

mvorisek commented 3 years ago

yes

sasumner commented 3 years ago

@mere-human said

Do you want something similar to what grep does for binary files?

What are your suggestions for how to specify something like this in the Notepad++ UI ?

mere-human commented 3 years ago

What are your suggestions for how to specify something like this in the Notepad++ UI ?

I thought about some check box like this

sasumner commented 3 years ago

@mere-human

Search binary files

Seems brief and to-the-point ... I like it.

Is "binary" a well-known enough term, among unsophisticated users? Would "Search non-text files" be better? (not sure even that helps the non-sophisticated)

And your definition of binary is...what? (I know you provided grep's take on this above, but how would you define it for Notepad++?)

One presumes that searching would proceed as it currently does in a file, until a file is known to be "binary", then the rest of that file is skipped, and any cached hit results for that file are discarded and not presented in the output window.
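The behavior presumed above could be sketched roughly like this. It is an illustration only, not how Notepad++ searches: the function name is hypothetical, the input is any binary stream, and "binary" is assumed to mean "a NUL byte was seen".

```python
def search_stream(stream, needle):
    """Collect (line_number, line) hits, discarding them all if a NUL byte appears.

    stream is any iterable of byte lines, e.g. a file opened with mode "rb".
    """
    hits = []
    for line_number, raw_line in enumerate(stream, start=1):
        if b"\x00" in raw_line:
            # The file is now considered binary: skip the rest of it and
            # drop the hits cached so far.
            return []
        if needle in raw_line:
            hits.append((line_number, raw_line.rstrip(b"\r\n")))
    return hits
```

With this design a text file and a skipped binary file both yield an empty list when nothing is reported; a real implementation might want to distinguish the two cases.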

mere-human commented 3 years ago

Is "binary" a well-known enough term, among unsophisticated users?

As for me, it is quite straightforward. However, I am not an unsophisticated user by far.

Would "Search non-text files" be better? (not sure even that helps the non-sophisticated)

Well, Wikipedia is always here for curious users who don't know what that means 😄
https://en.wikipedia.org/wiki/Binary_file

Maybe, even remove the inversion and say it like "Search only text files". However, that could be confused with the file type that could be Normal text or C++ Source, etc.

mere-human commented 3 years ago

And your definition of binary is...what?

It contains NUL character somewhere in the middle of the file.
When the chunk of the file is decoded to a string, its length doesn't correspond to the byte length (considering wide char) or the decoding fails.

Actually, my idea was to look at the grep source and use that as a reference.

sasumner commented 3 years ago

@mere-human

After the discussion, the winner (at least in my view), is the original Search binary files. :-)

sasumner commented 3 years ago

@mere-human

When the chunk of the file is decoded to a string, its length doesn't correspond to the byte length (considering wide char) or the decoding fails.

Would you do a pre-test of the file, before it is loaded (as normal into an invisible view, to prep for searching) to determine binary/text? That is what the above sounds like. I suppose you have to consider what happens if a file is already loaded for normal editing by the user (in which case, the in-memory version is searched, not the file on disk).

Otherwise, the encoding detection you touched on is already handled (and is what it is :-) ) by the load process, so, without really specialized processing, it might be difficult to use this (bad encoding) as a means to declare binary/non-binary status.

sasumner commented 3 years ago

@mere-human

Are you wanting to take this on? I could assign you. :-) Best wait for a @donho go/no-go, though.

mere-human commented 3 years ago

Would you do a pre-test of the file, before it is loaded (as normal into an invisible view, to prep for searching) to determine binary/text?

I have to do the research and think about it. But even if we had only the NUL byte test, that would be much better than nothing.

Yes, I can work on this if it's okay.

mere-human commented 3 years ago

Actually, I have one more idea about binary file detection:

It contains NUL character somewhere in the middle of the file.

When the chunk of the file is decoded to a string, its length doesn't correspond to the byte length (considering wide char) or the decoding fails.

Look at the file extension.

We can filter out some commonly known binary extensions like .exe, .dll, .mp4, etc.

Also, this could be a separate feature - "exclude extensions". This is widely used in some programs. For example, I like this search dialog in Double Commander: It even has "Find files NOT containing text" option.

Some of those are specific to the program type (Double Commander is a file browser) and won't be as useful in the text editor. But include/exclude lists seem particularly useful to me. And that can be hidden by default (e.g. user presses "Advanced..." button to see it).

sasumner commented 3 years ago

We can filter out some commonly known binary extensions like .exe, .dll, .mp4, etc. Also, this could be a separate feature - "exclude extensions".

I wonder if the binary known extensions could be on the Searching page in the Preferences. Real estate in the Find dialog is tight and is always controversial when considering change. Or some other great idea about UI...("Advanced" button mentioned)

It even has "Find files NOT containing text" option.

In N++ it could make for a bit different look to output, as currently output is Search string - File - Individual Hits - (repeat File + Hits). With this we'd just have Search string - File(s). Not really a problem.

But include/exclude lists seem particularly useful to me.

I think the trouble with this is coming up with a good UI for it. For including files, e.g. *.txt or excluding files, e.g. !*.exe, the existing UI is workable. But when it gets more complex...hmmm. Which is probably the reason that #2433 hasn't seen any action on excluding folders.
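For reference, the include/exclude semantics of patterns such as *.txt and !*.exe can be sketched in a few lines. This is an assumption-laden illustration, not Notepad++'s filter code: the function name is hypothetical, patterns are assumed to be whitespace-separated, and matching is assumed to be case-insensitive.

```python
from fnmatch import fnmatchcase


def matches_filters(filename, filters):
    """Return True if filename passes the filter string.

    Patterns prefixed with "!" exclude; all other patterns include.
    An empty include list is treated as "include everything".
    """
    patterns = filters.lower().split()
    includes = [p for p in patterns if not p.startswith("!")]
    excludes = [p[1:] for p in patterns if p.startswith("!")]
    name = filename.lower()
    if any(fnmatchcase(name, pattern) for pattern in excludes):
        return False
    return not includes or any(fnmatchcase(name, pattern) for pattern in includes)
```

For example, matches_filters("app.exe", "*.* !*.exe") is False, because exclusions are checked before inclusions.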

Really should have @donho 's opinion on all this. My comments are just brainstorming and Don and I often differ in opinion on UI concerns, especially in the Find area.

mere-human commented 3 years ago

We can filter out some commonly known binary extensions like .exe, .dll, .mp4, etc. Also, this could be a separate feature - "exclude extensions".

I wonder if the binary known extensions could be on the Searching page in the Preferences.

To me, it doesn't fit there. This list seems rather dynamic. And even if we try hard, we couldn't cover all the binary extensions well. I would say, we should either do the content tests (NUL + encoding) or use the exclude extensions.

Real estate in the Find dialog is tight and is always controversial when considering change. Or some other great idea about UI...("Advanced" button mentioned)

It even has "Find files NOT containing text" option.

In N++ it could make for a bit different look to output, as currently output is Search string - File - Individual Hits - (repeat File + Hits). With this we'd just have Search string - File(s). Not really a problem.

But include/exclude lists seem particularly useful to me.

I think the trouble with this is coming up with a good UI for it. For including files, e.g. *.txt or excluding files, e.g. !*.exe, the existing UI is workable. But when it gets more complex...hmmm. Which is probably the reason that #2433 hasn't seen any action on excluding folders.

Thanks for pointing this out. That issue has a really interesting discussion.

Really should have @donho 's opinion on all this. My comments are just brainstorming and Don and I often differ in opinion on UI concerns, especially in the Find area.

mere-human commented 3 years ago

Just 2 more cents about the "Advanced..." UI options: Visual Studio search has the expanding sections and the [...] button that allows more fine-grained tuning.

collapsed
expanded
folder selection

We could incorporate some of those ideas.

mere-human commented 3 years ago

But in this issue, I'd rather concentrate on a single checkbox Search binary files, and the detection of binary files.

sasumner commented 3 years ago

But in this issue, I'd rather concentrate on a single checkbox Search binary files, and the detection of binary files.

Well, you were the one that got me talking way beyond that! ;-)

Marmotian commented 3 years ago

I'd just like to point out that classifying files with NUL characters as binary is a poor choice because many Unicode files contain NUL as every other byte. That is, unless you actually want to exclude Unicode files.

mere-human commented 3 years ago

I'd just like to point out that classifying files with NUL characters as binary is a poor choice because many Unicode files contain NUL as every other byte.

That was only one of the tests I came up with. For Unicode, one can still check whether a symbol is within a valid Unicode range or not. Besides, many Unicode files contain a BOM.

mvorisek commented 3 years ago

Unicode is mostly encoded as UTF-8, which never contains a NUL byte for any non-NUL code point (same for ASCII).
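Both points in this exchange can be checked with a short sketch: the same text has no 0x00 bytes when encoded as UTF-8, but does when encoded as UTF-16. The helper name and the sample string are made up for illustration.

```python
def contains_nul(text, encoding):
    """Encode text and report whether any 0x00 byte appears in the result."""
    return b"\x00" in text.encode(encoding)


sample = "search: é, ü, 中文"
print(contains_nul(sample, "utf-8"))      # False
print(contains_nul(sample, "utf-16-le"))  # True
```

So a plain NUL-byte test treats UTF-8 and ASCII files as text, and would misclassify UTF-16 files unless the encoding is detected first.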

donho commented 3 years ago

It's an attractive feature, so I googled for a native (Windows) way of detecting binary files. All I found was this: There is no really 100% way. You would need to use some kind of heuristic. https://stackoverflow.com/questions/2923280/detecting-if-a-file-is-binary-or-plain-text

The suggestion from @mere-human is kind of intuitive, but maybe too simple, which could lead to bad detection.

Look at the file extension.

This might be the most stable way, since if you rename a text file with the extension exe, you may not want to search it. But how many known binary extensions can we use for filtering? A large number. How many unknown binary extensions exist? Infinitely many.

So in my opinion, binary detection is a core functionality that we cannot control 100%. And it could lead to critical issues if important files are ignored because of a bad detection.

donho commented 3 years ago

OTOH, Notepad++ already detects Unicode files, with or without a BOM. Based on that, if a NUL is detected in a non-Unicode file, then it could be a binary file, and we would then have to go further with a more sophisticated algorithm for the detection.
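A rough sketch of that two-step idea, under simplifying assumptions: Unicode detection is reduced here to a BOM check (Notepad++'s BOM-less detection is not reproduced), and the function names and result labels are hypothetical.

```python
import codecs

# UTF-32 BOMs are listed first because the UTF-32-LE BOM begins with the
# same two bytes as the UTF-16-LE BOM.
_BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)


def bom_encoding(data):
    """Return the encoding named by a leading BOM, or None if there is none."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


def classify(data):
    """Label a chunk as 'unicode', 'text', or 'binary?' (a suspect, not a verdict)."""
    if bom_encoding(data) is not None:
        return "unicode"
    return "binary?" if b"\x00" in data else "text"
```

The "binary?" label marks only a candidate; as noted above, a further check would still be needed before a file is actually skipped.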

@mere-human could you elaborate this ?

When the chunk of the file is decoded to a string, its length doesn't correspond to the byte length (considering wide char) or the decoding fails.

Marmotian commented 3 years ago

A good place to start might be the *nix utility 'file' that was designed for this specific purpose. The end of the man page says:

"You can obtain the original author's latest version by anonymous FTP on ftp.astron.com in the directory /pub/file/file-X.YZ.tar.gz"

I'm sure you can also pull it from the Linux source tree as well.

donho commented 3 years ago

@Marmotian If file can be integrated into Notepad++, then we have it all :) But I'm sure that file would need to be rewritten to adapt to Windows systems' binary files.

donho commented 3 years ago

Just gave file, which comes with Git Bash, a few tries. The result is very impressive:

user@pc-9090 MINGW64 /c/xxx/notepad-plus-plus (master)
$ file ./PowerEditor/src/lesDlgs.h
./PowerEditor/src/lesDlgs.h: C source, ASCII text, with CRLF line terminators

user@pc-9090 MINGW64 /c/xxx/notepad-plus-plus (master)
$ file ./BUILD.md
./BUILD.md: ASCII text, with very long lines, with CRLF line terminators

user@pc-9090 MINGW64 /c/xxx/notepad-plus-plus (master)
$ file ./.gitignore
./.gitignore: ASCII text, with CRLF, LF line terminators

user@pc-9090 MINGW64 /c/xxx/notepad-plus-plus (master)
$ file ./PowerEditor/bin/notepad++.exe
./PowerEditor/bin/notepad++.exe: PE32 executable (GUI) Intel 80386, for MS Windows

user@pc-9090 MINGW64 /c/xxx/notepad-plus-plus (master)
$ file ./PowerEditor/bin/SciLexer.dll
./PowerEditor/bin/SciLexer.dll: PE32 executable (DLL) (GUI) Intel 80386, for MS Windows

user@pc-9090 MINGW64 /c/xxx/notepad-plus-plus (master)
$ file ./PowerEditor/src/NppIO.cpp~RF2b5794.TMP
./PowerEditor/src/NppIO.cpp~RF2b5794.TMP: C++ source, ASCII text, with CRLF line terminators

user@pc-9090 MINGW64 /c/xxx/notepad-plus-plus (master)
$ file ./PowerEditor/src/config.4zipPackage.xml
./PowerEditor/src/config.4zipPackage.xml: XML 1.0 document, ASCII text, with very long lines, with CRLF line terminators

user@pc-9090 MINGW64 /c/xxx/notepad-plus-plus (master)
$ file ./PowerEditor/installer/build/npp.7.9.2.Installer.exe
./PowerEditor/installer/build/npp.7.9.2.Installer.exe: PE32 executable (GUI) Intel 80386, for MS Windows, Nullsoft Installer self-extracting archive

user@pc-9090 MINGW64 /c/xxx/notepad-plus-plus (master)
$ file ./PowerEditor/installer/build/npp.7.9.2.Installer.x64.exe
./PowerEditor/installer/build/npp.7.9.2.Installer.x64.exe: PE32 executable (GUI) Intel 80386, for MS Windows, Nullsoft Installer self-extracting archive

user@pc-9090 MINGW64 /c/xxx/notepad-plus-plus (master)
$ file ./PowerEditor/installer/build/npp.7.9.2.portable.7z
./PowerEditor/installer/build/npp.7.9.2.portable.7z: 7-zip archive data, version 0.4

user@pc-9090 MINGW64 /c/xxx/notepad-plus-plus (master)
$ file ./PowerEditor/installer/build/npp.7.9.2.portable.zip
./PowerEditor/installer/build/npp.7.9.2.portable.zip: Zip archive data, at least v2.0 to extract

user@pc-9090 MINGW64 /c/xxx/notepad-plus-plus (master)
$ file ./PowerEditor/installer/images/header.bmp
./PowerEditor/installer/images/header.bmp: PC bitmap, Windows 3.x format, 150 x 57 x 24

user@pc-9090 MINGW64 /c/xxx/notepad-plus-plus (master)
$ file ./PowerEditor/installer/nativeLang/ta
tagalog.xml            tajikCyrillic.xml      tatar.xml
taiwaneseMandarin.xml  tamil.xml

user@pc-9090 MINGW64 /c/xxx/notepad-plus-plus (master)
$ file ./PowerEditor/installer/nativeLang/taiwaneseMandarin.xml
./PowerEditor/installer/nativeLang/taiwaneseMandarin.xml: XML 1.0 document, UTF-8 Unicode text, with CRLF line terminators

user@pc-9090 MINGW64 /c/xxx/notepad-plus-plus (master)
$ file ./PowerEditor/installer/nativeLang/french.xml
./PowerEditor/installer/nativeLang/french.xml: XML 1.0 document, UTF-8 Unicode text, with CRLF line terminators

user@pc-9090 MINGW64 /c/xxx/notepad-plus-plus (master)
$ file ./PowerEditor/installer/nativeLang/english.xml
./PowerEditor/installer/nativeLang/english.xml: XML 1.0 document, UTF-8 Unicode text, with very long lines, with CRLF line terminators

user@pc-9090 MINGW64 /c/xxx/notepad-plus-plus (master)
$ file ./PowerEditor/installer/nppSetup.nsi
./PowerEditor/installer/nppSetup.nsi: ASCII text, with CRLF line terminators

user@pc-9090 MINGW64 /c/xxx/notepad-plus-plus (master)
$ file ./PowerEditor/installer/APIs/cpp.xml
./PowerEditor/installer/APIs/cpp.xml: XML 1.0 document, ASCII text, with CRLF line terminators

Marmotian commented 3 years ago

I just ran a script on my Cygwin installation to run file on my \Windows\system32 directory on my far from fast 2017-vintage laptop. It processed 4563 files in 320 seconds (~14 files/s) and identified over 20 primary file types with many sub-types.

mere-human commented 3 years ago

@mere-human could you elaborate this ?

When the chunk of the file is decoded to a string, its length doesn't correspond to the byte length (considering wide char) or the decoding fails.

That was just an idea I got from reading the grep source code. For example, if we know the encoding already, we know the size of one character. Then, we can do wcslen(str) and if that is not equal to lineSizeInBytes / sizeOfCharacter this is probably a binary file. This is just another way to detect the NUL byte in the middle of the text.
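The length comparison can be sketched as follows. This is only an emulation of the idea, not code from grep or Notepad++: wcslen is imitated by counting code units up to the first NUL, the function names are hypothetical, and the encoding and character size (UTF-16-LE, 2 bytes) are assumed to be known in advance.

```python
def wcslen_like(text, encoding, char_size):
    """Count code units before the first NUL, as C's wcslen would."""
    prefix = text.split("\x00", 1)[0]
    return len(prefix.encode(encoding)) // char_size


def chunk_looks_binary(chunk, encoding="utf-16-le", char_size=2):
    """Flag a chunk when decoding fails or the string ends before the chunk does."""
    try:
        text = chunk.decode(encoding)
    except UnicodeDecodeError:
        return True  # the "decoding fails" case
    return wcslen_like(text, encoding, char_size) != len(chunk) // char_size
```

For example, chunk_looks_binary("he\x00lo".encode("utf-16-le")) is True because the measured length (2) is shorter than the 5 code units in the chunk.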

In general, it is a good idea to get some inspiration from grep or file. Or even try incorporating it for the sake of file detection.

Cerno-b commented 3 years ago

Just my two cents:

If there is no perfect solution, can we maybe have different heuristics and give the user the choice of which one(s) to use? Maybe via an extensible dialog like the Visual Studio one mentioned above? On the other hand, the file lib seems pretty promising, although I kind of doubt it can robustly find all different kinds of custom binary formats (arcane image formats, memory dumps, etc).
This heuristic does not have to be perfect I think. If the odd binary file passes the filter by accident, it's not a big deal, at least for searching, as long as the filter is not too strict on rejecting non-binary files. It may be dangerous for replacing, since it could trash a binary file beyond repair, but in this case, a preview would be a good idea anyway (regardless of filtering for binary)

Dialecticus commented 8 months ago

There is a perfect solution. Windows has a feature called "Perceived types". Most text files are registered from the get-go to be perceived as Text. Setups can register their own extensions as well.

Notepad++ could use angle brackets to identify those types. So if we set the search filter like this: "

All default perceived types are named with a single word, but I think it is possible to introduce more perceived types, which may have spaces in them. Hence the angle brackets.

alankilborn commented 8 months ago

So if we set the search filter like this: " !except.*" then N++ would include all extensions perceived as text files, and all perceived as audio files, and exclude files named "except" with any extension.

I don't see how your example does anything except exclude files named "except" with any extension. Where's the "angle-bracketed" things in that example?

Dialecticus commented 8 months ago

An angle-bracketed string should be expanded into all the extensions that the string represents. More info is available in the provided link in my first post. There is a Windows function AssocGetPerceivedType that returns the perceived type of a given extension. You give it ".txt" and it returns "text". We need to find the opposite functionality: a function that would return a list of extensions, given the perceived type "text". It seems there is no such Windows function, or I can't find it. But the information is in the registry, and the extensions could be enumerated from there.
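The expansion step itself could look roughly like the sketch below. Everything in it is hypothetical: the function names, the sample extension-to-type mapping (which on Windows would have to be built by enumerating the registry, not shown here), and the filter tokens used in the usage note.

```python
import re

# Hypothetical sample data standing in for what a registry scan would produce.
SAMPLE_TYPES = {".txt": "text", ".ini": "text", ".mp3": "audio", ".wav": "audio", ".png": "image"}


def extensions_for(perceived_type, ext_to_type):
    """Invert the mapping: list the extensions registered with a perceived type."""
    wanted = perceived_type.lower()
    return sorted(ext for ext, kind in ext_to_type.items() if kind.lower() == wanted)


def expand_filter(filter_text, ext_to_type):
    """Replace each <type> token with *.ext patterns; leave other tokens as-is."""
    expanded = []
    # Angle-bracketed tokens may contain spaces, so they are matched as a unit.
    for token in re.findall(r"<[^>]*>|\S+", filter_text):
        if token.startswith("<") and token.endswith(">"):
            expanded.extend("*" + ext for ext in extensions_for(token[1:-1], ext_to_type))
        else:
            expanded.append(token)
    return " ".join(expanded)
```

With the sample mapping, expand_filter("<text> <audio> !except.*", SAMPLE_TYPES) returns "*.ini *.txt *.mp3 *.wav !except.*".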

notepad-plus-plus / notepad-plus-plus

Add option to search only text files - speed up all files search #9445