Feature Request: Find Duplicates

It would be very useful if Notepad3 included an option to find duplicates.

Example:

Duplicate Words used in the TXT
Duplicate Lines used in the TXT
Duplicate Paragraphs used in the TXT
Duplicate Sentences used in the TXT
Duplicate IP Addresses used in TXT
Duplicated Trackers used in TXT

The option should compare line by line, paragraph by paragraph, sentence by sentence, tracker by tracker, IP address by IP address and, obviously, word by word, with choices to suit everyone's needs.

[Screenshot: 2021-03-03 000052]

Hello @Taiwan-2021, did you try "Search in Files (Ctrl+Shift+F)" (the last line in your picture)? 🤔

[Screenshot: 2021-03-04_043150]

Hi and thank you for the suggestion regarding "search for files" but this action requires you to select a folder path location and then manually insert the TXT file name for applying a TERM entered to "search for". The problem with that is the search only looks for that term and never finds all the other duplicates line by line. Please, see the below example in demonstration 1.

What is needed is a method that does not require entering a term at all, and just has the NotePad application find ALL the duplicates by checking "line by line" for any lines that are duplicated.

Demonstration 1)

John went to the store.
Karen went ice skating.
Cathy is dancing.
Karen went ice skating.
Cathy is dancing.
Karen went ice skating.
30,000 more lines
Etcetera…

Question, in the above list, how many lines are duplicated?

If you entered any term, like "John" for example, the search only finds "John" rather than ALL THE OTHER INSTANCES of duplication within the TXT file. As you see above, the list has many lines (30,000) that include line duplication, such as Karen and Cathy lines. By entering only a specific term, all the other duplication lines are left out.

What the application needed to do was not have any term entered at all, but rather a method of comparing every line by line for duplication. Notepad should include the method for not only line by line, but also sentence by sentence, paragraph by paragraph and any other way of grouping data that is needed to check for duplication.
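As a rough illustration of the term-free, line-by-line comparison described above, here is a minimal Python sketch. It is not part of Notepad3 and does not reflect how such a feature would be implemented there; the function name `duplicated_lines` is made up for this example, and the sample is the first six lines of Demonstration 1.

```python
from collections import Counter


def duplicated_lines(text):
    """Return {line: count} for every non-blank line that occurs more than once."""
    counts = Counter(line for line in text.splitlines() if line.strip())
    return {line: n for line, n in counts.items() if n > 1}


# First six lines of Demonstration 1.
sample = "\n".join([
    "John went to the store.",
    "Karen went ice skating.",
    "Cathy is dancing.",
    "Karen went ice skating.",
    "Cathy is dancing.",
    "Karen went ice skating.",
])

# No search term is entered: every line is compared against every other line.
for line, n in duplicated_lines(sample).items():
    print(f"{n}x  {line}")
```

For this sample it reports the Karen line three times and the Cathy line twice, while the John line is left out because it occurs only once.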

The "Search for Files" option expects you to know the term you want to apply, and it never checks for all the other ways to find duplicates within the data, that is, searching for duplication by grouping the data line by line, word by word, sentence by sentence, paragraph by paragraph, etcetera...

Have you tried using regular expressions?

For example, the following regex in the "Find" dialogue will highlight duplicated lines. Just turn on "Regular Expression search" in the dialogue. (Credit: Regular Expressions Cookbook, O'Reilly.)

^(.*)(?:\r?\n|\r)(?=[\s\S]*^\1$)

With a bit of research, you can almost certainly craft a regex to do your other tasks too.
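To see which lines this pattern actually selects, here is a small sketch using Python's `re` module. Python's engine is only a stand-in here, not the engine Notepad3 uses, so treat the result as indicative rather than as exactly what the Find dialogue will do. The sample text is the first six lines of Demonstration 1.

```python
import re

# The pattern from the post above. re.MULTILINE lets ^ and $ match at line
# boundaries, which the pattern relies on.
DUPLICATE_LINE = re.compile(r"^(.*)(?:\r?\n|\r)(?=[\s\S]*^\1$)", re.MULTILINE)

sample = (
    "John went to the store.\n"
    "Karen went ice skating.\n"
    "Cathy is dancing.\n"
    "Karen went ice skating.\n"
    "Cathy is dancing.\n"
    "Karen went ice skating."
)

# Collect (line number, line content) for every match.
matched = [
    (sample.count("\n", 0, m.start()) + 1, m.group(1))
    for m in DUPLICATE_LINE.finditer(sample)
]
for line_no, content in matched:
    print(line_no, content)

# Replacing every match with nothing keeps one copy of each line.
deduplicated = DUPLICATE_LINE.sub("", sample)
print(deduplicated)
```

In Python the pattern matches lines 2, 3 and 4 only. The lookahead `(?=[\s\S]*^\1$)` requires an identical line further down, so the final copy of each duplicated line (lines 5 and 6 here) is never matched. Replacing all matches with an empty string therefore leaves exactly one copy of each line.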

Hi, Craigo! Thank you for sharing the suggestion of using "regular expressions". I was excited to have the opportunity. However, when I tried doing this, by using your above example, I get a confusing result, see figure 1 below:

Figure 1: [Screenshot: 2021-03-04 000061]

As you see from figure 1, the application did highlight some lines, but it is not clear why it highlighted only a few lines, and in different highlight colors, when one can see with their own eyes that other lines are duplicated. Then again, how does the user of the NotePad application remove or delete those duplicated lines from the TXT file?

Why were the first and second Karen lines highlighted, but not the third, when it is in view on the screen? Why is different highlighting used for Karen line 2 and Karen line 4, but no highlighting on Karen line 6? And why isn't Cathy line 5 highlighted either? Clicking on the "Find Next" button doesn't highlight anything else, and yet your eyes can see the additional duplication in view, sitting right below.

So, I tried this again on another TXT file with 416 lines (lines are rows, but funnily the application indicates these as columns), just to get a better idea of what to expect. In figure 2 below you can observe the result:

Figure 2

[Screenshot: 2021-03-04 000062]

The exact same regular expression was used, ^(.*)(?:\r?\n|\r)(?=[\s\S]*^\1$), but this time it highlighted an empty line (row), and another line for which nothing indicates that the lines above contain a duplicate... Upon clicking "Find Next", the application then reports an "invalid" error.

Figure 2 shows a tracker list for torrents. What caused the invalid error to occur here, when it doesn't occur with alphabet letters a-z? Was it the colon, or the numbers? The regular expression ^(.*)(?:\r?\n|\r)(?=[\s\S]*^\1$) is difficult to understand without some reference explaining what each part means and in what order the parts apply.

I still think a new feature in Notepad that lists the options under FIND duplicates as "line by line", "sentence by sentence" and "paragraph by paragraph" is more convenient and easier for the majority of NotePad users, unless a full explanation of what regular expressions are, and how they apply in whatever order they are given, is provided for NotePad users here.

Maybe it is just me, but when I am expected to understand a regular expression like ^(.*)(?:\r?\n|\r)(?=[\s\S]*^\1$), it looks a lot more like a scripting language, the kind of thing a software programmer writes for applications, than what most end users expect when using an application. Normally, most end users pick an option to apply from a drop-down menu or ribbon bar.

I will admit, maybe knowing regular expressions would be more useful, as some other applications require their own shortened notation, like dddd - mmm dd, h:nn tt for instance. This is closer to human language, but it's really whatever the developer applied, and that isn't the same with every developer.

So in conclusion, the need still exists for a way of finding duplicates line by line that highlights them clearly, with the option to delete them all. I hope this was useful for everyone to see the need for an easier method that works.

If anyone uses HOSTSMAN (http://www.abelhadigital.com), it's an application that lets you find duplicate IP addresses that are inserted line by line. Say you have a list of 8,000 Microsoft Surveillance IP addresses, but you compiled this list from different sources, so there is a good chance many of them are duplicates. HOSTSMAN finds all the duplicates with 1 click, and then shows them to you with the option to remove them. So you end up with a compiled list with zero duplicates. This is just an example of another application that includes the feature I am requesting for NotePad.
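The find-then-remove workflow described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how HOSTSMAN or Notepad3 works internally; the function name `remove_duplicate_lines` is hypothetical, and the addresses are placeholders from the reserved documentation ranges.

```python
def remove_duplicate_lines(lines):
    """Split lines into (unique, removed), keeping the first copy of each line."""
    seen = set()
    unique, removed = [], []
    for line in lines:
        key = line.strip()  # ignore leading/trailing whitespace when comparing
        if key and key in seen:
            removed.append(line)  # a repeat of an earlier line
        else:
            seen.add(key)
            unique.append(line)   # first occurrence (blank lines are always kept)
    return unique, removed


# Placeholder addresses, standing in for a list compiled from several sources.
addresses = ["192.0.2.1", "198.51.100.7", "192.0.2.1", "203.0.113.9", "198.51.100.7"]

unique, removed = remove_duplicate_lines(addresses)
print("kept:   ", unique)
print("removed:", removed)
```

Returning the removed lines separately lets a user review what would be deleted before committing to it, which matches the "show them, then offer to remove them" behaviour described for HOSTSMAN.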

What Notepad3 version have you been using?

I would start by using a portable version of the most recent build, to ensure you're using recent code and not inheriting any problems from existing config files.

The regex might also treat blank lines as a match. You can make the match more specific to your use case by matching lines that start with "udp", e.g.:

^(udp.*)(?:\r?\n|\r)(?=[\s\S]*^\1$)
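The effect of narrowing the pattern can be checked with a short Python sketch. Python's `re` module is used as a stand-in and may not behave identically to Notepad3's engine, and the tracker URLs are invented placeholders.

```python
import re

# General pattern: any line (including a blank one) that reappears further down.
GENERAL = re.compile(r"^(.*)(?:\r?\n|\r)(?=[\s\S]*^\1$)", re.MULTILINE)
# Narrowed pattern: only lines beginning with "udp" are candidates.
UDP_ONLY = re.compile(r"^(udp.*)(?:\r?\n|\r)(?=[\s\S]*^\1$)", re.MULTILINE)

# Invented tracker list with blank separator lines and one duplicated entry.
trackers = (
    "udp://tracker.example:80\n"
    "\n"
    "http://tracker.example/announce\n"
    "\n"
    "udp://tracker.example:80\n"
)

general_hits = [m.group(1) for m in GENERAL.finditer(trackers)]
udp_hits = [m.group(1) for m in UDP_ONLY.finditer(trackers)]

print("general:", general_hits)
print("udp only:", udp_hits)
```

With the general pattern, the blank separator lines count as duplicates of each other, so empty matches show up among the hits. The `udp` prefix excludes them and leaves only the repeated tracker. A list that also contains other protocols would need a broader prefix, for example an alternation such as `(?:udp|http)`.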

I recently had a use case where I needed to examine thousands of XML files and remove duplicate lines that had a specific string of characters in them. In the end I got the job done with Notepad3, regex and GrepWinNP3 (Search in Files). And an O'Reilly book 😄

I think a "Find Duplicate Lines" feature in Notepad3 would be great and I'd support adding it, but it would only serve one of your purposes and it wouldn't have served my recent need. Whereas regex will probably serve them all - and any other uses you can dream up in the future. Regex was my saviour. It is immensely powerful and utterly flexible. And a real pain to learn. But worth it.

Hi, Craigo! In answer to your question about what version I am using, it's Notepad3 (x64) v5.21.125.1. Does this mean that regex gives different results with each new version of NotePad, and does regex follow the same rules in different applications? How many different versions of regex exist would be my next question. Is regex standardized and universal, or mostly tied to a specific manner of usage in only some applications? I haven't seen the regex option in other software applications, although perhaps it did exist and I just didn't know it. Who would know? Does Microsoft Word support it, for instance?

Please do not laugh, but in Microsoft Windows, "environmental variables" are hidden from the end user, so I didn't know that I could use them like in Windows Explorer to navigate such as %temp% without needing to type out the whole path. See https://ss64.com/nt/syntax-variables.html and this for "run commands" https://superuser.com/questions/217504/is-there-a-list-of-windows-special-directories-shortcuts-like-temp which even mentions a GodMode.{ED7BA470-8E54-465E-825C-99712043E01C} and in Windows 10, GodMode was changed to JeezMode.{ED7BA470-8E54-465E-825C-99712043E01C}

Unfortunately, the tracking list for torrent servers also contains protocols other than just UDP, such as HTTP for instance. As for only serving one of my purposes: compared with a built-in feature to find duplicates line by line, learning regular expressions from O'Reilly or another source would require a "bit of investment in time". English isn't my native language, and O'Reilly books like this (technical books that are very specific in scope) are not sold anywhere near my residence (a farming community), or for that matter maybe not even in my country in my native language (Bisaya, an Austronesian language) in the Philippines. Perhaps online would be the only way to obtain such a reference source?

I do appreciate and understand the value of learning, and I realize that if the correct regular expression (regex) is adopted rather than a built-in feature, it can be applied in many more ways. The same could be said if I would just learn high-level programming and code my own software application too, but many of us have our jobs and our families, and there just isn't time for these investments. It's why so many individuals pay for a solution, if they can even do that. Many of us have no idea what to even ask for, mostly because we are not experienced in this sort of thing. But that is not to say we cannot learn, and there is so much to learn about.

When looking into regular expressions, there is a huge amount to grasp about how each expression functions, and there are many of these expressions. How does one quickly find the correct expression to apply without having to master the whole book or source of them? I would prefer speaking to the computer, telling it in my own language what I need to do, and having it do it by filling in these terms. In fact, that is the future of software engineering: not to learn every new language and every new syntax, but for some A.I. (artificial intelligence), a machine that remembers flawlessly, unlike me, to just fill in what needs to be done for the job or task.

That would be a million times better than needing to learn regex and all those other languages, syntax, variables, environments, etcetera or whatnot to remember... How about someone introducing "Intelligent Software" that does the low-level stuff (fills in the gaps, writes the code to do the task for you automatically), so people can focus primarily on exactly what they are thinking and working on? I believe intelligent software is the NEXT trillionaire empire, one that would better support the intellectual economy, ushering in a new paradigm shift for humanity.

Why force everyone to remember trivial codes that nobody will be capable of comprehending year after year of development? It's the same story with stored data: the original user is expected to remember where they put all those files, making it nearly impossible for a different person to know where to find them, as the original individual most certainly applies their own method of labeling folders and naming files, in their own language, which fails to help speakers of other languages. For us to move beyond all this, a better method needs to be applied, where humans are not required to remember the location in a folder hierarchy and the file name for later data retrieval. In addition, everyone organizes their data differently, because of their experience, specific usage, criteria and needs. That's assuming the user assigned a proper “coherent” name to those folders and files to store the data.

What is missing in our data is recall, relevancy and distribution. An intelligent software that applies artificial intelligence can do this for us, so whenever someone needs to recall, they only need to do one thing, think in terms of what is relevant! And this makes everyone better for it, no chaos, just relevancy applied. As for distribution, the AI software applies this in the background so we never lose our data, by physical or mental references. There might even be a synaptic interface to connect accessing data, that's faster than using our eyes or ears. The human brain operates many times faster than our tactile sensors. But we haven't yet fully explored all the possibilities. This could all lead to a collective intelligence, with more integration in our ability to merge machine intelligence with our own human intelligence, to extend and evolve ourselves as a species. It's one way to survive machine intelligence, that brings the best of both together. No wars needed, no fighting over resources, no monopolies, just expand human existence using "machine intelligence" by bridging our minds at first with machines and then later on using "collective intelligence". It's just a thought... What we live for today, what we work for in our lives, will all change, if we just apply ourselves to evolve and adapt when we are honest, truthful and responsible for the actions we undergo and undertake.

Human society is on a verge of profound illumination, think of biogerontechnology - the advancement of the science and technology underlying the biological aging process which has the potential to not only extend the average natural lifespan, but also to simultaneously postpone many if not all of the costly and disabling conditions that humans experience in later life. Imagine never growing old with aging, your biological process is maintained, extending useful living to hundreds maybe even thousands of years, to be without many common aging diseases of today. People living thousands of years should be a lot wiser than those living 50 for instance. Add in machine intelligence, as it's a perfect mix of technology to repair biological system against aging, extending lifespans with greater machine intelligence.

I think the question then becomes what does humanity do with itself, but to continue the journey into universal exploration, ascending into immortal consciousness, expansion into other realms and dimensions of reality. Although our species will no longer be human by then, it will have become so much more. But it's fragile now, so precarious and subjected to a multitude of risks: EMP or even a GEOMAGNETIC STORM from our sun. Add in climate change, gamma ray bursts from black holes, the advancing Andromeda galaxy on a collision course with our Milky Way galaxy, the next ice age, or the next super volcano eruption, maybe even GMO (genetically modified organisms), and it means humanity is on the clock, literally. Maybe Elon Musk was right to start colonizing Mars sooner...

A lot is going on and changing now, be it COVID-19, the economy, the US, China, the EU, the UK, the way people are thinking, quantum computing, digital currencies, disruptive intelligence. It's so difficult to keep up!!! This is the age of transforming your work, your life, your world. So who gets to design the algorithms that will define human actions, human economics, human behavior and human existence? Where is the framework that includes ethics, wisdom, principles and values? What about sustainability, the environment and responsibility?

Okay, this is long and getting off the post topic, but sometimes we need to say more, to just be heard, to know we are not alone. Thanks for the assistance. :-)

I am not going to respond to the philosophical excursus you posted here, because this is not the right place and, unfortunately, it would blow up this issue. Here are some short answers to your issue-related questions:

For learning Regular Expressions you can find a lot of (more or less good) web sites. A good starting point could be, e.g.: https://github.com/ziishaned/learn-regex (Ed.: a good reference: https://www.regular-expressions.info/duplicatelines.html)
Yes, there are a lot of different RegEx dialects/flavours, in principle (core) the same, but with small differences.
Notepad3 plugs in the Oniguruma RegEx engine using its default syntax, which is close to the syntax used in the programming language Ruby. Maybe a future option in Notepad3 would be to switch the dialect/flavour of the RegEx engine by configuration 🤔.
The Oniguruma RegEx engine and Notepad3's editing engine (Scintilla) are glued together by a small interface layer, which enables the communication between both engines.
Yes, Oniguruma, Scintilla and the small layer are still being developed, which introduces bugs that have to be fixed, so the question about the version is important. E.g. Oniguruma learned the new dialect "Python", and I fixed a bug in the small glue layer, which did not handle continuing after zero-length matches correctly, etc.
Elon Musk is not right in colonizing Mars, it is like trying to check the one and only lifeboat, while the Titanic is steering into the iceberg (better change the course).
I have to digest the idea (feature request) of "finding any duplicates" a little bit (lines only, paragraph related, function/methods, anything else ...) (we already have the feature "finding duplicates of selected text" called "Focused View").
Different colors: the Current Line Indicator is different from the Occurrences Indicator, which is different from the Selection Indicator, plus possible overlays of transparent Indicators (especially in "Highlight Focused View").

Hi, RaiKoHoff! Thank you for sharing some information about Notepad v3 using Scintilla with the RegEx Oniguruma engine.

As you said, the effort is on-going (still developing), and its past history is full of dialects/flavors (variants), which only makes using this syntax of regular expressions more complex and complicated over time with each new version, revision, change or deprecation.

Question: should Notepad v3, which was originally a Microsoft Notepad REPLACEMENT, now expect its end users to learn a programming language like Ruby just to perform a common task that could be added as a feature in the software application?

If the "feature request" is built in, it works for EVERYONE using that version of Notepad. The PC end user doesn't need to fiddle around with syntax, coding, programming languages, trying or attempting to discover what functions properly for it.

What should a replacement for the Microsoft Notepad software application be designed for;

a) features (easy for PC end users) to be applied by everyone...?
b) or programmers needing to understand a library of code to work with?

What is the "difference between" PC end users of software and those who write code in programming;

a) end users expect solutions (usually easy) to be applied.
b) programmers NEED to create the code (instructions) for an action to be applied.

My original "Feature Request" was to add in a solution that lets EVERYONE find duplicates with the option of grouping (arranging) data by;

a) line by line
b) sentence by sentence
c) paragraph by paragraph
d) etcetera...

A method which doesn't require entering any term or word but does support searching for duplication by the arrangement of data. This is extremely useful whenever anyone is pasting together prior work or previous information/data that often gets duplicated because different individuals compiled it.
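One way to picture the "grouping" idea is a single routine that first splits the text into units (lines, sentences, paragraphs or words) and then counts repeats. The Python sketch below is purely illustrative: `SPLITTERS` and `find_duplicates` are hypothetical names, and the sentence splitter is deliberately naive (it breaks after `.`, `!` or `?` followed by whitespace), so real text would need something more careful.

```python
import re
from collections import Counter

# Hypothetical splitters: each turns a text into the units to be compared.
SPLITTERS = {
    "line": lambda text: text.splitlines(),
    "paragraph": lambda text: re.split(r"\n\s*\n", text),       # blank-line separated
    "sentence": lambda text: re.split(r"(?<=[.!?])\s+", text),  # naive sentence split
    "word": lambda text: re.findall(r"\w+", text.lower()),      # case-insensitive words
}


def find_duplicates(text, unit="line"):
    """Count the units of the chosen kind that appear more than once."""
    units = (u.strip() for u in SPLITTERS[unit](text))
    counts = Counter(u for u in units if u)
    return {u: n for u, n in counts.items() if n > 1}


text = (
    "Karen went ice skating. Cathy is dancing.\n"
    "\n"
    "Karen went ice skating. John went to the store."
)

print(find_duplicates(text, "sentence"))
print(find_duplicates(text, "line"))
print(find_duplicates(text, "word"))
```

The same text gives different answers depending on the unit: one sentence is repeated, no whole line is repeated, and several individual words are. This is why the grouping has to be chosen before duplicates can be reported.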

Obviously, I cannot speak on behalf of everyone using Notepad v3, but I would say most PC end users who replaced their Microsoft Notepad with Notepad v3 did so for the convenience of obtaining new features and functions, rather than expecting to learn a programming language like Ruby just to implement a common task.

Look at Figure 1 (below)

[Screenshot: 2021-03-07 000066]

Notice how most of the "menu options", support a feature that is common; like cut, copy, paste, search > replace, but search > for duplication is missing!!!

Of course, there is nothing wrong with inserting experimental (non-standardized) code for the purpose of extending the usefulness of the software application, but "regular expressions" aren't so regular for PC "end users", unless they are using programming languages like Ruby.

Also, NotePad wasn't adopted by its PC end users as a programming application for Ruby on Rails, and it doesn't support intelligent code features the way Visual Studio or Visual Studio Code does.

Undoubtedly, by supporting more ways to do things, like using "regular expressions", can add value, but without knowing a programming language like Ruby, it's not helpful to those PC end users as compared to code writers and programmers.

So, whom is Notepad most suited for, code writers that require learning programming languages or a niche environment where PC end users who had replaced the original Microsoft Notepad to obtain more features and functions?

Will Notepad v3 want to compete against Atom, Sublime Text 3, Codespaces, GNU Emacs, Visual Studio Code functionality for code writers and software programmers/engineers?

If Notepad v3 plans to continue to be a Notepad replacement software application, then why shouldn't it include the common task of finding data duplication?

Note: It's important to distinguish the method of detecting how the data is grouped (arranged) for sorting.

Examples;

• Line by Line (row by row)
• Sentence by Sentence
• Paragraph by Paragraph
• Section by Section
• Chapter by Chapter
• File By File

There are many more ways to group or arrange data, but EVERYONE would benefit from having a feature for finding "duplication".

Dear @Taiwan-2021 ,

First of all, please stop spamming this thread with pages and pages of off-topic comments! 😬
Second, @RaiKoHoff has already responded to your "Feature Request" (see above)!
Repeating it over and over will not change anything!
When you compare with other applications, remember that Notepad3 is a "FREE Open Source Application"!
And also take into account this important note: Rizonesoft Support ! 🤔

@ghost (Deleted user ghost), because "Taiwan-2021" has left GitHub.com, I'm closing this issue ! 😿

rizonesoft / Notepad3

Feature Request: Find Duplicates #3174