notepad-plus-plus / notepad-plus-plus

Notepad++ official repository
https://notepad-plus-plus.org/

Alphabetical text sorting does not work very well, especially for not-English text (but not only) #13456

Open grzegorj opened 1 year ago

grzegorj commented 1 year ago

Description of the Issue

The expected behaviour of the sort function is to operate correctly not only for English. Unfortunately, it is not the case.

Steps to Reproduce the Issue

Open a new text file and write, one word per line, e.g.:

laka
łąka
mąka
mika

(note EOL = the end of the line at the end of the text) These are Polish words, and the lines are already sorted correctly in alphabetical order. Nevertheless, try to sort them in Notepad++.

Expected Behavior

The function “Sort” should not change anything in this text

Actual Behavior

The final EOL is moved to the beginning of the file, and the words are no longer correctly sorted (the correct alphabetical order is totally ignored). You get:

laka
mika
mąka
łąka

(note the empty line at the beginning; there is no EOL after the word “łąka”)

The bug was verified in Notepad++ v. 8.5. BTW, the correct alphabetical order in Polish is aąbcćdeęfghijklłmnńoópqrsśtuvwxyzźż. Each letter is treated separately, not as a variant with a diacritic.

alankilborn commented 1 year ago

There are TWO distinct issues here.
It is too bad that you combined them, as one could easily be fixed, but not the other. Combining them probably means nothing gets fixed.

grzegorj commented 1 year ago

Indeed, there are two issues (but within the same scope). For me, ignoring the correct alphabetical order is much more important than the issue with EOL (which I noticed by the way). Either Notepad++ is “totally English” (and then it can sort only according to ASCII codes), or it is international. But if the latter is true, the inability to sort text lines correctly is a very serious deficiency of the program. Of the programs I know, only MS Word can do it correctly, using the system locale (which is present under Windows) instead of low-level sorting based on ASCII codes. It would be REALLY nice if also Notepad++ had a similar option. And it would no longer be as English-biased as it is now, despite the existing superficial localizations (of the menu etc.). Really. The thing is worth fixing. Even if it may not be very simple.

rdipardo commented 1 year ago

Of the programs I know, only MS Word can do it correctly, using the system locale (which is present under Windows) instead of low-level sorting based on ASCII codes. It would be REALLY nice if also Notepad++ had a similar option.

Apples to oranges.

Notepad++ is a source code editor. Source code is almost always (for historical reasons) a plain text file of ASCII keywords and punctuation symbols.

MS Word is a business document editor. Businesses have offices around the world, so naturally their software can intelligently edit internationalized text.

grzegorj commented 1 year ago

Notepad++ is a source code editor. Source code is almost always (for historical reasons) a plain text file of ASCII keywords and punctuation symbols.

Not exactly.

Notepad++ is a free (as in “free speech” and also as in “free beer”) source code editor and Notepad replacement that supports several languages. (https://notepad-plus-plus.org/)

As you can read, Notepad++ is not only a source code editor. It is also a Notepad replacement. And it supposedly supports several languages. That is not true: in its present shape, it cannot sort text in several languages. So, there IS a problem with sorting, contrary to what has been written.

Besides, Notepad++ can handle Unicode text. Programmers do not need Unicode to write code! So, the statement that Notepad++ is only a source code editor is completely groundless. So, rdipardo, don’t you think that linking the Wikipedia article on the historical reasons for using English in source code had no relation to the subject and was just rude on your part?

Perhaps you use Notepad++ only for writing code. I do not, and plenty of other users do not either. I also use it for my non-business notes, and that is fully consistent with the program’s declared purpose. Notepad is also not a business notes editor, and also not a source code editor.

Moreover, the function in question is called “Sort”, and not “Sort the code lines”! To tell the truth, I cannot imagine the need to sort code. But I do see a need to sort a text file that is not code. And such a text needn’t be in English. So, the lack of correct sorting is a serious bug of the program, and there is a need to fix it.

In other words: source code does not need to be sorted. If Notepad++ were only a source code editor, it would not need the sort function at all. But if it has the function, it really should sort texts correctly, and not only in accordance with the English alphabet.

I am not a programmer and I cannot help with it, but I do not think it is a really hard thing to implement correct sorting of texts, so I cannot understand this negative argumentation. It is enough to forget the ASCII codes of the characters and to use different codes instead. For a Polish text: assign “0” to space, “1” to “a” (and “A”, and also “á”, “Á”, “à”, “À”, “ä”, “Ä”, “å”, “Å” etc., as foreign letters “a” with diacritics should also be sorted in Polish text as if they were simple “a”’s), “2” to “ą” (and to “Ą”), “3” to “b” etc. And then sort the text in accordance with these codes, and not with ASCII codes. Make a similar table of codes for each language Notepad++ supports. That’s all. I do not think it would be really hard. The sorting procedure itself should be exactly the same as it is now, except that the letter codes mentioned above would be compared, not ASCII codes as now.
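A minimal sketch of the table-based idea described above, in portable C++ rather than against the Notepad++ code base. The names `kPolishGroups`, `buildRankTable` and `sortByAlphabet` are hypothetical, the table is abbreviated, and characters missing from the table are simply placed after all known ones (an assumption; the proposal does not say what to do with them). Lines are compared by alphabet position only, so case and foreign diacritics do not break ties here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Assumes this source file is saved as UTF-8.
// Each string lists the characters that share one position in the alphabet.
// Abbreviated, hypothetical table for Polish.
const std::vector<std::u32string> kPolishGroups = {
    U" ",
    U"aAáÁàÀäÄåÅ", U"ąĄ", U"bB", U"cCčČ", U"ćĆ", U"dD", U"eE", U"ęĘ", U"fF",
    U"gG", U"hH", U"iI", U"jJ", U"kK", U"lL", U"łŁ", U"mM", U"nN", U"ńŃ",
    U"oO", U"óÓ", U"pP", U"qQ", U"rR", U"sS", U"śŚ", U"tT", U"uU", U"vV",
    U"wW", U"xX", U"yY", U"zZ", U"źŹ", U"żŻ",
};

// Maps every listed character to the index of its group.
std::unordered_map<char32_t, int> buildRankTable(const std::vector<std::u32string>& groups)
{
    std::unordered_map<char32_t, int> rank;
    for (std::size_t i = 0; i < groups.size(); ++i)
        for (char32_t ch : groups[i])
            rank[ch] = static_cast<int>(i);
    return rank;
}

// Sorts lines by their sequence of alphabet positions instead of code points.
std::vector<std::u32string> sortByAlphabet(std::vector<std::u32string> lines,
                                           const std::vector<std::u32string>& groups)
{
    assert(!groups.empty());
    const auto rank = buildRankTable(groups);
    const int unknownBase = static_cast<int>(groups.size());

    // Decorate: pair each line with its key so keys are computed only once.
    std::vector<std::pair<std::vector<int>, std::u32string>> keyed;
    keyed.reserve(lines.size());
    for (auto& line : lines) {
        std::vector<int> key;
        key.reserve(line.size());
        for (char32_t ch : line) {
            const auto it = rank.find(ch);
            // Unknown characters sort after the table, ordered by code point.
            key.push_back(it != rank.end() ? it->second
                                           : unknownBase + static_cast<int>(ch));
        }
        keyed.emplace_back(std::move(key), std::move(line));
    }

    // std::vector<int> compares lexicographically, which is what we need.
    std::stable_sort(keyed.begin(), keyed.end(),
                     [](const auto& a, const auto& b) { return a.first < b.first; });

    std::vector<std::u32string> sorted;
    sorted.reserve(keyed.size());
    for (auto& entry : keyed)
        sorted.push_back(std::move(entry.second));
    return sorted;
}
```

With this table, the lines mika, łąka, laka, mąka come out as laka, łąka, mąka, mika, which is the order given in the first post.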

Once again, if rdipardo uses Notepad++ as a source code editor, he does not need to use sorting at all. But I use it for different purposes, and I need sorting. Since the program is declared to be a Notepad replacement and not a source code editor only, please stop discuss about the motivation, and just fix the bug of the program.

Thank you in advance.

grzegorj commented 1 year ago

One more note on what rdipardo wrote.

As I have noticed, Notepad++ does not use a two-phase sorting procedure. In such a procedure, capital and lowercase letters are first treated the same, and only then, for otherwise identical words, do capital letters gain priority over the corresponding lowercase letters.

Instead, for example, the text:

Basia
asia
basia
Asia

is sorted to:

Asia
Basia
asia
basia

(with the mentioned moving of the EOL to the beginning), which is totally incorrect even for code writers!

The expected result of sorting is:

Asia
asia
Basia
basia

(with the option “capitals first”), or:

asia
Asia
basia
Basia

(with the option “lowercase first”).
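The two-phase procedure described above can be sketched as a comparator. This is an illustration only: `lowerAscii`, `twoPhaseLess` and `sortTwoPhase` are hypothetical names, and the case folding is deliberately limited to ASCII A-Z, so it says nothing about how non-ASCII letters should be folded.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// ASCII-only lower-casing; anything outside A-Z is left untouched.
std::string lowerAscii(std::string s)
{
    for (char& c : s)
        if (c >= 'A' && c <= 'Z')
            c = static_cast<char>(c - 'A' + 'a');
    return s;
}

// Phase 1 ignores case; phase 2 runs only when phase 1 finds no difference.
bool twoPhaseLess(const std::string& a, const std::string& b, bool capitalsFirst)
{
    const std::string la = lowerAscii(a);
    const std::string lb = lowerAscii(b);
    if (la != lb)
        return la < lb;

    // Same letters, possibly different case. In ASCII 'A' < 'a', so a plain
    // comparison puts capitals first and the reversed one puts lowercase first.
    assert(a.size() == b.size());
    return capitalsFirst ? a < b : b < a;
}

std::vector<std::string> sortTwoPhase(std::vector<std::string> lines, bool capitalsFirst)
{
    std::stable_sort(lines.begin(), lines.end(),
                     [capitalsFirst](const std::string& x, const std::string& y) {
                         return twoPhaseLess(x, y, capitalsFirst);
                     });
    return lines;
}
```

For the input Basia, asia, basia, Asia this yields Asia, asia, Basia, basia with `capitalsFirst` set, and asia, Asia, basia, Basia without it.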

So, Notepad++ cannot sort correctly even English text.

BTW. The correct way of sorting should take into account letters with diacritics. E.g. German umlauted letters, contrary to Polish letters of the type “ą” or “ł”, must be treated as simple letters when sorting. But when two words differ only in the umlaut, the plain letter must come first. Ex. the correct alphabetic order is: Ohr, Öhr, Ohrenarzt, ohrenbetäubend, Ohrenbläser.

In Polish, “c” and “ć” are different, but “c” and “č” are the same (“č” is not a letter of the Polish alphabet). So, the correct order is: cap, capek, Čapek, car, czysty, ćma. Notice that “c” and “č” have the same place in the alphabet sorting, but when “háček” is the only difference, the word with the accented letter must follow the word with the plain letter.
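One way to express "same place in the alphabet, but the plain letter first" is a key with two levels: a primary level holding alphabet positions and a secondary level that only records where a foreign diacritic occurred. The sketch below assumes a Polish alphabet with a small, hypothetical list of foreign letters (`kForeignBase`); all names are made up for illustration, and case is ignored entirely.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <tuple>
#include <unordered_map>
#include <vector>

// Assumes this source file is saved as UTF-8. Both alphabets are index-aligned.
const std::u32string kLower = U"aąbcćdeęfghijklłmnńoópqrsśtuvwxyzźż";
const std::u32string kUpper = U"AĄBCĆDEĘFGHIJKLŁMNŃOÓPQRSŚTUVWXYZŹŻ";

// Letters foreign to Polish, mapped to the base letter they sort with (sample only).
const std::unordered_map<char32_t, char32_t> kForeignBase = {
    {U'č', U'c'}, {U'Č', U'C'}, {U'á', U'a'}, {U'Á', U'A'},
    {U'ä', U'a'}, {U'Ä', U'A'}, {U'ö', U'o'}, {U'Ö', U'O'},
    {U'ü', U'u'}, {U'Ü', U'U'},
};

struct CollationKey {
    std::vector<int> primary;    // alphabet position per character, case ignored
    std::vector<int> secondary;  // 1 where the character carried a foreign diacritic
};

CollationKey makeKey(const std::u32string& s)
{
    assert(kLower.size() == kUpper.size());
    CollationKey key;
    for (char32_t ch : s) {
        int accent = 0;
        const auto foreign = kForeignBase.find(ch);
        if (foreign != kForeignBase.end()) {
            ch = foreign->second;  // compare as the base letter...
            accent = 1;            // ...but remember the diacritic for tie-breaking
        }
        std::size_t pos = kLower.find(ch);
        if (pos == std::u32string::npos)
            pos = kUpper.find(ch);
        // Characters outside the alphabet go after it, ordered by code point.
        key.primary.push_back(pos != std::u32string::npos
                                  ? static_cast<int>(pos)
                                  : static_cast<int>(kLower.size()) + static_cast<int>(ch));
        key.secondary.push_back(accent);
    }
    return key;
}

// The secondary level is consulted only when the primary levels are equal.
bool collationLess(const std::u32string& a, const std::u32string& b)
{
    const CollationKey ka = makeKey(a);
    const CollationKey kb = makeKey(b);
    return std::tie(ka.primary, ka.secondary) < std::tie(kb.primary, kb.secondary);
}
```

Used with `std::stable_sort`, this reproduces the order cap, capek, Čapek, car, czysty, ćma. The German example also comes out as listed, but only because this sample table happens to fold “ö” and “ä” onto “o” and “a”; a real German table would be a separate thing.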

Sorting is not a simple procedure (but should really be simple to implement), and treating sorting of text lines as sorting of ASCII codes is a serious bug. Even for programmers...

grzegorj commented 1 year ago

I have just found a program which is declared to be purely a programmer’s editor, PsPad (http://www.pspad.com/en/). Although it is purely a source code editor and not a text editor, it can sort texts fully correctly, and even better than MS Word does.

Anyway, comparisons of oranges to apples are just clear fantasies of an arrogant person. We should collaborate and not admonish others and not pretend to be smart (while being nothing more than rude) and send others away to Wikipedia articles for them to learn one thing or another. PsPad is not a “business document editor” and can sort text files correctly. So, Notepad++ has a serious bug, and lacks correct sorting.

Dear developers, please correct this bug. Take an example from the other source code editor!

molsonkiko commented 1 year ago

@grzegorj , in spite of your obnoxious lecturing of programmers about what "should really be simple to implement" despite having not the faintest idea of how little relationship there is between the outside-view perceived difficulty of programming something and the actual difficulty of implementing it, I happen to be interested in trying to work on this problem, or at least a more tractable subset of it.

Newer versions of Notepad++ have Line Operations->Sort lines lexicographically [asc/desc]ending ignoring case, which addresses one of the issues you mentioned in your last post. 8.3.3, which is a year old, is the oldest version I've checked that has it.

I see what you're saying about the transposition of empty lines when you sort the file. Probably the easiest solution would simply be to remove all empty lines from the region to be sorted before sorting.

It also looks like I could add a function similar to ToUpperInvariant, but using LOCALE_USER_DEFAULT instead of LOCALE_INVARIANT.

So my new function would be

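// Upper-cases a single character using the current user's locale
// (like ToUpperInvariant, but with LOCALE_USER_DEFAULT instead of LOCALE_INVARIANT).
static TCHAR ToUpperCultureSensitive(TCHAR input)
{
    TCHAR result;
    // LCMapString returns the number of characters written, or 0 on failure.
    const int lres = LCMapString(LOCALE_USER_DEFAULT, LCMAP_UPPERCASE, &input, 1, &result, 1);
    if (lres == 0)
    {
        assert(false && "LCMapString failed to convert a character to upper case culture-sensitively");
        result = input; // fall back to the unchanged character
    }
    return result;
}
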
molsonkiko commented 1 year ago

Any help trying to make my new sorter class culture-sensitive would be appreciated. I'm pretty sure it doesn't work yet though.

alankilborn commented 1 year ago

(note the empty line at the beginning; there is no EOL after the word “łąka”)

I see what you're saying about the transposition of empty lines when you sort the file.

@grzegorj 's complaint regarding something that is empty is that sorting (lex. ascending) this file:

image

incorrectly results in this:

image

How is that in any way correct? But, stunningly, the author of Notepad++ judged it so. :-(

Further, and also stunning, if one does Ctrl+a before the sort, to select all text and every line in the file:

image

Then the sort result IS correct:

image

Probably the easiest solution would simply be to remove all empty lines

If one does Remove Empty Lines, pre-sort:

image

And then sorts, a reasonable result is obtained:

image

But, as I don't consider the last "line" as a true line unless it has a line-ending on it (and I enforce this, for all of my files, with the editorconfig plugin), I can't obtain the sort result I want without doing a lot of "pre-thinking" about how to obtain correct results.

THIS is the part that I wish @grzegorj had broken out into a completely separate issue when I said:

There are TWO distinct issues here.

Combining it with a language-specific sorting order problem completely loses this detail.

molsonkiko commented 1 year ago

Good point.

It looks like there's some logic for removing empty lines in NumericSorter in Sorters.h that could be changed slightly, moved to a separate function, and called before sorting in each of the ISorter implementations in that file.

alankilborn commented 1 year ago

It looks like there's some logic for removing empty lines in NumericSorter ...

No empty lines need to be removed. More care just needs to be taken to exclude that empty non-line at the end of the file, rather than letting it get dragged into the sort (and thus "sorted" to the top of the results).
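Independently of Scintilla, the point can be sketched with plain strings: split on the EOL, set aside the empty piece that a trailing EOL produces, sort the rest, and put the EOL back at the end. `splitLines` and `sortLinesKeepingFinalEol` are hypothetical names, and the sort itself is an ordinary byte-wise `std::sort`, so this shows only the EOL handling, not locale-aware ordering.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Splits text on eol; a trailing eol yields one final empty piece.
std::vector<std::string> splitLines(const std::string& text, const std::string& eol)
{
    assert(!eol.empty());
    std::vector<std::string> parts;
    std::size_t start = 0;
    std::size_t pos = text.find(eol, start);
    while (pos != std::string::npos) {
        parts.push_back(text.substr(start, pos - start));
        start = pos + eol.size();
        pos = text.find(eol, start);
    }
    parts.push_back(text.substr(start));
    return parts;
}

// Sorts the lines of text but leaves a final eol where it was, at the end.
std::string sortLinesKeepingFinalEol(const std::string& text, const std::string& eol)
{
    std::vector<std::string> lines = splitLines(text, eol);

    // The empty piece after the last eol is not a line; keep it out of the sort.
    const bool endsWithEol = lines.size() > 1 && lines.back().empty();
    if (endsWithEol)
        lines.pop_back();

    std::sort(lines.begin(), lines.end());

    std::string joined;
    for (std::size_t i = 0; i < lines.size(); ++i) {
        if (i > 0)
            joined += eol;
        joined += lines[i];
    }
    if (endsWithEol)
        joined += eol;
    return joined;
}
```

Genuinely empty lines in the middle of the text are still sorted like any other line; only the non-line after the final EOL is treated specially.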

molsonkiko commented 1 year ago

Maybe changing this function to the version below would work? It did for me locally: now, if there's an empty line at EOF, it's not moved. The current version of this function specifically chose not to remove that last empty line when sorting the entire document. So basically, it went out of its way to behave in a way you didn't like.

void ScintillaEditView::sortLines(size_t fromLine, size_t toLine, ISorter* pSort)
{
    if (fromLine >= toLine)
    {
        return;
    }

    const auto startPos = execute(SCI_POSITIONFROMLINE, fromLine);
    const auto endPos = execute(SCI_POSITIONFROMLINE, toLine) + execute(SCI_LINELENGTH, toLine);
    const generic_string text = getGenericTextAsString(startPos, endPos);
    std::vector<generic_string> splitText = stringSplit(text, getEOLString());

    // A trailing EOL leaves an empty final piece; keep it out of the sort.
    const bool lastLineEmpty = splitText.rbegin()->empty();
    if (lastLineEmpty)
    {
        splitText.pop_back();
    }
    // After the pop, the count is one short when toLine is the document's empty final line.
    const size_t lineCount = toLine - fromLine + 1;
    assert(splitText.size() == lineCount || splitText.size() + 1 == lineCount);

    const std::vector<generic_string> sortedText = pSort->sort(splitText);
    generic_string joined = stringJoin(sortedText, getEOLString());
    if (lastLineEmpty)
    {
        // Put the final EOL back where it was, at the end.
        joined += getEOLString();
    }
    // Sorting is expected only to reorder lines, so the total length should be unchanged.
    assert(joined.length() == text.length());

    if (text != joined)
    {
        replaceTarget(joined.c_str(), startPos, endPos);
    }
}

alankilborn commented 1 year ago

@grzegorj

A new plugin called Columns++ offers a sorting option; you may want to try it to see if it sorts your text differently/better.

See:

molsonkiko commented 1 year ago

Second alankilborn's suggestion. Caveat: (at the time of this writing) you need to select the region you want to sort first, or the command does nothing. Before:

bass
baßk 
bLue
blue
blüe
blve
oyster
öyster
spä
spb
Spb

after Columns++ -> sort descending (locale):

spb
Spb
spä
öyster
oyster
blve
blüe
bLue
blue
baßk 
bass

Given that this functionality now exists in a plugin, do people think this request is too niche to bother including as a default feature, or should it be added anyway (probably stealing code from the plugin)?

alankilborn commented 1 year ago

Caveat: (at the time of this writing) you need to select the region you want to sort first, or the command does nothing.

I don't think that is a caveat ... the plugin is built around the concept of acting on a selection, specifically a column-selection.

do people think this request is too niche to bother including as a default feature, or should it be added anyway

I think it would be best as a core N++ feature.

molsonkiko commented 1 year ago

I guess the other question is: does it make more sense to simply replace the existing lexicographic sort with a locale-sensitive sort, or add a new locale-sensitive sort on top of the current lexicographic one?

alankilborn commented 1 year ago

does it make more sense to simply replace the existing lexicographic sort with a locale-sensitive sort, or add a new locale-sensitive sort

I'm sure if the new one replaces the old one, someone will be impacted that they can't do the same sort they used to. I'd suggest adding, not replacing.

Coises commented 1 year ago

Just a note, for what it might be worth:

When I added sort functions to Columns++, my purpose was to deal with the fact that Notepad++ sort with a rectangular selection doesn't work when there are tabs in the file. I thought of suggesting a change in Notepad++ and offering a pull request, but I realized the entire sorting strategy used by Notepad++ would have to change. My impression is that the existing sort is meant to be reasonably efficient even for very large files. That was not a priority for me, while working with tabs was.

I came up with the "locale" sort when I started to look into exactly what I'd need to do to make a "case insensitive" sort. Outside the ASCII range, that concept isn't well-defined without specifying a locale; knowing the code page isn't good enough. That led me to the code I use here. (At present, options is always LCMAP_SORTKEY | NORM_LINGUISTIC_CASING | LINGUISTIC_IGNORECASE | SORT_DIGITSASNUMBERS.) This method fit sensibly with what I was already doing (storing sort keys as members of objects in a vector, with the full line data stored elsewhere), but it wouldn’t mesh with the method Notepad++ uses.

I plan at some point to add a user option to select a locale other than the user default, and to change some of the other options associated with the derivation of the sort key. Right now I think the sort I use is the same as Windows uses to sort filenames in Windows Explorer.

One other thing... no sort based on Windows locale sorting is anything like what people think of as a "case sensitive" sort. When you don't specify LINGUISTIC_IGNORECASE (or NORM_IGNORECASE), the sort still uses the relevant alphabetical order as the primary sort; only when every letter and symbol are equal except for case (as defined in that locale) does the sort distinguish case. The familiar ASCII order with all the capitals ahead of all the lower case does not exist in a locale-sensitive context.

molsonkiko commented 1 year ago

Thanks, @Coises ! I was going to add lexicographic ignore-case culture-sensitive sorting as an option in addition to the existing case-insensitive sorting, and probably just use your code as a jumping-off point. Unfortunately I have another PR already open for the other issue the OP raised (namely, the EOL at EOF being sorted to the top of the file), and I can't start working on implementing this new sort until something is done on my other PR.

alankilborn commented 1 year ago

I can't start working on implementing this new sort until something is done on my other PR

Why's that?

Coises commented 1 year ago

does it make more sense to simply replace the existing lexicographic sort with a locale-sensitive sort, or add a new locale-sensitive sort

I'm sure if the new one replaces the old one, someone will be impacted that they can't do the same sort they used to. I'd suggest adding, not replacing.

My personal feeling is that a case-sensitive locale sort is useless, and a case-insensitive sort that isn’t locale-aware is incoherent (though if you only use ASCII, you’d never notice). As a practical matter, there are already a lot of sorts on that menu... so while I’d lean towards replacing just the “Ignoring Case” sorts with locale-aware sorts, picking the right terminology so users don’t get confused could be challenging.

As I mentioned, though, there’s also the problem that the sorting strategy used by Notepad++ is not readily adaptable to locale-based sorting (nor to rectangular selections when there are tabs in the file).

molsonkiko commented 1 year ago

I can't start working on implementing this new sort until something is done on my other PR

Why's that?

Because I was being silly and forgot that I could just create a new branch on my fork that was independent of the branch I was using for my other PR. 😝

I've gotten back to work on the culture sensitive locale sort, and results so far are encouraging.

@Coises , when I was studying your code in Columns++'s Sort.cpp, I couldn't help but notice that you use SORT_DIGITSASNUMBERS, which I looked up in the relevant documentation. When I tried to use it in my fork of notepad_plus_plus, though, I couldn't use it. I'm on Windows 10, and #include <WinNls.h> didn't make it visible. What do you think might be going on?

Coises commented 1 year ago

@molsonkiko Just a guess; from WinNls.h:

//  Sort digits as numbers (ie: 2 comes before 10)
#if (WINVER >= _WIN32_WINNT_WIN7)
#define SORT_DIGITSASNUMBERS      0x00000008  // use digits as numbers sort method
#endif // (WINVER >= _WIN32_WINNT_WIN7)

Is it possible that you're compiling for a Windows version lower than that?

molsonkiko commented 1 year ago

Is it possible that you're compiling for a Windows version lower than that?

In retrospect, yes, obviously. Notepad++ would definitely be compiling for Windows 7 because it's been around for so long.

Coises commented 1 year ago

@molsonkiko Hmmm... I see: <WindowsTargetPlatformVersion>10.0</WindowsTargetPlatformVersion> in Notepad++ and in your fork. Is it possible you have something different on your development machine? (It is also possible that I don't know what I'm looking at...)

rdipardo commented 1 year ago

I see: <WindowsTargetPlatformVersion>10.0</WindowsTargetPlatformVersion> in Notepad++ and in your fork. Is it possible you have something different on your development machine? (It is also possible that I don't know what I'm looking at...)

Despite the name, it actually refers to the SDK version. Visual Studio's corresponding option is more explicit:

vs-win-sdk-ver-selection

It determines the compiler's header search path and the linker's library path.

You could think of it as the maximum supported Windows version, since every new feature goes into the SDK before it's released in the next build of the OS.

Only developers of bleeding-edge apps need to care about it. Notepad++ runs almost entirely on core system libraries that have barely changed in 35 years.

molsonkiko commented 1 year ago

@rdipardo , thanks for explaining how that works.

I'm not 100% sure what that means about the possibility of including SORT_DIGITSASNUMBERS, though. I assume based on this discussion that we would have to change the Notepad++ core vcxproj file.

Coises commented 1 year ago

@molsonkiko Check the setting @rdipardo illustrated in the development environment where you couldn't compile using SORT_DIGITSASNUMBERS.

If that doesn't explain it, follow LINGUISTIC_IGNORECASE to WinNls.h (right-click and choose Go To Definition) and scroll until you see a preprocessor statement involving WINVER. Hover over WINVER and see what the tooltip says is its value. It must be at least 0x0601 for SORT_DIGITSASNUMBERS to work. If it's not, and the Windows SDK Version setting doesn't explain it... then there's something to figure out, I guess.

molsonkiko commented 1 year ago

Well, I tried playing with those settings, and it didn't help.

So just on a stupid whim, I just inserted #define SORT_DIGITSASNUMBERS 0x00000008 in Common.h (copied from WinNls.h per Coises' suggestion), and wouldn't you know, it worked.

Now Sort Lines Lex. Asc. Culture-sensitively ignoring case sorts

11
2

to

2
11

Hooray! 🎉
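The effect the WinNls.h comment describes ("2 comes before 10") can be illustrated with a hand-written comparator. This is not how Windows implements SORT_DIGITSASNUMBERS; it is a portable sketch with hypothetical names that treats runs of ASCII digits as numbers and compares everything else byte by byte.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

bool isAsciiDigit(char c) { return c >= '0' && c <= '9'; }

// Returns the index one past the run of ASCII digits starting at pos.
std::size_t endOfDigits(const std::string& s, std::size_t pos)
{
    while (pos < s.size() && isAsciiDigit(s[pos]))
        ++pos;
    return pos;
}

// Compares two strings, treating each run of digits as a single number.
bool digitsAsNumbersLess(const std::string& a, const std::string& b)
{
    std::size_t i = 0, j = 0;
    while (i < a.size() && j < b.size()) {
        if (isAsciiDigit(a[i]) && isAsciiDigit(b[j])) {
            const std::size_t endA = endOfDigits(a, i);
            const std::size_t endB = endOfDigits(b, j);
            // Skip leading zeros so "007" and "7" count as the same number.
            while (i < endA && a[i] == '0') ++i;
            while (j < endB && b[j] == '0') ++j;
            const std::size_t lenA = endA - i;
            const std::size_t lenB = endB - j;
            if (lenA != lenB)
                return lenA < lenB;  // more significant digits means a larger number
            const int cmp = a.compare(i, lenA, b, j, lenB);
            if (cmp != 0)
                return cmp < 0;      // same length: digit-wise order is numeric order
            i = endA;
            j = endB;
        } else {
            const unsigned char ca = static_cast<unsigned char>(a[i]);
            const unsigned char cb = static_cast<unsigned char>(b[j]);
            if (ca != cb)
                return ca < cb;
            ++i;
            ++j;
        }
    }
    // Whichever string ran out first is the smaller one.
    return (a.size() - i) < (b.size() - j);
}
```

Passing `digitsAsNumbersLess` to `std::sort` puts 2 before 11, as in the result above.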

rdipardo commented 1 year ago

from WinNls.h:

//  Sort digits as numbers (ie: 2 comes before 10)
#if (WINVER >= _WIN32_WINNT_WIN7)
#define SORT_DIGITSASNUMBERS      0x00000008  // use digits as numbers sort method
#endif // (WINVER >= _WIN32_WINNT_WIN7)

Is it possible that you're compiling for a Windows version lower than that?

As it turns out, yes: see how the property sheet defines _WIN32_WINNT:

https://github.com/notepad-plus-plus/notepad-plus-plus/blob/c76f178534bba518f263c8120b732801e13c4916/PowerEditor/visual.net/notepadPlus.Cpp.props#L28

_WIN32_WINNT_VISTA is less than _WIN32_WINNT_WIN7, causing #if (WINVER >= _WIN32_WINNT_WIN7) to evaluate to false, so SORT_DIGITSASNUMBERS is never defined.

Swapping in _WIN32_WINNT_WIN7 lets @molsonkiko's fork compile without any kludge. Adding a #define that might someday be duplicated is just an accident waiting to happen.

Since N++ dropped Vista in 8.4.7 (or rather, Microsoft did), it's a good time to ask if _WIN32_WINNT can target Win7 instead.

molsonkiko commented 1 year ago

Thanks, @rdipardo ! Not sure I would have known where to look for _WIN32_WINNT without your help!

A reasonable compromise that doesn't require changing anything globally would be to just define const int sort_digitsasnumbers = 8; inside the function that uses it. It should be optimized away at compile time anyway.
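Another variant of the same compromise, sketched here as an assumption rather than as what the PR does: keep the macro name but guard the fallback, so it can never collide with a definition the SDK header provides once `_WIN32_WINNT` is raised. The value is the one from the WinNls.h excerpt quoted earlier; `digitsAsNumbersFlag` is a hypothetical helper.

```cpp
#include <cassert>

// Fallback only: takes effect when the SDK headers did not define the flag
// (for example because WINVER is below _WIN32_WINNT_WIN7).
#ifndef SORT_DIGITSASNUMBERS
#define SORT_DIGITSASNUMBERS 0x00000008  // value copied from WinNls.h
#endif

// Hypothetical helper so callers do not depend on the macro directly.
constexpr unsigned long digitsAsNumbersFlag()
{
    return SORT_DIGITSASNUMBERS;
}

// Catches a future SDK that defines the name with a different value.
static_assert(digitsAsNumbersFlag() == 0x8UL, "unexpected SORT_DIGITSASNUMBERS value");
```

The guard only makes the name available at compile time. Whether the flag is honoured still depends on the Windows version the program runs on.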

donho commented 1 year ago

@grzegorj

please stop discuss about the motivation, and just fix the bug of the program.

A lot of people including @rdipardo are volunteers here to help people and try to provide a better or even best code/text editor to you with no charge. So nobody owe you a "stop discuss about the motivation, and just fix the bug of the program".

grzegorj8 commented 1 year ago

A lot of people including @rdipardo are volunteers here to help people and try to provide a better or even best code/text editor to you with no charge. So nobody owe you a "stop discuss about the motivation, and just fix the bug of the program".

I am very grateful to people who want to help others, and that includes NotePad++ developers. However, I refuse to be told that NOTEpad isn't for taking notes, but that it's purely a source code editor (as its name implies, bravo!). The fact that someone gives their time and skills for free does not give them the right to despise others, mock them and make fools of them.

ArkadiuszMichalski commented 1 month ago

In the case of the Polish language, the expected result from the first post is what should be returned. Until now I hadn't noticed that Notepad++ was sorting my native language incorrectly (or rather, not the way I wanted it to). We currently have a lot of sorting commands, and adding more may not be acceptable.

I'm planning to make a plugin where I will put some alternative methods compared to those in Notepad++ (mainly faster variants), but this alternative sorting would certainly be useful in it.

grzegorj8 commented 1 month ago

@ArkadiuszMichalski According to you, the sequence laka mika mąka łąka means that Notepad++ sorts Polish words correctly? Are you kidding? Or do you not understand what “sort” means?

Oh please... stop this stupid game! Notepad++ cannot sort at all. I reported this bug a year and a half ago and nothing has changed. The EOL bug (“one could easily be fixed”) has not been fixed either.

Virtually nothing has happened, except for the comments of a certain insolent ignoramus who insisted that Note-pad is not for taking notes (and for this you need Word instead) and that programmers need to sort program lines, and the backtalk of a certain boor who told me to shut up because he was trying to solve “my” problem “for me”. I had enough of this primitive boor and didn't say anything more, and what has he done for a year and a half? He has done absolutely nothing!

The problem is not mine, but every user’s, and the fact that someone devotes their free time and participates in the development of the program does not give them the right to talk in this way. He can talk this way to his dog, or maybe his girlfriend, but if she’s smart, she’ll shoot him in the face and knock his teeth out for such words.

This is the first time I’ve encountered a situation where, in response to a report of a program error, the programmers respond with a personal attack on me or try to lecture me that they do something for free, so they can do whatever they want, and I have to beg them on my knees to kindly solve a bug in the program they're working on. Is this a madhouse or a forum about the Notepad++ program?

Or maybe the programmers working here simply have no idea how to write a procedure that correctly sorts text? If they don't understand that Notepad is for editing notes and other texts, and not just for writing program code, then I'm not surprised that they can't write a procedure that correctly sorts texts. I can't do that either, but at least I'm saying it straight and not telling anyone what to use or not to use a program that is, as its name suggests, for taking notes. And they’re trying to take out their frustration on me?

Gentlemen, a little more culture would be useful. And remember that if you do something for free, it's not for yourself, but for others. Instead of barking at them, listen to what they have to say and try to solve the problems they report, and don't bark at them like rabid dogs. I recommend buying a personal culture handbook and reading a bit to learn how to react to other people's comments.

grzegorj8 commented 1 month ago

I am not used to working with people who do not respect me at all and turn to me like a dog. I also feel very uncomfortable when I am about to help people who say that the _note_book is not used to make notes, or that the order of “laka mika mąka łąka” means (“for him”) that the words are sorted correctly – because I start to think about the intellectual levels of such people. And I do not want to do that.

Nevertheless, since absolutely no one here can write a sorting procedure, I will try to help despite the complete lack of my programming skills.

Some time ago I wrote a program in PHP which correctly sorts Polish texts. It could certainly be written differently and simpler. Nevertheless, since no one can even write it, it could be useful to someone.

`<?php

//Test polskich znaków

// Deklaracje stałych $n=chr(13).chr(10); define("NL",$n); define("SP",' '); define("TB",chr(9));

// Zmienną $sledz ustawiamy na FALSE, ale na TRUE dla debugowania programu $sledz=TRUE;

// Inicjujemy dwuwymiarową tablicę tlinia, inaczej wyskoczy błąd przy pierwszym wywołaniu funkcji in_array $tlinia[0][0]=TB; $tlinia[1][0]=' ';

// Dla celów debugowania otwieramy plik debug.txt if ($sledz) { $dplik='debug.txt'; $plik2=fopen($dplik,'w') or exit("Otwarcie pliku $dplik było niemożliwe
"); flock($plik2, LOCK_EX); }

// Otwieramy plik alfabetu al.txt $nplik='al.txt'; $plik=fopen($nplik,'r') or exit("Otwarcie pliku $nplik było niemożliwe
"); flock($plik, LOCK_SH);

// Wczytujemy kolejne niepuste linie aż do końca pliku // Znaki podane w pierwszej linii mają taką samą pozycję w alfabecie jak spacja $i=0; while (!feof($plik)) { $linia=trim(fgets($plik, 512));

// Interesują nas tylko linie zawierające jakąkolwiek zawartość,
// oraz linia pierwsza i druga (nawet gdy są puste)
if ((strlen($linia) > 0)||($i==0)||($i==1))
    {
    // Dwuwymiarowa tablica $tlinia przechowuje wszystkie znaki alfabetu
    // Pierwszy indeks oznacza numer znaku w tablicy
    // Drugi indeks określa kolejną literę mającą taką samą pozycję
    // Tylko linie niepuste
    if (strlen($linia) > 0)
    {
  $tlinia[$i]=explode(' ',$linia);
  }
    // Tabulacja i spacja nie są ujęte w pliku al.txt, dlatego...
    if ($i==0)
    {
    // dodajemy je ręcznie do tablicy, o ile ich tam nie ma
    if (!(in_array(TB,$tlinia)))
    {
    $tlinia[0][0]=TB;
    };
    if (!(in_array(' ',$tlinia)))
    {
    $tlinia[1][0]=' ';
    }
  }
    // Debugowanie: zapisujemy tablicę $tlinia do pliku debug.txt
    if ($sledz)
    {
    foreach($tlinia[$i] as $k=>$w)
    {
    fputs($plik2,'$tlinia['.$i.']['.$k.']=');
    fputs($plik2,$tlinia[$i][$k]);
fputs($plik2,NL);
    }
  }
    $i++;
    }
}

flock($plik, LOCK_UN); fclose($plik);

// Koniec debugowania if ($sledz) { flock($plik2, LOCK_UN); fclose($plik2); }

// Tworzymy tablicę $pozycja, która literom przypisuje określone pozycje alfabetu // Funkcja count zlicza ilość elementów w tablicy $i_maks=count($tlinia); for ($i=0;$i<$i_maks;$i++) { $j_maks=count($tlinia[$i]); for ($j=0;$j<$j_maks;$j++) { $pozycja[$tlinia[$i][$j]]=$i; } } // Ustalamy długość pola cyfrowego w reprezentacji // posługując się logarytmem dziesiętnym ilości elementów tablicy ;-) // zaokrąglonym w górę (funkcja ceil) $dlugpol=ceil(log10($i_maks)); $pole=''; for ($i=0;$i<$dlugpol;$i++) $pole.='0'; $dlugpol*=(-1);

/ Otwieramy plik danych do posortowania, wczytujemy rekordy i tworzymy ich reprezentacje /

$nplik='we.txt'; $plik=fopen($nplik,'r') or exit("Otwarcie pliku $nplik było niemożliwe
"); flock($plik, LOCK_SH); $i=0; while (!feof($plik)) { $linia=trim(fgets($plik, 1024)); if (strlen($linia) > 0) { $liniads[$i]=$linia; $liniarp[$i]=''; $dlugosc=strlen($linia); for ($j=0;$j<$dlugosc;$j++) { $znacznik=$pole.$pozycja[$linia[$j]]; $znacznik=substr($znacznik,$dlugpol); $znacznik=strtr($znacznik,"0123456789","abcdefghij"); $liniarp[$i].=$znacznik; } $i++; } } flock($plik, LOCK_UN); fclose($plik);

asort($liniarp);

$plik=fopen('wy.txt','w'); flock($plik, LOCK_EX); $imaks=count($liniarp); reset($liniarp); for ($i=0;$i<$imaks;$i++) { fputs($plik,$liniads[key($liniarp)]); fputs($plik,NL); next($liniarp); } flock($plik, LOCK_UN); fclose($plik); echo 'Operacja wykonana, wynik sortowania zapisano w pliku wy.txt'; ?> `

The comments were originally written in Polish, so use Google Translate on any you cannot read. The program needs the Polish alphabet in a separate file:

` 

a A á Á ä Ä ą Ą b B c C č Č ć Ć d D ď Ď e E é É ě Ě ę Ę f F g G h H i I í Í j J k K l L ł Ł m M n N ň Ň ń Ń o O ö Ö ó Ó p P q Q r R ř Ř s S š Š ś Ś t T ť Ť u U ú Ú ü Ü v V w W x X y Y ý Ý z Z ž Ž ź Ź ż Ż 0 1 2 3 4 5 6 7 8 9

`

Use it if you find it useful.
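The representation that the script builds can be condensed into a few lines. The sketch below is a Python restatement, not the attached script: it assumes the Polish order quoted in the issue report and treats upper and lower case alike, and `POLISH`, `position` and `representation` are names made up for the illustration. Python compares strings as text, so the digit-to-letter translation from the PHP code is not needed here.

```python
import math

# Alphabet order as quoted in the issue report (assumed complete for this sketch).
POLISH = "aąbcćdeęfghijklłmnńoópqrsśtuvwxyzźż"

# Map every letter, in both cases, to its position in the alphabet.
position = {}
for index, letter in enumerate(POLISH):
    position[letter] = index
    position[letter.upper()] = index

# Field width: enough decimal digits for the largest index (log10 rounded up).
width = max(1, math.ceil(math.log10(len(POLISH))))


def representation(line):
    """Concatenate zero-padded alphabet positions, one field per character.

    Characters missing from the alphabet raise KeyError in this sketch.
    """
    return "".join(str(position[ch]).zfill(width) for ch in line)


words = ["mika", "łąka", "laka", "mąka"]
print(sorted(words, key=representation))
```

Because every field has the same width, comparing two representations as plain strings gives the same result as comparing the lines letter by letter in alphabet order.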

grzegorj8 commented 1 month ago

al.txt lexsort.php.txt

The program is highly imperfect. It can only sort the whole text of a file called we.txt, placed in the same folder, and it saves the result to the file wy.txt. For now, it handles the Polish language only. The sorting process uses two steps. In the first step, lines are sorted according to the position of each character in the alphabet; “A” and “a”, for example, are treated the same. The second step applies only to lines that compare equal in the first step: there, “a” is sorted before “A”.
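As a rough illustration of the two steps just described, here is a Python sketch (not the attached PHP code). It assumes the alphabet order given in the issue report; `RANK` and `two_step_key` are hypothetical names.

```python
POLISH = "aąbcćdeęfghijklłmnńoópqrsśtuvwxyzźż"
RANK = {letter: index for index, letter in enumerate(POLISH)}


def two_step_key(line):
    # Step 1: alphabet positions with case ignored, so "A" and "a" are equal.
    # Characters outside the alphabet all share one rank after the last letter.
    primary = [RANK.get(ch.lower(), len(POLISH)) for ch in line]
    # Step 2: consulted only when step 1 ties; False < True puts lowercase first.
    secondary = [ch.isupper() for ch in line]
    return (primary, secondary)


words = ["Łąka", "mąka", "łąka", "Laka", "laka"]
print(sorted(words, key=two_step_key))
# ['laka', 'Laka', 'łąka', 'Łąka', 'mąka']
```

Returning a tuple lets the tie-break happen in a single pass: Python compares the second element only when the first elements are equal.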

The program can be applied to other languages, but it may need further modifications. The problem is that, for example, the German ß should be treated exactly like the sequence “ss” in the first step (but not in the second), that is, one character counts as two. There are also languages in which two characters (such as “ch” or “dz”) should be treated as a single character when sorting.
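Both cases can be handled in the first step by splitting each word into sortable units before looking up positions. The following Python sketch is only an illustration under simplifying assumptions: the alphabets are cut down (no diacritics), and `elements`, `CZECH` and `GERMAN` are made-up names, not real collation data.

```python
def elements(word, alphabet, expansions):
    """Split a word into sortable units and return their alphabet positions."""
    # Expansion: replace one character by several before any lookup.
    text = "".join(expansions.get(ch, ch) for ch in word.lower())
    # Contraction: try the longest alphabet entries first, so "ch" wins over "c".
    units = sorted(alphabet, key=len, reverse=True)
    ranks = []
    i = 0
    while i < len(text):
        for unit in units:
            if text.startswith(unit, i):
                ranks.append(alphabet.index(unit))
                i += len(unit)
                break
        else:
            ranks.append(len(alphabet))  # unknown characters sort last
            i += 1
    return ranks


# Simplified Czech-like alphabet: "ch" is a letter between "h" and "i".
CZECH = list("abcdefgh") + ["ch"] + list("ijklmnopqrstuvwxyz")
GERMAN = list("abcdefghijklmnopqrstuvwxyz")

print(sorted(["index", "chata", "hrad", "cesta"], key=lambda w: elements(w, CZECH, {})))
# ['cesta', 'hrad', 'chata', 'index']
print(elements("Maße", GERMAN, {"ß": "ss"}) == elements("Masse", GERMAN, {"ß": "ss"}))
# True
```

Since “Maße” and “Masse” come out equal here, a second step is still needed to separate them, which matches the remark that ß equals “ss” only in the first step.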

Coises commented 1 month ago

@grzegorj8:

Or maybe the programmers working here simply have no idea how to write a procedure that correctly sorts text? [...] I can't do that either, but at least I'm saying it straight

Would you check and see if the Locale sort in the Columns++ plugin (which I wrote) works as you expect a sort to work? You can install it from Plugins | Plugins Admin... if you’re using a recent version of Notepad++.

The sort is based on the default sort for your current Windows locale. If you require a different sort, use the Sort... dialog rather than the Sort ascending/descending (locale) commands. (Some languages have more than one sort; German, for example, has distinct “Dictionary” and “Phone Book” sorts.) In the dialog, be sure to select Sort type: Locale and set the Locale sort details at the bottom.
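The idea of sorting by the default locale while allowing an override can be sketched with Python's standard `locale` module. This is a stand-in for illustration only and says nothing about how Columns++ is implemented; `locale_sorted` is a hypothetical helper, and locale names such as `pl_PL.UTF-8` are platform-dependent and may not be installed.

```python
import locale


def locale_sorted(lines, locale_name=""):
    """Sort lines using the collation of locale_name ("" means the user's default)."""
    previous = locale.setlocale(locale.LC_COLLATE)  # remember the current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        pass  # requested locale is not available: keep the current collation
    try:
        # strxfrm turns each string into a key that compares in locale order.
        return sorted(lines, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)


words = ["mika", "mąka", "łąka", "laka"]
print(locale_sorted(words))                 # default locale
print(locale_sorted(words, "pl_PL.UTF-8"))  # explicit override, if installed
```

The output depends on which locales the system provides, so no expected result is shown.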

For the example you gave, the sort works as you say it should even if I use the “English - United States” sort, which is default for me; however, it might be that some letters only sort correctly using a Polish locale.

I know this isn’t the final solution you want. However, it is not without precedent that sometimes features first introduced in a plugin make their way into Notepad++. The sort in Columns++ is implemented quite differently than the sort in Notepad++, and I suspect it will take some “battle testing” before such a major change would be considered for inclusion in the base program.

The sorts in Notepad++ work the way people who are used to sorting plain ASCII expect them to work. When studying to implement my sorts, I concluded that outside of pure ASCII, a case-insensitive sort doesn’t make any sense unless it takes locale into consideration, which adds the complexity of giving the user the option to override the default locale when necessary. There is no single, correct sort order that works for all locales. While a locale-based case-sensitive sort is possible and well-defined, it doesn’t work the way ASCII users expect a case-insensitive sort to work, and it’s not often of much use (since case in a locale-based sort only affects the sort order of strings which have equal sort keys when case is ignored).
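To make the ASCII-style behaviour concrete: a plain comparison by character code reproduces the order reported at the top of this issue. The Python sketch below only demonstrates that effect; it is not a claim about how Notepad++ compares lines internally.

```python
words = ["laka", "łąka", "mąka", "mika"]

# Python compares strings by Unicode code point, much like a sort by character code.
print(sorted(words))
# ['laka', 'mika', 'mąka', 'łąka']

# "ą" (U+0105) and "ł" (U+0142) have higher codes than every ASCII letter,
# so they land after "z" instead of next to "a" and "l".
print(ord("z") < ord("ą") < ord("ł"))
# True
```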

Again, I know it’s not the ultimate solution you want, but would you consider testing the locale sorts in Columns++ to see if they work as expected? (I only know English, so I’m just going on documentation. I have no way to know if my results are really correct.) It might be a practical first step toward getting the same principles deployed in Notepad++.

grzegorj8 commented 1 month ago

Oh, thank you! I did not know about this plugin. It works OK with Polish text, splendid work! Really, I'm not that demanding... for me, it works fine.

So perhaps it is really time to transfer the function of this plugin into the program itself. But first the EOL bug (I mentioned at the beginning) should be fixed (it is still intact after a year and a half), and the column mode could be removed (I am not sure what it is for, but it is not very harmful).

I will do more tests in my free time, and if I don't let you know, it will mean that everything is fine, OK?

Thanks again!

alankilborn commented 1 month ago

But first the EOL bug (I mentioned at the beginning) should be fixed (it is still intact after a year and a half),

If this is what I think it is, the author has stated several times he does not feel it is a bug. And that is the end of the story; it won't get "fixed" if the author doesn't agree with it.

grzegorj8 commented 1 month ago

It seems that (at least in v8.6.7) the bug has been fixed for the case where you select the lines you want to sort. But if you select nothing and choose the sort option, the last line loses its EOL (CR LF, 0Dh 0Ah), which is moved to the beginning of the file.

Try the sequence I spoke about at the beginning: laka łąka mąka mika

(with ENTER at the end)

If you select all the text and then choose Sort, the program will sort the text by ASCII codes. Then clear the selection and sort again. The first line will become the second one, and the last one will lose its EOL.

Is it normal and expected to move EOL from the end to the beginning? Ooops...
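One plausible explanation for this behaviour, offered as an assumption and not confirmed from the Notepad++ source, is that the text after the final EOL is treated as one more (empty) line, which then sorts first. The Python sketch below reproduces the effect and shows one way to keep the final EOL in place; both function names are hypothetical.

```python
EOL = "\r\n"


def sort_whole_buffer(text):
    # Splitting on EOL leaves an empty last piece after the final EOL;
    # the empty string sorts before everything else.
    pieces = text.split(EOL)
    return EOL.join(sorted(pieces))


def sort_keeping_final_eol(text):
    # Set the final EOL aside, sort the real lines, then put it back.
    has_final_eol = text.endswith(EOL)
    body = text[:-len(EOL)] if has_final_eol else text
    return EOL.join(sorted(body.split(EOL))) + (EOL if has_final_eol else "")


text = "laka\r\nłąka\r\nmąka\r\nmika\r\n"
print(repr(sort_whole_buffer(text)))
# '\r\nlaka\r\nmika\r\nmąka\r\nłąka'
print(repr(sort_keeping_final_eol(text)))
# 'laka\r\nmika\r\nmąka\r\nłąka\r\n'
```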

alankilborn commented 1 month ago

Is it normal and expected to move EOL from the end to the beginning?

This is what the author wants, although I have said a few times it is really wrong behavior. Perhaps with another person (you @grzegorj8) really complaining about it, the author will be influenced to change his mind about it.