jjk-jacky / natsort

Natural Sorting utility
GNU General Public License v3.0

Errors when sorting text with CRLF line endings #1

Open AlisterH opened 6 years ago

AlisterH commented 6 years ago

Hi, I realise you're not actively developing natsort, but I figure it is at least worth filing this for the record:

Natsort certainly gives better results than coreutils sort -V, or anything I can get out of msort. But it seems that it makes some mistakes when the input is from a file with Windows line endings (i.e. CRLF) - see below.

Or perhaps there is something obvious that I'm missing (I guess something about what cat does)? It seems that coreutils sort -V also gives worse results when operating on CRLF files. The Python natsort doesn't seem to.


$  cat test3|/usr/local/bin/natsort
Callisto Morphamax 6000 SE
Callisto Morphamax 6000 SE2

$  cat test.txt|/usr/local/bin/natsort

Callisto Morphamax 6000 SE2
Callisto Morphamax 6000 SE

$  diff test3 test.txt
1,2c1,2
< Callisto Morphamax 6000 SE
< Callisto Morphamax 6000 SE2
---
> Callisto Morphamax 6000 SE
> Callisto Morphamax 6000 SE2

$  file test3
test3: ASCII text

$  file test.txt
test.txt: ASCII text, with CRLF line terminators

$ python -c "print(repr(open('test.txt').read()))"
'Callisto Morphamax 6000 SE\r\nCallisto Morphamax 6000 SE2\r\n'

$ python -c "print(repr(open('test3').read()))"
'Callisto Morphamax 6000 SE\nCallisto Morphamax 6000 SE2\n'

jjk-jacky commented 6 years ago

Hey,

Right, I think that's because natsort does indeed expect LF-terminated lines, and only strips the LF at the end of each line.

So when it is in fact given lines ending in CRLF, the lines are treated not as e.g. "foo" & "foo2" but as "foo\r" and "foo2\r" -- causing the results you see.
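
The effect can be reproduced with a minimal sketch. The `natural_key` helper below is hypothetical and is not natsort's actual comparison code; it assumes a simple scheme that splits each line into text runs and digit runs and compares the digit runs as integers.

```python
import re

def natural_key(line):
    # Hypothetical key, NOT natsort's real algorithm: re.split with a
    # capturing group yields text runs at even indices and digit runs
    # at odd indices, so digit runs can be compared numerically.
    parts = re.split(r"(\d+)", line)
    return [int(p) if i % 2 else p for i, p in enumerate(parts)]

lf_lines = ["Callisto Morphamax 6000 SE2", "Callisto Morphamax 6000 SE"]
# What is left of a CRLF line once only the trailing "\n" is stripped.
crlf_lines = [line + "\r" for line in lf_lines]

sorted_lf = sorted(lf_lines, key=natural_key)
sorted_crlf = sorted(crlf_lines, key=natural_key)
print(sorted_lf)    # "SE" line first
print(sorted_crlf)  # "SE2\r" line first
```

Under this scheme the leftover "\r" becomes part of the final text run, so " SE\r" compares greater than " SE" and the "SE2" line ends up first.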

I guess you could patch natsort to also remove any trailing \r, though it should be noted that your results would then be "converted" to LF line endings. (Or you'd need a flag/option for CRLF line endings.)
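
As a rough illustration of that idea (in Python, not a patch against natsort's own source), the terminator can be ignored in the sort key only, which leaves each line's original ending untouched. The names `natural_key` and `sort_keep_endings` are made up for this sketch, and the key is a simple text/digit split that is only assumed to resemble natural ordering.

```python
import re

def natural_key(line):
    # Hypothetical key, not natsort's real algorithm. The trailing
    # "\r" and "\n" are dropped for comparison purposes only.
    text = line.rstrip("\r\n")
    parts = re.split(r"(\d+)", text)
    return [int(p) if i % 2 else p for i, p in enumerate(parts)]

def sort_keep_endings(data):
    # Split after each "\n", keeping the terminator ("\n" or "\r\n")
    # attached to its line. Assumes every line is terminated.
    lines = re.findall(r"[^\n]*\n|[^\n]+", data)
    return "".join(sorted(lines, key=natural_key))

crlf_text = "Callisto Morphamax 6000 SE2\r\nCallisto Morphamax 6000 SE\r\n"
print(repr(sort_keep_endings(crlf_text)))
```

When reading from a file in Python 3, the file would need to be opened with `newline=""` so that the "\r\n" sequences are not translated to "\n" before the text reaches the function.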

Might be better/simpler though to simply convert your files to LF line endings before sorting.
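
A minimal sketch of such a conversion, working on raw bytes so that nothing else in the file is touched. The function names and the output filename in the usage comment are placeholders, not part of natsort.

```python
from pathlib import Path

def crlf_to_lf(data: bytes) -> bytes:
    # Replace each CRLF pair with a single LF; a lone "\r" is left alone.
    return data.replace(b"\r\n", b"\n")

def convert_file(src, dst):
    # Hypothetical helper: write an LF-only copy of src to dst.
    Path(dst).write_bytes(crlf_to_lf(Path(src).read_bytes()))

# Example (file names are placeholders):
#   convert_file("test.txt", "test_lf.txt")
```

From the shell, `tr -d '\r' < test.txt | natsort` has a similar effect, although it removes every carriage return rather than only those that precede a line feed.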

Cheers,

AlisterH commented 6 years ago

Hi. Yes, it is easy enough to convert your files, as long as you know that it is necessary. After discussing this elsewhere and looking more closely at the behaviour of sort and other standard *nix tools, I believe that the current behaviour is "correct" because it is consistent with those tools. I guess the issue is that at least the Python implementation of natsort (which also provides a command-line tool called natsort) strips all trailing whitespace before sorting, so it is incompatible in this respect. I think it would be worth adding a note to the documentation, something like this:

This implementation of natsort expects LF line endings and will produce unexpected results if operating on "Windows" format files with CRLF line endings. This differs from other implementations of natsort, which strip all trailing whitespace before sorting.

People could still get into trouble by inadvertently switching to using your implementation, but at least that is less likely if it is documented.