Open alexmaco opened 6 years ago
Dolphin has a very fast natural sorting algorithm implemented in C++, if someone is interested to look that up.
Edit: They first do a classic sort, then the natural sort using a collator. https://web.archive.org/web/20160801141911/https://git.reviewboard.kde.org/r/113485/
That's definitely worth a profile. My hunch is that natord will still show up there somewhat, at least due to the fuse
ed iterator. Nevertheless, the dolphin approach works nicely in either case.
I wonder if we could leverage some of the code in fd to improve the performance of the tree command. This could also solve the problem of correctly respecting gitignore, as mentioned in #136.
I have compared ls
, ls -v
and exa --sort=Name
, exa
in million entry directory which made by these fish script:
for i in (seq 1000000)
touch "a$i"
end
directory result of ls
is equal to exa --sort=Name
and also ls -v
is equal to exa
. this is result:
ls :
real 0m17.456s
user 0m6.311s
sys 0m2.851s
exa --sort=Name :
real 13m37.893s
user 13m30.177s
sys 0m3.222s
ls -v :
real 0m13.038s
user 0m7.716s
sys 0m2.605s
exa :
real 13m42.330s
user 13m27.377s
sys 0m3.202s
I guess this is not serious technical test but I am afraid that the result definitely should not be affected by git status or natural sorting. or even color.
Dolphin has a very fast natural sorting algorithm implemented in C++, if someone is interested to look that up.
Edit: They first do a classic sort, then the natural sort using a collator. https://git.reviewboard.kde.org/r/113485/
Hint: Java uses https://en.wikipedia.org/wiki/Timsort as a default. It is very good when there is some pre-sorted parts and is usually better than any other algorithm. I do not know which sorting algorithm rust uses. It is similar to what you describe above, but I think TimSort is still faster.
I created a test crate and saw huge performance improvements by using a regular sort before using natord (sorting 466851 or 4603 names), but it don’t seem to do any difference when implemented in exa (listing a directory with 4603 files).
I re-read the initial post and believe performance improvements around git implementation should be treated in the library exa is using, libgit2. See libgit2/libgit2/issues/4230.
I don’t really find useful information about efficient natural sorting algorithms, it’s annoying. We could try to use bindings to ICU and use their collator, but it may not be easy to handle correctly locales, and it’s a big dependency.
So, after some quick measurements it turns out that in almost all cases i can test, exa is currently either a bit or a lot slower than ls and tree. Since we've got a pretty good language as a tool, I think becoming actually faster than ls is achievable (or at least par performance).
I've done some profiling and the bottlenecks that seem to be most impactful are:
First off, as far as sorting goes, I've done some profiling, and when producing large amounts of output that has to be sorted with the natord functions, they end up taking a significant chunk of time. I may attach some perf data at a later point, but i don't have an up-to-date dump.
Looking at the natord crate, it becomes clear why this is: it's written quite elegantly but without an explicit performance concern, which ends up being a limiting factor in exa.
I think a better natural sorting implementation can shave at least 30-40% off the runtime for large outputs, but, as before, I don't have up-to-date numbers to back this up right now.
Natord doesn't look to be maintained however. Would you be willing to merge a PR that replaces natord with a faster reimplementation of the same functionality ?
The only alternative i see would be to bug the natord maintainer to merge basically a complete rework of that crate, including interface changes (iterators have quite some unnecessary overhead in the exa use case).