vigna opened this issue 1 year ago
Hello, indeed, you are absolutely right: at about 20 threads, the multithreaded sort becomes slower. I suppose it is because of the memory required by the LSD sort (which needs a copy of the array). I have an AMD Ryzen 3950X (16c/32t) with 32 GB of RAM and, if I remember well, it becomes slower at about 14/16 threads. However, I made the (false) assumption that not a lot of people would use this sort on a CPU with so many threads. I will add all that to the readme, but for now I am on holiday. Next week, probably. It is possible to choose the number of threads, but you have to use the sort function directly instead of the trait:
and you have to use the feature flag voracious_multithread.
I don't know why it does not appear in the documentation.
I will also update the dependency (rayon, needed for the multithreaded version); maybe it will help.
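For reference, enabling that flag in Cargo.toml would look something like this (the feature name comes from the comment above; the crate name and version are assumptions on my part):

```toml
[dependencies]
# feature flag name from this thread; crate name/version are assumptions
voracious_radix_sort = { version = "1.1", features = ["voracious_multithread"] }
```

With the feature enabled, the multithreaded entry point that takes a thread count (voracious_mt_sort, mentioned later in the thread) becomes available.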
Actually, I'm choosing the number of threads now. The problem is that I have no idea which number to choose 😂.
Well, it depends on the type of element to sort (u32, f32, etc.) and on the array size (roughly). I think you also have to set the "block_size". I have done this benchmark: https://github.com/lakwet/voracious_sort/blob/master/results/peeka_sort_vs_regions_sort.md So I think 16 threads is the best (as far as I can remember), but don't set more than what your CPU has. For the block size, maybe use what I set by default for the trait implementation; you can find all the values here:
If you clone my repository, you can play with all the code I have written to benchmark my sort:
When I have time (I don't know when), I would like to improve my sort, because I don't beat the regions sort.
Just out of curiosity: why aren't you (or anyone else in the literature, in fact) comparing with block indirect sort (the default Boost parallel unstable sort method)? It's faster than anything I've tried in C++, sometimes by a very large margin (this includes parallel radix sorts, in particular for 128-bit integers).
Because it is slower. There are benchmarks of Boost's block indirect sort against ips4, and ips4 is faster. And then, between regions sort and ips4, regions sort is faster (in almost every case, if I remember well). As far as I know, regions sort is the SOTA (state of the art).
The regions paper benchmarks the parlay integer sort against IPS4 (on random integers), and parlay is faster. On all hardware I could try, block indirect is faster than parlay. So I'd be really interested in a reference to the benchmarks you're quoting...
I will look at that; maybe I read it too quickly.
I'm sorry, I have to retract that. By a configuration mistake I was compiling the two benchmarks with different optimization options. Parlay is faster than block indirect, although the results of my repeated benchmarks are quite weird:
Block: 446.747303s
Parlay: 494.139010s
Block: 550.888112s
Parlay: 519.433724s
Block: 587.796833s
Parlay: 507.157232s
Block: 563.625938s
Parlay: 492.325763s
This is for 40B 64-bit integers.
OK, now I have a more complete picture. Here I'm just sorting random 64-bit integers. On 1B keys, in C:
Block: 10.877221s
Parlay: 6.785890s
Block: 10.001492s
Parlay: 7.217183s
Block: 9.811406s
Parlay: 5.954816s
Block: 9.721668s
Parlay: 4.069157s
So Parlay is good (but it appears to use some additional memory). On 1B keys, in Rust:
Voracious(20) 8.511659809
Voracious(40) 6.71123186
Rayon 18.086493373
Voracious(20) 7.650926937
Voracious(40) 7.780470804
Rayon 17.084048381
Voracious(20) 7.030447277
Voracious(40) 8.034868047
Rayon 17.482521809
Voracious(20) 7.180784658
Voracious(40) 8.016780112
Rayon 19.440539774
Good, but clearly we're lagging behind the C++ sorts. Switching to 20B keys, however, is really problematic. In C:
Block: 292.837065s
Parlay: 234.913958s
Block: 309.842404s
Parlay: 231.422616s
Block: 253.400154s
Parlay: 228.669215s
Block: 245.309193s
Parlay: 203.161141s
which is consistent with the timings above. In Rust:
Voracious(20) 1038.590991388
Voracious(40) 1052.747777523
This is definitely not good. I could not run Rayon's parallel sort even with a 1 GB stack (set both with ulimit -s and with a thread builder inside the test).
Do you have an idea why voracious scales so badly? Can I do something to help diagnose the problem?
[Both tests run under transparent huge pages and both tests use the same optimization options.]
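For anyone hitting the same stack overflow: one std-only way to give a recursive sort more stack than the default, without touching ulimit, is to run it on a worker thread built with an explicit stack size. This is a sketch of the thread-builder approach mentioned above; sort_unstable() here is just a stand-in for the parallel sort under test.

```rust
use std::thread;

// A deeply recursive sort can overflow the default thread stack on huge
// arrays. Workaround: run it on a thread built with a larger stack.
fn sort_on_big_stack(v: Vec<u64>) -> Vec<u64> {
    thread::Builder::new()
        .stack_size(1 << 30) // request a 1 GiB stack for this thread
        .spawn(move || {
            let mut v = v;
            v.sort_unstable(); // stand-in for the parallel sort under test
            v
        })
        .expect("failed to spawn sorting thread")
        .join()
        .expect("sorting thread panicked")
}

fn main() {
    let v = sort_on_big_stack((0..1_000_000u64).rev().collect());
    assert!(v.windows(2).all(|w| w[0] <= w[1]));
    println!("sorted {} keys", v.len());
}
```

The large stack is only reserved (virtual memory), so asking for 1 GiB is cheap until the recursion actually uses it.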
It turned out that the "in place" sort of Parlay is not in place at all, so please ignore those results. I was finally able to make the Rayon sort work:
Voracious(20) 1054.700242279
Voracious(40) 1004.468386745
Rayon 412.225890118
Voracious(20) 1058.039570514
Voracious(40) 1031.172554921
Rayon 419.484963512
Voracious(20) 1019.229894973
Voracious(40) 1071.464737528
Rayon 398.813977252
Voracious(20) 1033.258482431
Voracious(40) 1096.716848277
Rayon 447.616544883
So Rayon is definitely not bad, albeit not at the same level as C++.
When you say 20B keys, is it 20_000_000_000 keys? I don't have enough RAM for that kind of test.
For unsigned 64-bit integers with a uniform random distribution, I get 2798789 µs (about 2.8 s) for 1_000_000_000 keys with my peeka sort (with 16 threads, if I remember well) on my Ryzen 3950X:
https://github.com/lakwet/voracious_sort/blob/b73d09c8c4e7c65544a6ca7da4749fe6f237d7aa/results/peeka_sort_benchmark_1_0_0_ryzen_9_3950x_voracious_trait_u64#L133
Does your CPU have vectorization?
Yes. Well, you're right, that's unclear, I'll use SI multipliers.
My actual workload is 160G keys (almost a terabyte of RAM), but 20G is sufficient to show the slowdown. I have 40 cores, so too coarse a parallel granularity might be an issue.
If you're interested I can give you ssh access to a server with a TB of RAM.
This is what rustc thinks:
rustc --print cfg -C target-cpu=native
debug_assertions
panic="unwind"
target_arch="x86_64"
target_endian="little"
target_env="gnu"
target_family="unix"
target_feature="aes"
target_feature="fxsr"
target_feature="pclmulqdq"
target_feature="popcnt"
target_feature="sse"
target_feature="sse2"
target_feature="sse3"
target_feature="sse4.1"
target_feature="sse4.2"
target_feature="ssse3"
target_has_atomic="16"
target_has_atomic="32"
target_has_atomic="64"
target_has_atomic="8"
target_has_atomic="ptr"
target_os="linux"
target_pointer_width="64"
target_vendor="unknown"
unix
I can't find a peeka_sort() in the 1.1.1 crate. Is it in development?
For the record, I know my sort does not scale well beyond 16 threads, and I don't know what happens above 1_000_000_000 64-bit keys, because I only have 32 GB of RAM with my Ryzen 3950X (which, for a personal computer, is already pretty decent). But your results interest me. I like a challenge. So 160×10^9 64-bit keys is a very interesting challenge.
I'm interested in access to this server with 1 TB of RAM. :)
OK, just send me by email an ssh key and a preferred login name.
> I can't find a peeka_sort() in the 1.1.1 crate. Is it in development?
It is the multithreaded sort: voracious_mt_sort. Under the hood, the multithreaded sort is the peeka sort, but because of the feature flag it does not appear in the documentation:
https://github.com/lakwet/voracious_sort/blob/b73d09c8c4e7c65544a6ca7da4749fe6f237d7aa/src/lib.rs#L430
Mmmmhhh... there are missing pieces (where's the ssh-rsa?). And I need to know which login name you want.
I just edited the comment with my login name: lakwet. Thanks.
With my work laptop (Lenovo T490s, 2020): rustc --print cfg -C target-cpu=native
debug_assertions
panic="unwind"
target_arch="x86_64"
target_endian="little"
target_env="gnu"
target_family="unix"
target_feature="adx"
target_feature="aes"
target_feature="avx"
target_feature="avx2"
target_feature="bmi1"
target_feature="bmi2"
target_feature="fma"
target_feature="fxsr"
target_feature="lzcnt"
target_feature="pclmulqdq"
target_feature="popcnt"
target_feature="rdrand"
target_feature="rdseed"
target_feature="sse"
target_feature="sse2"
target_feature="sse3"
target_feature="sse4.1"
target_feature="sse4.2"
target_feature="ssse3"
target_feature="xsave"
target_feature="xsavec"
target_feature="xsaveopt"
target_feature="xsaves"
target_has_atomic="16"
target_has_atomic="32"
target_has_atomic="64"
target_has_atomic="8"
target_has_atomic="ptr"
target_os="linux"
target_pointer_width="64"
target_vendor="unknown"
unix
The difference is the AVX; if I remember well, the unrolling uses AVX.
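Besides rustc --print cfg (which reports what the binary is compiled for), the running CPU's actual SIMD support can be probed from std at runtime. A small x86-64-specific sketch:

```rust
// Reports whether the *running* CPU exposes AVX/AVX2, regardless of the
// -C target-cpu flags used at compile time (x86-64 specific; std-only).
fn has_avx() -> (bool, bool) {
    #[cfg(target_arch = "x86_64")]
    {
        (
            std::is_x86_feature_detected!("avx"),
            std::is_x86_feature_detected!("avx2"),
        )
    }
    #[cfg(not(target_arch = "x86_64"))]
    {
        (false, false)
    }
}

fn main() {
    let (avx, avx2) = has_avx();
    println!("avx: {avx}, avx2: {avx2}");
}
```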
OK. Try ssh lakwet@sexus.law.di.unimi.it.
so far:
axelle@axelle-pc:~/.ssh$ ssh lakwet@sexus.law.di.unimi.it
lakwet@sexus.law.di.unimi.it: Permission denied (publickey,gssapi-keyex,gssapi-with-mic).
Well, so far, with a CPU with AVX:
For 5_000_000_000 i64 elements with uniform random distribution
mean of 10 iterations
Voracious sort multithreaded unstable (peeka sort): 28.42s
Rayon parallel sort unstable: 39.47s
OK, but this hardware has 16 cores. sexus has 40; that's where things get interesting.
Indeed
Do you have a computer with ~40 threads and AVX?
No. But the problem is the scaling, not AVX. I can live without AVX, the problem is that when the array gets large time does not scale linearly. I have the impression that in all recursive sorts when you have very large arrays there are a few steps in which there is little or no parallelism (e.g., first levels of QuickSort, or first level of radix sort) and this kills the performance. Block indirect sort is incredibly clever in making the global key shifting completely parallel in a fine-grained way.
Regions sort has the same idea: avoid coarse parallelism on large arrays. Maybe that's the solution. But it's very specific, whereas block indirect sort is very generic. In fact, I think that a block indirect sort using as local sort a radix sort (e.g., ska or voracious) instead of pattern-defeating quicksort could be the fastest you can get.
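To illustrate the fine-grained idea in a few lines: rather than doing the first radix pass over a huge array sequentially, split the array into chunks, histogram each chunk's top byte in parallel, then merge the per-chunk counts. This is only a sketch of the principle, not the actual regions sort or block indirect sort algorithm.

```rust
use std::thread;

// Parallel first-pass MSD histogram: each worker counts top bytes in its
// chunk; the per-chunk histograms are merged afterward. (Illustration only.)
fn parallel_top_byte_histogram(data: &[u64], threads: usize) -> [usize; 256] {
    let chunk = ((data.len() + threads - 1) / threads).max(1);
    let mut total = [0usize; 256];
    thread::scope(|s| {
        let handles: Vec<_> = data
            .chunks(chunk)
            .map(|c| {
                s.spawn(move || {
                    let mut h = [0usize; 256];
                    for &x in c {
                        h[(x >> 56) as usize] += 1; // bucket by most significant byte
                    }
                    h
                })
            })
            .collect();
        for handle in handles {
            let h = handle.join().unwrap();
            for (t, v) in total.iter_mut().zip(h.iter()) {
                *t += *v;
            }
        }
    });
    total
}

fn main() {
    // Pseudo-random 64-bit keys via a multiplicative hash of the index.
    let data: Vec<u64> = (0..1_000_000u64)
        .map(|i| i.wrapping_mul(0x9E37_79B9_7F4A_7C15))
        .collect();
    let h = parallel_top_byte_histogram(&data, 8);
    assert_eq!(h.iter().sum::<usize>(), data.len());
}
```

The expensive step the thread is discussing is not the counting but the subsequent global permutation of keys into their buckets; that is the part block indirect sort and regions sort parallelize in a fine-grained way.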
sexus has 40 cores but no AVX2.
Regions sort is faster than my sort, and it probably scales. Peeka sort is a reimplementation of the regions sort in Rust, but I did not manage to reach the same performance. I tried to use other fallbacks, so it behaves differently, but performance is roughly the same below 1_000_000_000 elements. Concurrency in Rust has more constraints than in C++.
Another question about your needs: is the 16c CPU with AVX slower than the 40c sexus without AVX?
I think it's definitely slower, but I need 1TB of RAM.
I have found the problem. There was, indeed, a bug for arrays whose size is bigger than 5×10^9. I bought some RAM and now I have 96 GB, so I can test arrays of roughly 10×10^9 elements; that is how I figured this out. It is fixed in the next 1.2.0 version of my crate (to be released this evening, I hope). I will write a message here when I release the new version, because I am curious about the result for your 160×10^9 array. However, it still uses AVX, but it should work anyway, just a bit slower, nothing dramatic as it was before.
I also noticed that the best number of threads is probably the number of cores with their own cache memory; the memory requirements make hyperthreading (or equivalent) a bit useless. Maybe if your CPU with 40 cores has no hyperthreading, it should scale (if the cache is not shared too much).
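Note that std reports logical CPUs (hyperthreads included), so following the guess above you would want roughly half that on a 2-way SMT machine. A small std-only sketch; the halving heuristic is my assumption, not something measured in this thread:

```rust
use std::thread;

// available_parallelism() returns the *logical* CPU count. As a crude
// stand-in for the physical-core count on a 2-way SMT machine, halve it
// (heuristic assumption; measure on your own hardware).
fn suggested_sort_threads() -> (usize, usize) {
    let logical = thread::available_parallelism().map(|n| n.get()).unwrap_or(1);
    let physical_guess = (logical / 2).max(1);
    (logical, physical_guess)
}

fn main() {
    let (logical, suggested) = suggested_sort_threads();
    println!("logical CPUs: {logical}, suggested sort threads: {suggested}");
}
```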
Ah, again, if you sort signed integers, there should be an upgrade possible, because currently the multithreaded sort is not customized by primitive type, whereas the single-threaded sort is optimized per type. It is possible to use these optimizations in the multithreaded sort, but, you know, it takes time; I maintain this crate on my personal time.
Hey, no rush: I'm linking the C++ sort using the native interface and I'm perfectly happy. I reported the slowdown because it was really weird (and indeed you found a bug, so that was a useful report!).
Another thing I noticed is that I bought "slower" RAM (timings: 18-22-22-42 vs 16-19-19-39), and apparently faster RAM benefits the Rayon parallel sort, probably because it makes more accesses to RAM. That is also why these benchmarks are not consistent with the previous version: I now have slower RAM. (Benchmarks are running; it takes time...)
Yes, I know, but I like a challenge; it was a good one. And indeed, thanks for the report: a bug was found and fixed.
I published version 1.2.0. I would be very happy if you could test it on your big, big array :)
Fantastic. 130 billion random keys (1 TB of data) in 850 s. That's two times faster than Boost's block indirect sort, and four times faster than Rayon's parallel PDQsort. You nailed it!
OK, admittedly I did not check that the result is correct LOL. 😂
If you want to do some testing with 1TB, I can give you access back again. Let me know...
Oh, I am so happy! I would love to do some tests with 1 TB of RAM. I will send you an SSH public key, but this weekend, because I have a busy week at work. Thanks.
Your account is still there, I just have to enable it. I need the hardware for a few days and then I'll do it.
Sorry, a student is using the hardware for his thesis. I'll let you know when it's available again...
It would be nice if the multithreaded version let you choose the number of threads. I've tried going from 2 to 40, and somewhere after 20 the sort actually becomes slower (I guess because of the cost of thread handling). Or at least some guidelines in the documentation. Thanks for a great crate!