lakwet / voracious_sort

Voracious radix sort
MIT License

Best number of threads? #11

Open vigna opened 1 year ago

vigna commented 1 year ago

It would be nice if the multithreaded version let you choose the number of threads. I've tried going from 2 to 40, and somewhere after 20 the sort actually becomes slower (I guess because of the cost of thread handling). Or at least some guidelines in the documentation would help. Thanks for a great crate!

lakwet commented 1 year ago

Hello! Indeed, you are absolutely right: at about 20 threads, the multithreaded sort becomes slower. I suppose it is because of the memory required by the LSD sort (which needs a copy of the array). I have an AMD Ryzen 3950X (16c/32t) with 32 GB of RAM, and if I remember correctly, at about 14-16 threads it becomes slower. However, I made the (false) assumption that not a lot of people would use this sort on a CPU with so many threads. I will add all that to the readme, but for now I am on holiday; next week, probably. It is possible to choose the number of threads, but you have to use the sort function directly instead of the trait:

and you have to use the feature flag voracious_multithread. I don't know why it does not appear in the documentation.

I will also update the dependency (rayon, needed for the multithreaded version), maybe it will help.

vigna commented 1 year ago

Actually, I'm choosing the number of threads now. The problem is that I have no idea about which number to choose 😂.

lakwet commented 1 year ago

Well, it depends on the type of element to sort (u32, f32, etc.) and, roughly, on the array size. I think you also have to set the "block_size". I have done this benchmark: https://github.com/lakwet/voracious_sort/blob/master/results/peeka_sort_vs_regions_sort.md So I think 16 threads is the best (as far as I can remember), but don't set more than what your CPU has. For the block size, maybe use what I set by default for the trait implementation; you can find all the values here:

If you clone my repository, you can play with all the code I have written to benchmark my sort:
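Independently of the benchmark code in the repository, the tuning advice above (pick a thread count empirically, don't exceed the CPU) can be sketched with a minimal std-only harness. `best_thread_count` and its `sort_fn` parameter are hypothetical names for illustration; the stand-in closure ignores the thread count, whereas in practice it would call the crate's multithreaded sort with it.

```rust
use std::time::Instant;

// A minimal, std-only harness to pick the best thread count empirically:
// time the sort at each candidate and keep the fastest. `sort_fn` is a
// stand-in for any sort taking a thread count (hypothetical signature,
// not the crate's API).
fn best_thread_count<F: Fn(&mut [u64], usize)>(
    data: &[u64],
    candidates: &[usize],
    sort_fn: F,
) -> usize {
    let mut best = (candidates[0], f64::INFINITY);
    for &n in candidates {
        let mut copy = data.to_vec(); // sort a fresh copy for each candidate
        let start = Instant::now();
        sort_fn(&mut copy, n);
        let secs = start.elapsed().as_secs_f64();
        if secs < best.1 {
            best = (n, secs);
        }
    }
    best.0
}

fn main() {
    let data: Vec<u64> = (0..100_000).rev().collect();
    // Stand-in sort that ignores the thread count.
    let best = best_thread_count(&data, &[1, 2, 4], |v, _n| v.sort_unstable());
    assert!([1, 2, 4].contains(&best));
    println!("fastest candidate: {best}");
}
```

Timing noise means one run per candidate is rough; averaging a few runs per candidate would be more robust.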

lakwet commented 1 year ago

When I have time (I don't know when) I would like to improve my sort, because I don't beat the regions sort.

vigna commented 1 year ago

Just out of curiosity—why aren't you (or anyone else in the literature, in fact) comparing with block indirect sort (the default Boost parallel unstable sort method)? It's faster than anything I've tried in C++, sometimes by a very large margin (this includes parallel radix sorts, in particular for 128-bit integers).

lakwet commented 1 year ago

Because it is slower. There are benchmarks of Boost's block indirect sort against ips4, and ips4 is faster. And then, between regions sort and ips4, regions sort is faster (in almost every case, if I remember correctly). As far as I know, regions sort is the SOTA (state of the art).

vigna commented 1 year ago

The regions paper benchmarks the parlay integer sort against IPS4 (on random integers), and parlay is faster. On all hardware I could try, block indirect is faster than parlay. So I'd be really interested in a reference to the benchmarks you're quoting...

lakwet commented 1 year ago

I will look at that; maybe I read it too quickly.

vigna commented 1 year ago

I'm sorry, I have to retract that. Due to a configuration mistake I was compiling the two benchmarks with different optimization options. Parlay is faster than block indirect, although the results of my repeated benchmarks are quite odd:

Block:  446.747303s
Parlay: 494.139010s
Block:  550.888112s
Parlay: 519.433724s
Block:  587.796833s
Parlay: 507.157232s
Block:  563.625938s
Parlay: 492.325763s

This is for 40B 64-bit integers.

vigna commented 1 year ago

OK, now I have a more complete picture. Here I'm just sorting random 64-bit integers. On 1B keys, in C:

Block:  10.877221s
Parlay: 6.785890s
Block:  10.001492s
Parlay: 7.217183s
Block:  9.811406s
Parlay: 5.954816s
Block:  9.721668s
Parlay: 4.069157s

So Parlay is good (but it appears to use some additional memory). On 1B keys, in Rust:

Voracious(20) 8.511659809
Voracious(40) 6.71123186
Rayon         18.086493373
Voracious(20) 7.650926937
Voracious(40) 7.780470804
Rayon         17.084048381
Voracious(20) 7.030447277
Voracious(40) 8.034868047
Rayon         17.482521809
Voracious(20) 7.180784658
Voracious(40) 8.016780112
Rayon         19.440539774

Good, but clearly we're lagging behind the C++ sorts. Switching to 20B keys, however, is really problematic. In C:

Block:  292.837065s
Parlay: 234.913958s
Block:  309.842404s
Parlay: 231.422616s
Block:  253.400154s
Parlay: 228.669215s
Block:  245.309193s
Parlay: 203.161141s

which is coherent with the timing above. In Rust:

Voracious(20) 1038.590991388
Voracious(40) 1052.747777523

This is definitely not good. I could not run Rayon's parallel sort even with 1 GB of stack (set both with ulimit -s and with a thread builder inside the test).

Do you have an idea why voracious scales so badly? Can I do something to help diagnose the problem?

[Both tests run under transparent huge pages and both tests use the same optimization options.]
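For reference, the thread-builder approach mentioned above (raising the per-thread stack instead of, or in addition to, `ulimit -s`) looks roughly like this. A sketch only: `sort_on_big_stack` is a hypothetical helper, and the 64 MiB size is an arbitrary example (the test above used 1 GB).

```rust
use std::thread;

// Sketch: requesting a larger per-thread stack via std's thread builder,
// useful when a deeply recursive sort overflows the default stack.
// The reserved stack is virtual memory, committed only as it is used.
fn sort_on_big_stack(mut v: Vec<u64>) -> Vec<u64> {
    thread::Builder::new()
        .stack_size(64 << 20) // 64 MiB; the issue above tried 1 GB
        .spawn(move || {
            v.sort_unstable(); // run the (possibly deep) sort in this thread
            v
        })
        .expect("failed to spawn thread")
        .join()
        .unwrap()
}

fn main() {
    let sorted = sort_on_big_stack(vec![3, 1, 2]);
    assert_eq!(sorted, vec![1, 2, 3]);
}
```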

vigna commented 1 year ago

It turned out that the "in place" sort of Parlay is not in place at all, so please ignore those results. I was finally able to make the Rayon sort work:

Voracious(20) 1054.700242279
Voracious(40) 1004.468386745
Rayon         412.225890118
Voracious(20) 1058.039570514
Voracious(40) 1031.172554921
Rayon         419.484963512
Voracious(20) 1019.229894973
Voracious(40) 1071.464737528
Rayon         398.813977252
Voracious(20) 1033.258482431
Voracious(40) 1096.716848277
Rayon         447.616544883

So Rayon is definitely not bad, albeit not at the same level as the C++ sorts.

lakwet commented 1 year ago

When you say 20B keys, is it 20_000_000_000 keys? I don't have enough RAM for that kind of test. For u64 with a uniform random distribution, I get 2_798_789 µs (about 2.8 s) for 1_000_000_000 keys with my peeka sort (with 16 threads, if I remember correctly) on my Ryzen 3950X: https://github.com/lakwet/voracious_sort/blob/b73d09c8c4e7c65544a6ca7da4749fe6f237d7aa/results/peeka_sort_benchmark_1_0_0_ryzen_9_3950x_voracious_trait_u64#L133

lakwet commented 1 year ago

Does your CPU support vectorization?

vigna commented 1 year ago

Yes. Well, you're right, that's unclear; I'll use SI multipliers.

My actual workload is 160G keys (almost a TB of RAM), but 20G is sufficient to show the slowdown. I have 40 cores, so a granularity that is not fine enough might be an issue.

If you're interested I can give you ssh access to a server with a TB of RAM.

vigna commented 1 year ago

This is what rustc thinks:

rustc --print cfg -C target-cpu=native
debug_assertions
panic="unwind"
target_arch="x86_64"
target_endian="little"
target_env="gnu"
target_family="unix"
target_feature="aes"
target_feature="fxsr"
target_feature="pclmulqdq"
target_feature="popcnt"
target_feature="sse"
target_feature="sse2"
target_feature="sse3"
target_feature="sse4.1"
target_feature="sse4.2"
target_feature="ssse3"
target_has_atomic="16"
target_has_atomic="32"
target_has_atomic="64"
target_has_atomic="8"
target_has_atomic="ptr"
target_os="linux"
target_pointer_width="64"
target_vendor="unknown"
unix

vigna commented 1 year ago

I can't find a peeka_sort() in the 1.1.1 crate. Is it in development?

lakwet commented 1 year ago

For the record, I know my sort does not scale well beyond 16 threads, and I don't know what happens beyond 1_000_000_000 64-bit keys, because I only have 32 GB of RAM with my Ryzen 3950X (which, for a personal computer, is already pretty decent). But your results interest me. I like a challenge, and 160×10^9 64-bit keys is a very interesting one.

I'm interested in access to this server with 1 TB of RAM. :)

vigna commented 1 year ago

OK, just send me by email an ssh key and a preferred login name.

lakwet commented 1 year ago

> I can't find a peeka_sort() in the 1.1.1 crate. Is it in development?

It is the multithreaded sort, voracious_mt_sort. Under the hood, the multithreaded sort is the peeka sort, but because of the feature flag it does not appear in the documentation: https://github.com/lakwet/voracious_sort/blob/b73d09c8c4e7c65544a6ca7da4749fe6f237d7aa/src/lib.rs#L430

vigna commented 1 year ago

Mmmmhhh... there are missing pieces (where's the ssh-rsa?). And I need to know which login name you want.

lakwet commented 1 year ago

I just edited the comment with my login name: lakwet. Thanks!

lakwet commented 1 year ago

With my work laptop (Lenovo T490s, 2020), rustc --print cfg -C target-cpu=native gives:

debug_assertions
panic="unwind"
target_arch="x86_64"
target_endian="little"
target_env="gnu"
target_family="unix"
target_feature="adx"
target_feature="aes"
target_feature="avx"
target_feature="avx2"
target_feature="bmi1"
target_feature="bmi2"
target_feature="fma"
target_feature="fxsr"
target_feature="lzcnt"
target_feature="pclmulqdq"
target_feature="popcnt"
target_feature="rdrand"
target_feature="rdseed"
target_feature="sse"
target_feature="sse2"
target_feature="sse3"
target_feature="sse4.1"
target_feature="sse4.2"
target_feature="ssse3"
target_feature="xsave"
target_feature="xsavec"
target_feature="xsaveopt"
target_feature="xsaves"
target_has_atomic="16"
target_has_atomic="32"
target_has_atomic="64"
target_has_atomic="8"
target_has_atomic="ptr"
target_os="linux"
target_pointer_width="64"
target_vendor="unknown"
unix

The difference is AVX; if I remember correctly, the unrolling uses AVX.

vigna commented 1 year ago

OK. Try ssh lakwet@sexus.law.di.unimi.it.

lakwet commented 1 year ago

so far:

axelle@axelle-pc:~/.ssh$ ssh lakwet@sexus.law.di.unimi.it
lakwet@sexus.law.di.unimi.it: Permission denied (publickey,gssapi-keyex,gssapi-with-mic).

lakwet commented 1 year ago

Well, so far, with a CPU with AVX [screenshot, 2023-03-09]:

For 5_000_000_000 i64 elements with a uniform random distribution, mean of 10 iterations:
Voracious sort multithreaded unstable (peeka sort): 28.42s
Rayon parallel sort unstable: 39.47s

vigna commented 1 year ago

OK but this hardware has 16 cores. sexus has 40—that's where things get interesting.

lakwet commented 1 year ago

Indeed

lakwet commented 1 year ago

Do you have a computer with ~40 threads with AVX ?

vigna commented 1 year ago

No. But the problem is the scaling, not AVX. I can live without AVX; the problem is that when the array gets large, time does not scale linearly. I have the impression that in all recursive sorts, with very large arrays, there are a few steps with little or no parallelism (e.g., the first levels of quicksort, or the first level of a radix sort), and this kills the performance. Block indirect sort is incredibly clever in making the global key shifting completely parallel in a fine-grained way.

Regions sort has the same idea: avoid coarse parallelism on large arrays. Maybe that's the solution, but it's very specific, whereas block indirect sort is very generic. In fact, I think that a block indirect sort using a radix sort (e.g., ska sort or voracious) as the local sort, instead of pattern-defeating quicksort, could be the fastest you can get.
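The point about the first radix pass not having to be a sequential bottleneck can be illustrated with a small std-only sketch: each thread builds a histogram of the top byte over its own chunk, and the per-chunk histograms are merged afterwards. This assumes an 8-bit radix and is only an illustration of the idea, not the crate's or regions sort's actual code.

```rust
use std::thread;

// Per-chunk histograms of the top radix byte computed in scoped threads,
// then merged: the counting pass of an MSD radix sort stays parallel
// instead of becoming a sequential step on very large arrays.
fn parallel_top_byte_histogram(data: &[u64], n_threads: usize) -> Vec<usize> {
    let n = n_threads.max(1);
    let chunk_len = ((data.len() + n - 1) / n).max(1);
    let mut merged = vec![0usize; 256];
    thread::scope(|s| {
        let handles: Vec<_> = data
            .chunks(chunk_len)
            .map(|c| {
                s.spawn(move || {
                    let mut h = vec![0usize; 256];
                    for &x in c {
                        h[(x >> 56) as usize] += 1; // bucket by top byte
                    }
                    h
                })
            })
            .collect();
        for handle in handles {
            for (m, v) in merged.iter_mut().zip(handle.join().unwrap()) {
                *m += v; // merge the per-chunk counts
            }
        }
    });
    merged
}

fn main() {
    // 1024 values whose top byte cycles through 0..=255 four times.
    let data: Vec<u64> = (0..1024u64).map(|i| i << 56).collect();
    let hist = parallel_top_byte_histogram(&data, 4);
    assert_eq!(hist.iter().sum::<usize>(), data.len());
    assert!(hist.iter().all(|&c| c == 4));
}
```

The remaining hard part, which block indirect sort and regions sort solve in different ways, is moving the keys to their buckets with the same fine granularity.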

vigna commented 1 year ago

sexus has 40 cores but no AVX2.

lakwet commented 1 year ago

Regions sort is faster than my sort, and it probably scales. Peeka sort is a reimplementation of regions sort in Rust, but I did not manage to reach the same performance. I tried to use other fallbacks, so the behavior differs, but performance is roughly the same below 1_000_000_000 elements. Concurrency in Rust has more constraints than in C++.

lakwet commented 1 year ago

Another question about your needs: is the 16-core CPU with AVX slower than the 40-core sexus without AVX?

vigna commented 1 year ago

I think it's definitely slower, but I need 1TB of RAM.

lakwet commented 1 year ago

I have found the problem. There was, indeed, a bug for arrays bigger than 5×10^9 elements. I bought some RAM and now have 96 GB, so I can test arrays of roughly 10×10^9 elements; that is how I figured it out. It is fixed in the next 1.2.0 version of my crate (to be released this evening, I hope). I will post a message here when I release the new version, because I am curious about the result for your 160×10^9 array. However, it still uses AVX; it should work anyway without it, just a bit slower, and nothing dramatic as it was before.

lakwet commented 1 year ago

I also noticed that the best number of threads is probably the number of cores with their own cache: the memory requirements make hyperthreading (or equivalent) somewhat useless. If your 40-core CPU has no hyperthreading, it should scale (as long as the cache is not shared too much).
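A small std-only sketch of that heuristic: `std::thread::available_parallelism()` typically reports logical CPUs (hyperthreads included), so halving it is one plausible default for a memory-bound sort. `suggested_sort_threads` is a hypothetical helper, and the halving is a rule of thumb from the observation above, not a crate recommendation.

```rust
use std::thread;

// available_parallelism() usually counts logical CPUs; if hyperthreading
// buys little for a memory-bound sort, half of that is a plausible default.
fn suggested_sort_threads() -> usize {
    let logical = thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(1);
    (logical / 2).max(1)
}

fn main() {
    println!("suggested sort threads: {}", suggested_sort_threads());
}
```

On CPUs without SMT this undercounts by half, so benchmarking around the suggested value is still worthwhile.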

lakwet commented 1 year ago

Ah, one more thing: if you sort signed integers, there should be a possible upgrade, because currently the multithreaded sort is not specialized by primitive type, whereas the single-threaded sort is optimized per type. It is possible to use these optimizations in the multithreaded sort, but, you know, it takes time; I maintain this crate in my personal time.
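The classic per-type tweak for signed integers in a radix sort is to flip the sign bit, mapping i64 to u64 so that unsigned byte order matches signed order. A sketch of the kind of specialization mentioned above, not the crate's actual code; `to_radix_key` is a hypothetical name.

```rust
// Flipping the sign bit maps i64 -> u64 monotonically: i64::MIN -> 0,
// -1 -> u64::MAX >> 1, 0 -> 1 << 63, i64::MAX -> u64::MAX. Radix-sorting
// the unsigned keys then yields the signed order.
fn to_radix_key(x: i64) -> u64 {
    (x as u64) ^ (1 << 63)
}

fn main() {
    let mut vals = vec![3i64, -1, i64::MIN, 0, i64::MAX, -5];
    // Sorting by the mapped key gives the signed order.
    vals.sort_by_key(|&x| to_radix_key(x));
    assert_eq!(vals, vec![i64::MIN, -5, -1, 0, 3, i64::MAX]);
}
```

The same idea extends to floats (flip all bits of negatives, only the sign bit of positives), which is presumably why the single-threaded sort is specialized per type.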

vigna commented 1 year ago

Hey, no rush—I'm linking the C++ sort using the native interface and I'm perfectly happy. I reported the slowdown because it was really weird (and indeed you found a bug, so that was a useful report!).

lakwet commented 1 year ago

Another thing I noticed is that I bought "slower" RAM (timings: 18-22-22-42 vs 16-19-19-39), and apparently faster RAM benefits the Rayon parallel sort, probably because it makes more accesses to RAM. That is why these benchmarks are not consistent with the previous version: I now have slower RAM. (The benchmarks are running; it takes time...)

Yes, I know, but I like a challenge, and it was a good one. And indeed, thanks for the report: a bug was found and fixed.

lakwet commented 1 year ago

I published version 1.2.0. I would be very happy if you could test it on your big, big array :)

vigna commented 1 year ago

Fantastic. 130 billion random keys (1 TB of data) in 850s. That's two times faster than Boost's block indirect sort, four times faster than Rayon's parallel PDQsort. You nailed it!

OK, admittedly I did not check that the result is correct LOL. 😂

If you want to do some testing with 1TB, I can give you access back again. Let me know...

lakwet commented 1 year ago

Oh, I am so happy! I would love to do some tests with 1 TB of RAM. I will send you an SSH public key, but this weekend, because I have a busy week at work. Thanks!

vigna commented 1 year ago

Your account is still there, I just have to enable it. I need the hardware for a few days and then I'll do it.

vigna commented 1 year ago

Sorry, a student is using the hardware for his thesis. I'll let you know when it's available again...