nessex / rdst

A flexible, native Rust implementation of unstable radix sort.
Apache License 2.0

Best algorithm for random integers #9

Open vigna opened 1 year ago

vigna commented 1 year ago

Based on your experience and testing, do you have a suggested best tuning for sorting random 128-bit keys?

nessex commented 12 months ago

128-bit keys are hard... The defaults in the library are pretty much the best I was able to achieve for random, well-distributed 128-bit data on a few different devices I had on hand. That said, it varies considerably based on a number of factors.

Are you able to provide any more details on these? Particularly a type definition if the keys are embedded in a struct or tuple. I'll try to replicate your case and see if there's a better combo.

I do know that voracious_sort does a little better on 128-bit keys most of the time. Presumably this is due to the way the byte lookups for keys have been unrolled. I wasn't able to entirely eliminate the cost of the additional flexibility around keys that rdst provides, so voracious_sort may be better here if it's compatible with your dataset.

If you have fairly large structs / tuples that contain the keys, you could try utilizing the Regions sort in this library via a custom tuner. I didn't ever get the performance I expected out of it for plain integer slices, but for larger structs it does theoretically minimize the expensive copying of data a little. Ska sort might also be good here on the single-threaded side. These also sometimes come out on top when the data is unevenly distributed.

For a custom tuner for very large 128-bit key datasets, you'll likely want the same pattern as the StandardTuner: a multi-threaded sort for the first (most significant) byte, then single-threaded sorts for the remaining bytes.

For that first byte, there are three multi-threaded choices: Scanning sort, Recombinating sort, and Regions sort.

You can see more detail on each algorithm at the top of their respective files under sorts/ in this repo. The standard tuner typically chooses Scanning sort for the largest datasets (>50M keys), then Recombinating sort for anything between 260K and 50M keys. From there, as the work is broken down into distinct per-core workloads, it uses Ska sort and finally LSB sort, as they are single-threaded and much more computationally efficient. This worked out to be the best on average across an M1 and an M2 Pro MacBook, an m6g.metal AWS Graviton 2 (64 cores, 512GB RAM), a low-power Xeon, and a Ryzen 3600X for pure Vec<u32|u64|u128>. While not 100% optimal on all of them, the difference was generally negligible.
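If you want to start from that shape, here is a rough sketch driven purely by item count. The 50M and 260K cutoffs are the ones quoted above; the 10K cutoff is an arbitrary placeholder, `counts` is assumed to be the per-byte histogram of the current workload (so its sum is the item count), and the `Algorithm::Recombinating` / `Algorithm::Lsb` variant names should be checked against rdst's `Algorithm` enum:

```rust
use rdst::tuner::{Algorithm, Tuner, TuningParams};

// Sketch of a StandardTuner-like tuner: pick the algorithm from the number of
// items in the current workload. Thresholds and variant names as hedged above.
struct ThresholdTuner;

impl Tuner for ThresholdTuner {
    fn pick_algorithm(&self, _p: &TuningParams, counts: &[usize]) -> Algorithm {
        // Assumed: summing the byte histogram gives the number of items to
        // sort at this step.
        let len: usize = counts.iter().sum();

        match len {
            // Largest workloads: multi-threaded Scanning sort.
            50_000_000.. => Algorithm::Scanning,
            // Mid-sized workloads: multi-threaded Recombinating sort.
            260_000.. => Algorithm::Recombinating,
            // Per-core workloads: single-threaded Ska sort...
            10_000.. => Algorithm::Ska,
            // ...and LSB sort for the smallest chunks (placeholder cutoff).
            _ => Algorithm::Lsb,
        }
    }
}
```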

For larger structs, you may theoretically have better luck with Regions sort and Ska sort, as they are more memory-efficient. If it's just Vec<u128>, you're only likely to see marginal improvements over the StandardTuner by changing the thresholds or switching out the algorithms to suit your hardware.

Let me know if you can provide any more details on your situation and I can try to test out a few options!

nessex commented 12 months ago

Ah, I just followed the trail of GitHub issues. This tuning is worth trying:

```rust
use rdst::tuner::{Algorithm, Tuner, TuningParams};

struct MyTuner;

impl Tuner for MyTuner {
    fn pick_algorithm(&self, p: &TuningParams, _counts: &[usize]) -> Algorithm {
        match p.level {
            // Multi-threaded Scanning sort for the first (most significant) byte...
            15 => Algorithm::Scanning,
            // ...single-threaded Ska sort for every level below it.
            _ => Algorithm::Ska,
        }
    }
}
```
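For completeness, plugging a custom tuner in uses the same builder calls as in the benchmark below:

```rust
// `data` is the Vec being sorted, e.g. Vec<u128> or a Vec of structs implementing RadixKey.
data.radix_sort_builder()
    .with_tuner(&MyTuner {})
    .sort();
```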

Also worth trying, particularly for larger item counts, is Regions sort at the top level (the commented-out option in the benchmark below).
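A minimal sketch of that variant, using a hypothetical `MyRegionsTuner` name so it can sit alongside `MyTuner`:

```rust
use rdst::tuner::{Algorithm, Tuner, TuningParams};

// Same shape as MyTuner above, but with Regions sort for the first byte
// instead of Scanning sort. (MyRegionsTuner is an illustrative name only.)
struct MyRegionsTuner;

impl Tuner for MyRegionsTuner {
    fn pick_algorithm(&self, p: &TuningParams, _counts: &[usize]) -> Algorithm {
        match p.level {
            15 => Algorithm::Regions,
            _ => Algorithm::Ska,
        }
    }
}
```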

For larger item counts, I think Regions will become useful. For smaller item counts on my M1 MBP, Ska sort actually wins when used at all levels, I presume because the structs are a little larger and Ska sort does less copying.

Full benchmark for testing:

```rust
use block_pseudorand::block_rand;
use criterion::{
    black_box, criterion_group, criterion_main, AxisScale, BatchSize, BenchmarkId, Criterion,
    PlotConfiguration, Throughput,
};
use rdst::{RadixKey, RadixSort};
use std::time::Duration;
use rdst::tuner::{Algorithm, Tuner, TuningParams};

const TOTAL_ITEMS: usize = 160_000_000;

struct MyTuner;

impl Tuner for MyTuner {
    fn pick_algorithm(&self, p: &TuningParams, _counts: &[usize]) -> Algorithm {
        match p.level {
            15 => Algorithm::Scanning,
            // Also worth trying, particularly if you increase the count
            // 15 => Algorithm::Regions,
            _ => Algorithm::Ska,
        }
    }
}

#[derive(Debug, Clone, Copy)]
pub struct ComplexStruct {
    pub key: [u64; 2],
    pub val: u64,
}

impl RadixKey for ComplexStruct {
    const LEVELS: usize = 16;

    #[inline]
    fn get_level(&self, level: usize) -> u8 {
        (self.key[1 - level / 8] >> (level % 8) * 8) as u8
    }
}

fn gen_input(n: usize) -> Vec<ComplexStruct> {
    let data: Vec<u64> = block_rand(n);
    let data_2: Vec<u64> = block_rand(n);
    let data_3: Vec<u64> = block_rand(n);

    (0..n)
        .into_iter()
        .map(|i| ComplexStruct {
            key: [data[i], data_2[i]],
            val: data_3[i],
        })
        .collect()
}

fn full_sort_complex(c: &mut Criterion) {
    let mut input_sets: Vec<(&str, Vec<ComplexStruct>)> =
        vec![("random", gen_input(TOTAL_ITEMS))];

    let tests: Vec<(&str, Box<dyn Fn(Vec<ComplexStruct>)>)> = vec![
        (
            "rdst_custom",
            Box::new(|mut input| {
                input
                    .radix_sort_builder()
                    .with_tuner(&MyTuner {})
                    .sort();
                black_box(input);
            }),
        ),
        (
            "rdst_standard",
            Box::new(|mut input| {
                input.radix_sort_builder().sort();
                black_box(input);
            }),
        ),
    ];

    let mut group = c.benchmark_group("complex_sort");
    group.sample_size(10);
    group.measurement_time(Duration::from_secs(5));
    group.warm_up_time(Duration::from_secs(5));
    group.plot_config(PlotConfiguration::default().summary_scale(AxisScale::Logarithmic));

    for set in input_sets.iter_mut() {
        let l = set.1.len();
        group.throughput(Throughput::Elements(l as u64));

        for t in tests.iter() {
            let name = format!("{} {}", set.0, t.0);
            group.bench_with_input(BenchmarkId::new(name, l), set, |bench, set| {
                bench.iter_batched(|| set.1.clone(), &*t.1, BatchSize::SmallInput);
            });
        }
    }

    group.finish();
}

criterion_group!(complex_sort, full_sort_complex,);
criterion_main!(complex_sort);
```

And in Cargo.toml:

```toml
[[bench]]
name = "complex_sort"
harness = false
```

Run with:

```
RUSTFLAGS="-C opt-level=3 -C target-cpu=native" cargo bench --profile=release --features=bench complex_sort
```

With StandardTuner and 160M items on an M1 MBP:

```
Benchmarking complex_sort/random rdst_standard/160000000: Collecting 10 samples in estimated 23.215 s
complex_sort/random rdst_standard/160000000
                        time:   [1.3425 s 1.5079 s 1.8094 s]
                        thrpt:  [88.427 Melem/s 106.10 Melem/s 119.18 Melem/s]
```

With that custom tuner:

```
Benchmarking complex_sort/random rdst_custom/160000000: Collecting 10 samples in estimated 27.631 s
complex_sort/random rdst_custom/160000000
                        time:   [1.2497 s 1.2860 s 1.3312 s]
                        thrpt:  [120.19 Melem/s 124.42 Melem/s 128.03 Melem/s]
```

That structure of tuning purely on levels is convenient if you are tuning for a single target. But once you move toward supporting varying input sizes, you'll also want to branch on the number of items being sorted.
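A sketch of that combined shape for the ComplexStruct case above: keep the level match, but also branch on the item count. The 100M and 1M cutoffs are arbitrary placeholders rather than measured thresholds, and `counts` is again assumed to be the per-byte histogram of the current workload:

```rust
use rdst::tuner::{Algorithm, Tuner, TuningParams};

// Hypothetical count-aware variant of MyTuner: multi-threaded Scanning sort
// for the first byte of very large inputs, Regions sort for the first byte of
// mid-sized inputs, and Ska sort everywhere else. Cutoffs are placeholders
// that show the pattern, not tuned values.
struct CountAwareTuner;

impl Tuner for CountAwareTuner {
    fn pick_algorithm(&self, p: &TuningParams, counts: &[usize]) -> Algorithm {
        // Assumed: summing the byte histogram gives the item count for this step.
        let len: usize = counts.iter().sum();

        match (p.level, len) {
            (15, 100_000_000..) => Algorithm::Scanning,
            (15, 1_000_000..) => Algorithm::Regions,
            _ => Algorithm::Ska,
        }
    }
}
```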