intel / x86-simd-sort

C++ template library for high performance SIMD based sorting algorithms
BSD 3-Clause "New" or "Revised" License
843 stars 53 forks source link

Merge sort and AVX512 #101

Open sergii-mamedov opened 10 months ago

sergii-mamedov commented 10 months ago

Hi!

Many thanks for your contribution to speeding up sorting in numpy. Wanted to ask if there are any plans to speed up merge/tim sort with AVX2/AVX512?

r-devulap commented 10 months ago

Hi @sergii-mamedov that isn't in my to do list (at least not yet) unless there is a good enough justification to work on those. Is there a reason why merge/tim sort will be preferred over quicksort?

sergii-mamedov commented 10 months ago

Hi @r-devulap I was supported the METASPACE project. Sorting is one of the most time-consuming steps. We have from a hundred to ten thousand already sorted arrays (m/z spectra), which we need to combine into one array and sort. The total size of arrays ranges from hundreds of megabytes to hundreds of gigabytes. I tested quicksort vs mergesort (in numpy terminology) two years ago - mergesort was 5-10 times faster. Last weekend I tested the latest release of numpy - mergesort is still about twice as fast on my data. Also, a feature of merge/tim sort is that it is a stable variant of sorting, that is, the order of elements is determined. This is important for this task.

r-devulap commented 10 months ago

curious as to what data type and distribution you have in your application. On NumPy benchmarks, quicksort is slower than merge sort only for ordered and reverse arrays:

             quick   float64        ('random',)         37.9±0.02μs
              quick   float64        ('ordered',)         37.3±0.4μs
              quick   float64       ('reversed',)         37.8±0.2μs
              quick   float64        ('uniform',)        5.44±0.04μs
              quick   float64    ('sorted_block', 10)    36.2±0.08μs
              quick   float64   ('sorted_block', 100)    39.7±0.05μs
              quick   float64   ('sorted_block', 1000)    43.5±0.1μs
              merge   float64        ('random',)          609±0.7μs
              merge   float64        ('ordered',)        18.3±0.03μs
              merge   float64       ('reversed',)        12.7±0.04μs
              merge   float64        ('uniform',)        18.3±0.03μs
              merge   float64    ('sorted_block', 10)      169±1μs
              merge   float64   ('sorted_block', 100)     123±0.8μs
              merge   float64   ('sorted_block', 1000)   68.2±0.06μs