vigna / fastutil

fastutil extends the Java™ Collections Framework by providing type-specific maps, sets, lists and queues.
Apache License 2.0

RADIXSORT_NO_REC may be too large for lexicographic sorting of arrays of arrays #255

Closed: jtnystrom closed this issue 2 years ago

jtnystrom commented 3 years ago

Hi, I wanted to report a possible performance bug in LongArrays.radixSort(long[][] a). This is sorting arrays of arrays, where each item is a tuple of longs.

This method falls back on selectionSort when the data to be sorted is below a certain size, which is hardcoded as LongArrays.RADIXSORT_NO_REC = 1024 (originally defined in Arrays.drv). When profiling my application, I found that radixSort(long[][]) spent more than 90% of its time in selectionSort. This didn't seem right, and overall it was no faster than java.util.Arrays.sort with a custom comparator.
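
For reference, a comparator-based baseline of the kind mentioned above might look roughly like this (just a sketch, not necessarily the exact code I used; it transposes the column-major long[][] into rows, sorts the rows with java.util.Arrays.sort, and copies the result back):

import java.util.Arrays;

public class ComparatorBaseline {

    // Sorts the tuples (a[0][j], a[1][j], ...) lexicographically with a plain
    // comparator, as a reference point for LongArrays.radixSort(long[][]).
    public static void sortLexicographically(long[][] a) {
        int itemSize = a.length, items = a[0].length;
        long[][] rows = new long[items][itemSize];
        for (int i = 0; i < itemSize; i++)
            for (int j = 0; j < items; j++)
                rows[j][i] = a[i][j];
        Arrays.sort(rows, (x, y) -> {
            for (int k = 0; k < x.length; k++) {
                int c = Long.compare(x[k], y[k]);
                if (c != 0) return c;
            }
            return 0;
        });
        // Copy the sorted rows back into the column-major layout.
        for (int i = 0; i < itemSize; i++)
            for (int j = 0; j < items; j++)
                a[i][j] = rows[j][i];
    }
}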

I experimentally tried lowering the RADIXSORT_NO_REC cutoff, and got performance improvements with smaller values. I have tried various powers of 2. Currently 16 is the best value I have found. For 10,000,000 random items of 2 longs each, when the constant is 1024, the time to sort converges to 1500 ms on my system. When the constant is 16, it converges to 600 ms.

I am using OpenJDK 8 on Linux 5.10; my CPU is a 4-core Core i5 3570. I get the same result when I run this code on AWS Graviton-based VMs. So far I have only tested this for longs and not for the other primitives.

Maybe this sorting method should have a different cutoff constant from the other methods in Arrays.drv? RADIXSORT_NO_REC is currently used for several different purposes: sometimes to defer to quickSort, sometimes to insertionSortIndirect, and sometimes to selectionSort.

Below is a test program that I used to make the measurements.

import it.unimi.dsi.fastutil.longs.LongArrays;

import java.util.Random;

public class RadixTest {

    public static long[][] makeData(int items, int itemSize) {
        // Column-major layout: r[i][j] is component i of tuple j,
        // as expected by LongArrays.radixSort(long[][]).
        Random rand = new Random(1234);
        long[][] r = new long[itemSize][items];
        for (int i = 0; i < itemSize; i++) {
            for (int j = 0; j < items; j++) {
                r[i][j] = rand.nextLong();
            }
        }
        return r;
    }

    /**
     * Performance test program for lexicographic sort of arrays of arrays using Fastutil.
     *
     * Tested with these parameters: 10000000 2 5
     * The time for sorting converges to about 1500 ms for RADIXSORT_NO_REC = 1024,
     * and to 600 ms for RADIXSORT_NO_REC = 16.
     * @param args
     */
    public static void main(String[] args) {
        int numItems = Integer.parseInt(args[0]);
        int itemWidth = Integer.parseInt(args[1]);
        int numTimes = Integer.parseInt(args[2]);

        System.out.println(String.format("Testing %d times with data of size %d x %d", numTimes, numItems,
                itemWidth));

        for (int run = 0; run < numTimes; run++) {
            long[][] data = makeData(numItems, itemWidth);
            long start = System.currentTimeMillis();
            LongArrays.radixSort(data);
            long stop = System.currentTimeMillis();
            System.out.println(String.format("Time elapsed: %d ms",(stop - start)));
        }
    }
}
incaseoftrouble commented 3 years ago

Just out of curiosity:

1) Maybe fix the seed of the Random (new Random(1234)).
2) I have a feeling that it might be connected to the ratio between items and the optimal RADIXSORT_NO_REC cutoff. Could you try whether 16 is still optimal, or maybe a smaller value is better, if you multiply items by, say, 100?

jtnystrom commented 3 years ago

Thanks for your comments. I fixed the random seed as you suggested and tried up to 500,000,000 items (I may try even more later when I have more memory available). 16 is still fastest. The difference between 8 and 16 seems very small, however: 42.0 s vs. 41.7 s, so I'm not sure it's significant. At cutoff 64, this case took 44 s.

incaseoftrouble commented 3 years ago

Ah, it might be because the lexicographic order of randomly generated tuples can be decided quickly with high probability (the first components almost always differ).

One more idea: Could you fill the first half of the items with fixed values (say, zero) and only the second half randomly? Something like

        Random rand = new Random(1234);
        long[][] r = new long[itemSize][items];
        for (int i = 0; i < itemSize; i++) {
            for (int j = 0; j < items; j++) {
                r[i][j] = j < items / 2 ? 0 : rand.nextLong();
            }
        }
        return r;

and then check the dependence on the ratio? It's just a hunch, though! Right now I don't have time to look into this more deeply, unfortunately.

To clarify: I think that the optimal cutoff may depend on the cost of comparing a single pair of items, which should grow with item size and similarity.
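
(One cheap way to check that hunch would be to measure how many components a lexicographic comparison actually inspects on the two kinds of data; the sizes and names below are made up for illustration.)

import java.util.Random;

public class CompareDepth {

    // Counts how many components a lexicographic comparison of tuples j1 and j2
    // (in the column-major layout used above) inspects before reaching a decision.
    static int depth(long[][] a, int j1, int j2) {
        int d = 0;
        for (long[] row : a) {
            d++;
            if (Long.compare(row[j1], row[j2]) != 0) break;
        }
        return d;
    }

    public static void main(String[] args) {
        int items = 1000000, itemSize = 2, samples = 1000000;
        Random rand = new Random(1234);
        long[][] a = new long[itemSize][items];
        for (int i = 0; i < itemSize; i++)
            for (int j = 0; j < items; j++)
                a[i][j] = rand.nextLong(); // swap in the "less random" generator to compare
        long total = 0;
        for (int s = 0; s < samples; s++)
            total += depth(a, rand.nextInt(items), rand.nextInt(items));
        System.out.println("Average components inspected: " + (double) total / samples);
    }
}

On fully random data the average should stay very close to 1, while on partially zeroed data many comparisons have to look deeper or end in a full tie, making each comparison, and hence the quadratic fallback, more expensive.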

jtnystrom commented 3 years ago

Hi, I generated a second test dataset as you suggested in the following way. Hopefully this is what you meant.

        Random rand = new Random(1234);
        long[][] r = new long[itemSize][items];
        for (int i = 0; i < itemSize; i++) {
            for (int j = 0; j < items; j++) {
                if (i == 0) {
                    r[i][j] = 0;
                } else {
                    r[i][j] = j < items / 2 ? 0 : rand.nextLong();
                }
            }
        }
        return r;

Let's call this the "less random" dataset. I tested more thoroughly for various values of the constant and for sizes 10,000,000, 100,000,000 and 500,000,000, for both this dataset (less random) and the original one (fully random).

My application sorts relatively small arrays of arrays repeatedly, so personally that is the case I care about. I'm not sure how to fully interpret these findings, but based on just these results it seems that a value of 64 would help my case and also wouldn't hurt any of the other cases. Of course I can just have a custom version of this method in my application, so I'm not in a rush to have this changed.
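
(For illustration, one shape such a custom version could take, sidestepping the radix-sort cutoff entirely, is an indirect, comparator-based quicksort of an index permutation using fastutil's IntArrays.quickSort. This is only a sketch, not the code I actually use.)

import it.unimi.dsi.fastutil.ints.IntArrays;

public class CustomLexSort {

    // Sorts the tuples (a[0][j], a[1][j], ...) lexicographically by sorting an
    // index permutation with an indirect quicksort and then permuting every row.
    public static void sort(final long[][] a) {
        final int items = a[0].length;
        final int[] perm = new int[items];
        for (int j = 0; j < items; j++) perm[j] = j;
        IntArrays.quickSort(perm, (j1, j2) -> {
            for (long[] row : a) {
                int c = Long.compare(row[j1], row[j2]);
                if (c != 0) return c;
            }
            return 0;
        });
        // Apply the permutation to each row of the column-major data.
        for (long[] row : a) {
            long[] sorted = new long[items];
            for (int j = 0; j < items; j++) sorted[j] = row[perm[j]];
            System.arraycopy(sorted, 0, row, 0, items);
        }
    }
}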

vigna commented 2 years ago

Sorry to get back to this so late.

I think there's a kind of copy-paste bug here. The RADIXSORT_NO_REC threshold is supposed to switch to quicksort. And that's what happens with the basic radixSort method. For some reason, the other implementations switch to a quadratic sort, which is crazy. I'll try to fix this, but it would be great if you could test it afterwards.
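
(Rough arithmetic behind why this matters: with the cutoff at 1024, each fallback call to selectionSort performs about n(n-1)/2 ≈ 5 × 10^5 comparisons for n = 1024, versus roughly n log2 n ≈ 10^4 for a comparison sort like quicksort, a factor of about 50, which would be consistent with the profile above showing most of the time spent in selectionSort.)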

jtnystrom commented 2 years ago

Hi @vigna, sounds great, I'll be happy to test it anytime.

vigna commented 2 years ago

So, after examining the code I think I know what happened. The problem is that for the methods you are using there is no quicksort implementation. I wrote a selection sort implementation to fill that gap, but somehow I did not realize that the threshold could not possibly be the same. So I have now switched to quicksort where possible. It is actually quite subtle, because in some cases (e.g., the indirect ones) you have to stabilize after quicksorting, since indirect radix sort can be made stable.
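
For reference, the stabilization step amounts to something like this (an illustrative sketch for a single long key, not fastutil's actual code): after the unstable quicksort has ordered the permutation by key, each run of equal keys is re-sorted by original index.

import java.util.Arrays;

public class StabilizeSketch {

    // perm is a permutation of 0..n-1 such that key[perm[0]] <= key[perm[1]] <= ...
    // Sorting the indices within each run of equal keys makes the permutation stable.
    public static void stabilize(final int[] perm, final long[] key) {
        int i = 0;
        while (i < perm.length) {
            int j = i + 1;
            while (j < perm.length && key[perm[j]] == key[perm[i]]) j++;
            Arrays.sort(perm, i, j); // ascending original indices within the run
            i = j;
        }
    }
}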

In retrospect, I don't know whether it even makes sense: it is possible that a standard indirect radix sort followed by stabilization is faster. It's just that I can't benchmark everything :(.

bf3892a519fa6c71b868db8c338368dd464d86a1 should fix the bug you reported.

jtnystrom commented 2 years ago

Hi @vigna, I tested your change and it works totally fine. No problems. Thanks again for your work on this comprehensive library.