Add ability to sort GPRT Buffers on the GPU

This PR adds the ability to sort GPRT buffers on the GPU using a four way radix sort like that described here: https://www.semanticscholar.org/paper/Fast-4-way-parallel-radix-sorting-on-GPUs-Ha-Kr%C3%BCger/eaa887377239049f2f6d55f23830ce5f2bb6f38c

With this PR we add two functions, a gprtBufferSort and a gprtBufferSortPayload. At the moment, these two routines only work on 32 bit integer keys (though we could easily extend this to 64 bit integers, floats, doubles...). The payload is also currently constrained to a 32 bit value, but the type can be anything.

These functions take as input buffers (keys, or keys and values) to sort in place, and a "scratch" buffer. This scratch buffer holds additional resources required for the sort (a second buffer to scatter values into, for count scanning and reduction). If this scratch buffer is recycled between sorts, expensive allocation of this intermediate data can be avoided, further improving performance.

Intentions of this PR are to enable LBVH construction down the road for things like untruncated KNN queries.

gprt-org / GPRT

Add ability to sort GPRT Buffers on the GPU #70