Open LaurenSpiegel opened 8 years ago
Interesting. this sounds like it is 1000 times too slow :D
Looking at the code, I'm wondering about three points: 1) Does the fact that we're doing an in-place shuffle rather than building a new array have an impact on "shuffling" time ? 2) Here, we're swapping values potentially multiple times within the array before we're done. Could it have an impact also ? 3) The fact that we're using the cryptographically secure randRange, with all the small computations that we're asking Node.JS to do, and that loops over parseInt(GenerateSomeBinarydata().toString('hex'), 16) may really not be an optimal way of shuffling an array for a simple bootstrap list....
Luckily, this shuffling should only be done only once at startup over a bootstraplist, and as such should not be much of a pain in runtime. Can you confirm this for me, @LaurenSpiegel ?
@DavidPineauScality, yes this shuffle function is only used when S3 starts up and instantiates a bucketClient and an sproxydclient. It is not used in the metadata repo (they have their own shuffle function in dbapi). Note that sproxydclient has the same function in its repo rather than calling from Arsenal so that will have to be updated to point to the super new Arsenal shuffle when it is updated. I will create an issue in sproxydclient linking to this issue.
Let's see...
function shuffle(array) {
for (let i = array.length - 1; i > 0; i--) {
const randIndex = randomRange(0, i);
const randIndexVal = array[randIndex];
array[randIndex] = array[i];
array[i] = randIndexVal;
}
return array;
};
Let's assume our input array is of length 8. Here goes:
function randomRange(min, max) {
if (max < min) {
throw new Error("Invalid range");
}
if (min === max) {
return min;
} else if (min < max) {
const range = (max - min);
const bits = bitsNeeded(range);
// decide how many bytes we need to draw from nextBytes: drawing less
// bytes means being more efficient
const bytes = bitsToBytes(bits);
// we use a mask as an optimization: it increases the chances for the
// candidate to be in range
const mask = createMaskOnes(bits);
let candidate;
do {
candidate = parseInt(nextBytes(bytes).toString('hex'), 16) & mask;
} while (candidate > range);
return (candidate + min);
}
}
This is called with (min=0, max=8)
so we move into the last block. range
becomes 8, then we calculate the number of bits that are required to represent this number:
function bitsNeeded(number) {
if (number < 0) {
throw new Error("Input must be greater than or equal to zero");
} else if (number === 0) {
return 1;
} else {
return Math.floor(Math.log2(number)) + 1;
}
}
OK, 8 = 2 \ 3, so bits
becomes 4. Next step:
function bitsToBytes(numBits) {
if (numBits < 0) {
throw new Error("Input must be greater than or equal to zero");
}
return Math.ceil(numBits / 8);
}
so bytes
is 1, as expected. Finallly, the mask calculation:
function createMaskOnes(numBits) {
if (numBits < 0)
throw new Error("Input must be greater than or equal to zero");
return Math.pow(2, numBits) - 1;
}
and mask
is 15 (or 01111 in binary). Now, we go in the loop. Let's assume nextBytes
, toString
and parseInt
are infinitely fast (they're not, but that's beside the point). We take the result, which is some random integer, and &
it with mask
. Then, until candidate <= range
(with range = 8
), we repeat the loop.
Consider what's going on here. The parseInt(nextBytes(bytes).toString('hex'), 16)
call gives us some random number (in this case within [0..255]
because bytes = 1
). After the &
with mask
, in this case mask
being 1111
(binary), we get a number which is somewhere between 0 and... 15 (inclusive). So there's at every iteration an almost-50% chance we have to do one more iteration. If you're "unlucky", it might take a while to hit a 'matching' random value.
Or nextBytes
might be a bit slow of course, that could also explain.
What you could consider doing here is implement a very simple (and fast) PRNG (minstd/Lehmer could do perfectly well to shuffle something, see https://en.wikipedia.org/wiki/Lehmer_random_number_generator), seed it with a 'cryptographic' random number (or even a plain timestamp), and simply modulo the PRNG result every time so the generated value fits the range you're aiming for.
Also, there must be a better/more efficient way to turn a couple of bytes into the number they represent besides turning them into a hex-encoded string and parsing it base-16?
Edit: Of course 8 is a bit of a 'worst case' (and so is every power of 2) causing a near-50% miss rate (or 50% chance the loop spins). Any 2 ** N - 1
(e.g. 15) range
needs only a single iteration, and others are somewhere in-between. But because the shuffle
function calculates a randomRange(0, N)
for every N
from the length of the input array to 0, you hit all cases, including the ones most likely to spin.
On some runs, the shuffle function can take 1-2 ms to complete with an array of just 5 elements.