jbellis / jvector

JVector: the most advanced embedded vector search engine

Cosine metric calculation uses different precision in default provider than in Panama and native #357

Closed k-jamroz closed 1 month ago

k-jamroz commented 1 month ago

The cosine metric calculation uses different precision in its final step in the default provider than in the Panama/native providers, so they produce slightly different results. Usually that should not matter, but we had some test failures after enabling Panama because different search results were returned.

The following test (after adjusting visibility for simplicity):

    @Test
    public void testCosinePrecision() {
        VectorUtilSupport dvus = new DefaultVectorUtilSupport();
        VectorUtilSupport pvus = new PanamaVectorUtilSupport();
        ArrayVectorFloat v1 = new ArrayVectorFloat(new float[]{99, 150});
        ArrayVectorFloat v2 = new ArrayVectorFloat(new float[]{99, 149});
        assertThat(pvus.cosine(v1, v2)).isEqualTo(dvus.cosine(v1, v2));
    }

produces

expected: 0.99999523f
 but was: 0.9999953f

The cause is the last line of the cosine metric calculation. In the default provider it is:

    return (float) (sum / Math.sqrt((double) norm1 * (double) norm2));

In the Panama and native providers it lacks the cast to double and loses some precision:

    // float aMagnitude, bMagnitude;
    return (float) (sum / Math.sqrt(aMagnitude * bMagnitude));

For consistency and to avoid surprises, all providers should cast to double; Math.sqrt takes a double anyway.
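
For anyone who wants to reproduce this without adjusting provider visibility, here is a minimal standalone sketch (plain Java, not JVector code) that applies the two final-step formulas to the vectors from the test:

    // Standalone sketch, not JVector code: applies the two final-step formulas
    // from above to the vectors {99, 150} and {99, 149} used in the test.
    public class CosinePrecisionDemo {
        public static void main(String[] args) {
            float[] a = {99, 150};
            float[] b = {99, 149};
            float sum = 0, aMag = 0, bMag = 0;
            for (int i = 0; i < a.length; i++) {
                sum += a[i] * b[i];   // dot product accumulator
                aMag += a[i] * a[i];  // squared norm of a
                bMag += b[i] * b[i];  // squared norm of b
            }
            // Default provider style: the norms are widened to double before multiplying.
            float viaDouble = (float) (sum / Math.sqrt((double) aMag * (double) bMag));
            // Panama/native style: the norms are multiplied in float first.
            float viaFloat = (float) (sum / Math.sqrt(aMag * bMag));
            // Should print the two values from the assertion failure above.
            System.out.println(viaDouble + " vs " + viaFloat);
        }
    }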

jbellis commented 1 month ago

Thanks for pointing this out. I'm pretty sure the right fix is the other way around, to use float everywhere. @jkni ?

jbellis commented 1 month ago

I do note though that if you're hoping to get 100% consistent results from floating point math, you're probably going to be disappointed anyway. See discussion here https://twitter.com/benwtrent/status/1820946252361568630

jkni commented 1 month ago

Agreed with @jbellis on all fronts -- I don't really think there's much of a reason to go double in the default provider, as we don't want to lose lane-level parallelism in the other providers. The Panama Vector API intentionally leaves the order of cross-lane reductions undefined, and these aren't truly associative with f32 lanes, so we don't expect 100% consistent results.
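
To illustrate the associativity point, here is a standalone sketch (values chosen only for illustration, not JVector code): reordering float additions can change the result, so a cross-lane reduction whose order is unspecified can legitimately produce different sums.

    // Standalone sketch: float addition is not associative, so reduction order matters.
    public class FloatReductionOrder {
        public static void main(String[] args) {
            float a = 1e8f, b = -1e8f, c = 1f;
            System.out.println((a + b) + c);  // 1.0
            System.out.println(a + (b + c));  // 0.0 -- (-1e8f + 1f) rounds back to -1e8f
        }
    }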

k-jamroz commented 1 month ago

I did not really think about the order of additions (thanks for the link!), but that can be another source of surprising differences between providers. There is not much we can do about that.

I agree that using only float seems to make more sense. I created a PR with that approach.
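
For clarity, a guess at the shape of that final step under the float-only approach (a sketch of the idea, not necessarily what the PR does):

    // Hypothetical sketch of a float-only final step, not necessarily the PR:
    // the norms are multiplied without widening to double, matching the
    // Panama/native providers; Math.sqrt still receives a double argument.
    static float cosineTail(float sum, float norm1, float norm2) {
        return (float) (sum / Math.sqrt(norm1 * norm2));
    }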