intel / ScalableVectorSearch

https://intel.github.io/ScalableVectorSearch/
GNU Affero General Public License v3.0

Fix MinRange initialization #28

Closed · meiravgri closed 6 months ago

meiravgri commented 7 months ago
hildebrandmw commented 7 months ago

Thank you for the PR @meiravgri! Yup, this is indeed initialized wrong - numeric_limits::lowest has a difficult time sticking in my brain :smile:.
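
For anyone skimming this later, here is the distinction in plain C++ (a generic illustration, not the SVS code):

#include <limits>

// std::numeric_limits<float>::min() is the smallest *positive* normal value
// (~1.18e-38), so a running maximum seeded with it would never be updated by
// negative inputs. lowest() is the most negative representable value, which
// is the intended starting point.
float running_max = std::numeric_limits<float>::lowest();  // not ::min()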

Before I merge this PR, can I ask you to fill out an appropriate CLA and email it to svs-maintainers@intel.com? If that introduces too much overhead, I can fix this manually and push the updated code, but I'd prefer you get credit for your contribution!

meiravgri commented 7 months ago

Hey @hildebrandmw, I noticed in the contributing guidelines that the CLA can be signed via a pull request. However, I couldn't locate that option. Could you please guide me on how to proceed with signing the CLA? If signing via email is required, could you clarify the expected format? I'm also wondering whether a digital signature is acceptable. Thank you!

meiravgri commented 7 months ago

@hildebrandmw Hi again :) I found another small mistake in the decompressor of the residual (a missing -1 in the denominator).

hildebrandmw commented 7 months ago

Hi! Sorry about the delay on our end. Finding an answer on what we can accept as far as CLAs are concerned is turning out to be more difficult than anticipated. I apologize for the delay and appreciate your patience! Thank you for the time you've put into looking through this!

I'm not as sure about the second proposed fix in the residual compressor (though I am very prepared to be wrong :smile:).

Ideally, we want the decompression of the primary plus residual to be expressed as a bit-shift of the primary added to the residual (for efficient implementation in hardware). Let $d$ be the decompression constant for the primary. Then, if $d^\prime = d / (2^n)$ is the decompression constant for the residual with $n$ bits of precision, the combination of primary $p$ and residual $r$ becomes

\begin{align}
d \cdot p + d^\prime \cdot r &= d \cdot p + \frac{d}{2^n}\cdot r \\
    &= \frac{d}{2^n} \left( \underbrace{2^n \cdot p}_{\text{bit shift}} + r \right)
\end{align}

Now, this approach does result in some loss in accuracy when $r$ hits its maximum possible value due to clamping and the asymmetry of signed integers, which is unfortunate. This happens if a value is slightly less than halfway between two levels in the primary encoding.
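
As a purely illustrative sketch of that arithmetic (hypothetical names, not the SVS API), with primary code p, signed residual code r, and primary step d:

#include <cstdint>

// Decode with the current scaling d' = d / 2^n:
//   d * p + (d / 2^n) * r  ==  (d / 2^n) * ((p << n) + r)
float decode_current(int32_t p, int32_t r, float d, unsigned n) {
    return (d / static_cast<float>(1u << n)) * static_cast<float>((p << n) + r);
}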

With the proposed change, the formula for decoding becomes

\begin{align}
d \cdot p + d^\prime \cdot r &= d \cdot p + \frac{d}{2^n-1}\cdot r \\
    &= \frac{d}{2^n-1} \left( (2^n - 1) \cdot p + r \right) \\
    &= \frac{d}{2^n-1} \left( \underbrace{2^n \cdot p}_{\text{bit shift}} - p + r \right)
\end{align}

which introduces another arithmetic operation in the decoding stage. However, this is just an integer subtraction, so it's not the end of the world. Plus, distance computations with combined primary and residuals occupy a small fraction of end-to-end querying time anyway.
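
And the corresponding sketch for the proposed scaling, showing that extra integer subtraction explicitly (again, hypothetical names):

#include <cstdint>

// Decode with the proposed scaling d' = d / (2^n - 1):
//   d * p + (d / (2^n - 1)) * r  ==  (d / (2^n - 1)) * ((p << n) - p + r)
float decode_proposed(int32_t p, int32_t r, float d, unsigned n) {
    return (d / static_cast<float>((1u << n) - 1u))
         * static_cast<float>((p << n) - p + r);
}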

As I am writing this, I am also realizing that the proposed choice of decompression has slightly less distortion on the high-end of the residual encoding due to the slightly coarser step size. We could do even better with the original scaling constant if we were willing to tweak the primary encoding for troublesome values, but it is not clear to me if the loss in accuracy of the primary encoding is worth it.

I'm curious: have you seen a noticeable increase in accuracy with this change? If we did go this route, we'd need to tweak several other places (here, here, and here) in order to try end-to-end runs.

Thanks again for your feedback and interest, I'll let you know as soon as I have information on this CLA business.

hildebrandmw commented 7 months ago

Hello again, I have some updates now! We don't have automation set up for CLA signing, so for now please fill out the appropriate CLA form (individual or entity) and email it to svs-maintainers@intel.com (see #29).

As for the second fix with the decompression constant, there is a bit more going on here. I have been viewing two-level LVQ as a single level of scalar quantization (where the second level simply adds more bits of precision) rather than as two successive applications of scalar quantization (which is what I believe this PR is going for). I think there is merit to both approaches. However, correctly switching over the implementation to cascaded scalar quantization will require a more thorough audit of the code and potentially break previously saved LVQ datasets.
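
To restate the two views in the notation from earlier in the thread, writing $B$ for the number of residual bits (the $n$ above) -- this is my own shorthand, not a quote from the paper. In the single-level view the residual simply extends the primary code by $B$ bits; in the cascaded view the residual is itself a scalar quantization of the leftover error:

\begin{align}
\text{single-level view:} \quad \hat{x} &= d \cdot p + \frac{d}{2^B}\cdot r = \frac{d}{2^B}\left(2^B \cdot p + r\right) \\
\text{cascaded view:} \quad \hat{x} &= d \cdot p + \frac{d}{2^B - 1}\cdot r
\end{align}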

Some (very) quick internal testing suggests that there is minimal accuracy difference between the two approaches on our datasets, but if you have a use case that is suffering from the current implementation, I am definitely open to changing it. We may want to change it anyways for the sake of consistency. Perhaps others have an opinion on this? @aguerreb, @marianotepper, @ibhati :smile:

meiravgri commented 7 months ago

Hi! I'll send the CLA :)

Regarding the second fix: thanks for your detailed answer! I based my suggestion on the understanding that the quantization function is applied to the residual value, as described in the article.

The quantization process of the residual operates with min = -d/2, max = d/2, and the number of intervals is always 2^B - 1, hence d' = d/(2^B - 1). (By the way, if this fix is applicable, it should also be applied in vectors.h, in ScaledBiasedWithResidual::get.)
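
A tiny sketch of the residual quantization I have in mind (hypothetical names and an unsigned code, not the actual SVS functions):

#include <algorithm>
#include <cmath>
#include <cstdint>

// Quantize a residual r in [-d/2, d/2] onto a grid with 2^B - 1 intervals of
// width d' = d / (2^B - 1); the resulting code lies in [0, 2^B - 1].
uint8_t quantize_residual(float r, float d, unsigned B) {
    float step = d / (std::pow(2.0f, static_cast<float>(B)) - 1.0f);
    float lo = -d / 2.0f;
    float clamped = std::clamp(r, lo, d / 2.0f);
    return static_cast<uint8_t>(std::lround((clamped - lo) / step));
}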

Also, out of curiosity (and correct me if I'm wrong): you chose to use a signed encoder since d' = d/(2^B - 1), where we subtracted (2^B - 1) to shift the interval step to fit within the signed range. So using a signed type avoids the addition and subtraction that sum to zero.

I haven't tested it on a real dataset yet, I'm still in the process of understanding the main concepts.

hildebrandmw commented 7 months ago

Hello again! So you are right on both counts :smile:. We think we should move towards treating two-level LVQ as cascaded scalar quantization as described in the paper, rather than the slightly different variation that is currently implemented (leading to the correct decompression scalar being $\frac{1}{2^B - 1}$). I think it is unlikely to affect performance or recall much, but it is probably worth doing.

This will mean we need to change the residual encoding to be unsigned rather than signed. The cancellation you observed that allows signed numbers to work only applies when $\Delta^{\prime} = \frac{\Delta}{2^B}$ and not when $\Delta^{\prime} = \frac{\Delta}{2^B - 1}$.

How does this sound as a path forward: the replacement of numeric_limits::min with numeric_limits::lowest is very much correct (and, I'll admit, kind of an embarrassing mistake on my end). We would like to make the changes to the two-level LVQ reconstruction on our end, as we have some internal tooling that will need to be updated, and we therefore request that this second change not be included in this PR. Is that acceptable?

I also want to reiterate how much I appreciate the time you have spent looking over the code and the questions you have brought up. Many thanks!

meiravgri commented 7 months ago

Regarding the alignment of the current implementation with the paper's approach: correct me if I'm wrong, but except for the missing -1, it appears that the process of two-level LVQ is indeed in line with the methodology described in the paper.

The code snippet provided seems to follow the approach outlined in the paper: quantizing the residual using "global quantization" as opposed to LVQ applied on the vectors.

Compression requires computing r as r = x - mu - Q and storing the interval index r/delta' (ignoring the slight drift caused by r being signed):

for (size_t i = 0; i < primary.size(); ++i) {
    float difference = static_cast<float>(data[i]) - primary.get(i);
    auto v = crunch(compressor, difference, min_s, max_s);
    cv.set(v, i);
}

The decompression process of r involves multiplying the stored value by delta':

float get(size_t i) const {
    float primary = primary_.get(i);
    float residual_step = primary_.get_scale() / std::pow(2, Residual);
    float residual = residual_.get(i) * residual_step;
    return primary + residual;
}

Could you please provide further clarification on how the current implementation deviates from the methodology described in the paper?

I'll revert the denominator change as suggested, and I look forward to seeing the enhancements you make.

My pleasure to contribute! Thank you again for your thorough review and for addressing my questions.

hildebrandmw commented 6 months ago

Certainly! You are correct in your summary. As it currently stands, the implementation in the code differs from the paper in two regards: the slightly different $\Delta^\prime$ that you observed and, once that is corrected, the small drift due to using a signed encoding. I'm in the process of checking for any differences observed in accuracy or recall on our datasets and will let you know the result.

As for expanding on the details of a full fix: the decompression routine

float get(size_t i) const {
    float primary = primary_.get(i);
    float residual_step = primary_.get_scale() / std::pow(2, Residual);
    float residual = residual_.get(i) * residual_step;
    return primary + residual;
}

is mainly used for testing and reference implementations and not used for actual distance computations. This is because it's really quite slow (especially considering that underneath this is the actual extraction of the bit encodings).

Instead, we use a bunch of SIMD machinery to fuse the bit extraction, the application of scaling constants, and the steps of a given distance computation into a single optimized kernel (that's what all the unpack_as, apply_step, etc. functions are setting up for). While the various compute_quantized functions do invoke quite a bit of templated machinery, the compilers are able to optimize/inline through most of this and generate pretty decent assembly. There are some caveats to this approach.

It's mainly these computation kernels that need a bit more attention to properly transition.
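
If it helps, here is a very rough scalar-only illustration of the fusion idea; this is not the actual SVS kernel, every name here is made up, and the real code operates on packed bit encodings with SIMD:

#include <cstddef>
#include <cstdint>
#include <vector>

// Toy stand-in for a two-level encoded vector: already-unpacked integer
// codes plus their scaling constants.
struct ToyTwoLevel {
    std::vector<int32_t> primary;
    std::vector<int32_t> residual;
    float scale;           // primary step d
    float residual_scale;  // residual step d'
};

// Squared L2 distance computed in one pass, without materializing a
// decompressed vector first.
float fused_l2(const ToyTwoLevel& v, const std::vector<float>& query) {
    float acc = 0.0f;
    for (std::size_t i = 0; i < query.size(); ++i) {
        float decoded = v.scale * static_cast<float>(v.primary[i])
                      + v.residual_scale * static_cast<float>(v.residual[i]);
        float diff = decoded - query[i];
        acc += diff * diff;
    }
    return acc;
}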

meiravgri commented 6 months ago

@hildebrandmw Can you please merge?

hildebrandmw commented 6 months ago

Sure! Thanks for your patience! #30 contains this PR and should preserve authorship. We have an internal repository with heavier CI resources (among other things) that we base this repository on, so PRs on this end need to get routed through there (otherwise it becomes a nightmare to keep track of everything).

I'm still figuring out how all this tooling works. When we're satisfied that I didn't majorly mess anything up in #30, I will close this PR with a message pointing to the commit that merged it (again, assuming I did this correctly, it should keep you as the commit author as if we merged it directly). Apologies for this extra layer of indirection, but I'm trying my best within the constraints I have :smile:.

meiravgri commented 6 months ago

@hildebrandmw Got it. Thanks for the update! :)

hildebrandmw commented 6 months ago

This PR was merged in https://github.com/IntelLabs/ScalableVectorSearch/commit/a741a460d1c70aaabc10e8c0b047646f8252aec2.

Thanks again for your contribution!