Context
We want to support more compressed vector representations (not only 1bit), so this PR introduces support for INT8 quantization.
The quantization we implemented differs from the usual quantization performed in neural networks. In our case the implementation is similar to what qdrant does in the simple case without quantiles (see https://github.com/qdrant/quantization/blob/0caf67d96f022a792bda2e41fa878ba1e113113f/quantization/src/encoded_vectors_u8.rs#L34 or https://qdrant.tech/articles/scalar-quantization/).
The main difference is that a quantized vector in the neural-network case has two parameters, A: f32 and Z: u8, and the forward conversion looks like this: $X = A (X_q - Z)$ where $X_q \in [0..256) \cap \mathbb{Z}$.
Our (and qdrant's) approach is slightly different: we store two f32 parameters, A: f32 and S: f32, with the following forward conversion: $X = A X_q + S$.
We chose this approach because it is more generic and represents skewed distributions that do not cover zero much better than the first approach.
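To make the (A, S) scheme concrete, here is a minimal Rust sketch of per-vector quantization and the forward conversion; the function names and the min/max choice of alpha and shift are illustrative assumptions, not the actual implementation in this PR.

```rust
// A minimal sketch of the (A, S) scalar quantization described above; function
// names and the min/max choice of alpha/shift are illustrative assumptions,
// not the actual implementation in this PR.
fn quantize(src: &[f32]) -> (Vec<u8>, f32, f32) {
    // Map [min, max] of this vector onto the full u8 range [0, 255].
    let min = src.iter().copied().fold(f32::INFINITY, f32::min);
    let max = src.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let alpha = if max > min { (max - min) / 255.0 } else { 1.0 };
    let shift = min;
    let data = src
        .iter()
        .map(|&x| ((x - shift) / alpha).round().clamp(0.0, 255.0) as u8)
        .collect();
    (data, alpha, shift)
}

// Forward conversion: X = alpha * X_q + shift.
fn dequantize(data: &[u8], alpha: f32, shift: f32) -> Vec<f32> {
    data.iter().map(|&q| alpha * q as f32 + shift).collect()
}
```

As an illustration of why this matters: a vector whose components all lie in, say, [10.0, 10.2] can still use the full u8 range because the shift moves the interval, whereas the zero-anchored $A (X_q - Z)$ form with Z: u8 must keep zero inside its representable range and spends most of its precision on values that never occur.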
The layout for int8-quantized vectors is the following (a decoding sketch is given after the field descriptions):
[data[0] as u8] [data[1] as u8] ... [data[dims - 1] as u8] [_ as u8; alignment_padding]*
[alpha as f32] [shift as f32] [padding as u8] [trailing_bytes as u8] [4 as u8]
every data byte represents a single quantized vector component ($X_q$)
"alignment_padding" has a size from 0 to 3 bytes in order to pad the content to a multiple of 4 = sizeof(float)
the "trailing_bytes" byte specifies the number of bytes in the "alignment_padding"
the last 'type' byte is mandatory for float8 vectors and equals 4
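Under these rules a blob can be decoded from the end: the last byte gives the type, the byte before it the padding size, and alpha/shift sit at the start of the fixed 11-byte trailer. Below is a hypothetical decoding sketch that follows this layout literally, assuming little-endian f32 encoding (the byte order is not specified above); the function name, return type, and error handling are illustrative, not the actual implementation.

```rust
// Hypothetical decoder for the blob layout described above (little-endian f32 assumed).
fn decode_vector8(blob: &[u8]) -> Option<(Vec<u8>, f32, f32)> {
    // Trailer: [alpha: f32][shift: f32][padding: u8][trailing_bytes: u8][type: u8 = 4]
    const TRAILER: usize = 4 + 4 + 1 + 1 + 1;
    if blob.len() < TRAILER || blob[blob.len() - 1] != 4 {
        return None; // the type byte must equal 4 for float8 vectors
    }
    let trailing_bytes = blob[blob.len() - 2] as usize; // size of alignment_padding, 0..=3
    let dims = blob.len().checked_sub(TRAILER + trailing_bytes)?;
    let trailer = &blob[blob.len() - TRAILER..];
    let alpha = f32::from_le_bytes(trailer[0..4].try_into().ok()?);
    let shift = f32::from_le_bytes(trailer[4..8].try_into().ok()?);
    let data = blob[..dims].to_vec(); // each byte is one quantized component X_q
    Some((data, alpha, shift))
}
```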
Changes
Introduce FLOAT8 / F8_BLOB column type
Implement vector8 conversion function
Implement all necessary operations for vector8: conversions from/to other types, cosine distance, L2 distance
Add support for float8 neighbors compression in vector index
Expose vector_distance_l2 function
It's hard to validate it without having it as a sqlite function