feat: push down where clause into vector index

VoVAllen commented 1 week ago

Ref: https://github.com/tensorchord/pgvecto.rs/issues/502

Early stop index iterator with where distance<0.99 clause

cutecutecat commented 1 week ago

Definition

As Index cond pushdown could only accept format like: [Indexed variable] [operator] [constant] = val < ROW('[0.5,0.5,0.5]'::vector, '<->', 0.99)

Example:

SELECT * FROM t 
WHERE val < ROW('[0.5,0.5,0.5]'::vector, '<->', 0.99) 
ORDER BY val <-> '[0.5,0.5,0.5]' limit 10;

Semantics: val < ROW('[0.5,0.5,0.5]'::vector, '<->', 0.99) equals val <->'[0.5,0.5,0.5]'::vector < 0.99

val <= ROW('[0.5,0.5,0.5]'::vector, '<=>', 0.99) equals val <=>'[0.5,0.5,0.5]'::vector <= 0.99

Operations

Validate:

source [0.5,0.5,0.5] in WHERE equals in ORDER BY, else won't push down
operator <->(L2) in WHERE equals to index opclass, else won't push down

Pushdown:

operator </<=
threshold 0.99

Execution:

graph LR
7[Index query] -->|select| 4[SELECT Result]
1[Where Clause] -->|not pushdown| 3[Filter]
3[Filter] -->|apply| 4[Index SCAN Result]
4[Index SCAN Result] -->|filter| 8[SELECT Result]

1[Where Clause] -->|pushdown| 2[Index Cond]
2[Index Cond] -->|early stop| 5[Index Query]
5[Index query] -->|select| 6[SELECT Result]

Experiment

Conclusion: The source of the original query statement is outside the random() data distribution, and HNSW has very low accuracy for these edge data.

If input vectors: $vectors(3) \in [0, 1]$ Bad: $source = [0, 0, 0]$ or $source = [-1, -1, 0]$ Good: $source = [0.5, 0.5, 0.5]$

The result of original query: $source(640) \in [-1, 1]$

filter	inserted	filtered vectors	searched / recall
`cos dis < 0.97`	0.45M	202 / 0.045%	3 / 1.4%
`cos dis < 0.98`	0.45M	957 / 0.2%	12 / 1.3%
`cos dis < 0.99`	0.45M	3981 / 0.8%	73 / 1.8%

filter	inserted	filtered vectors	searched / recall
`dot dis < -0.4`	0.45M	223 / 0.049%	5 / 2.2%
`dot dis < -0.2`	0.45M	1843 / 0.4%	20 / 1.1%
`dot dis < 0`	0.45M	9744 / 2.1%	114 / 1.2%

filter	inserted	filtered vectors	searched / recall
`l2 dis < 190`	0.45M	137 / 0.034%	8 / 5.8%
`l2 dis < 195`	0.45M	1418 / 0.3%	195 / 13.7%
`l2 dis < 200`	0.45M	8448 / 1.8%	1825 / 21.6%
`l2 dis < 205`	0.45M	33431 / 7.4%	17712 / 52.9%

The result of improved query: $source(640) = [0.5, 0.5, ..., 0.5]$

filter	inserted	filtered vectors	searched vectors / recall
`cos dis < 0.11`	0.45M	12 / 0.026%	10 / 83.3%
`cos dis < 0.115`	0.45M	423 / 0.94%	376 / 88.9%
`cos dis < 0.117`	0.45M	1261 / 2.80%	1232 / 97.7%
`cos dis < 0.12`	0.45M	5134 / 11.4%	5104 / 99.4%

filter	inserted	filtered vectors	searched / recall
`dot dis < -175`	0.45M	16 / 0.035%	16 / 100.0%
`dot dis < -173`	0.45M	96 / 0.21%	93 / 96.8%
`dot dis < -171`	0.45M	613 / 1.36%	586 / 95.5%
`dot dis < -170`	0.45M	1421 / 3.15%	1361 / 95.7%
`dot dis < 168`	0.45M	6477 / 14.39%	6130 / 94.6%

filter	inserted	filtered vectors	searched vectors
`l2 dis < 46`	0.45M	12 / 0.026%	7 / 58.3%
`l2 dis < 47`	0.45M	143 / 0.31%	56 / 39.1%
`l2 dis < 48`	0.45M	951 / 2.11%	851 / 89.4%
`l2 dis < 49`	0.45M	4696 / 10.43%	4621 / 98.4%
`l2 dis < 50`	0.45M	17033 / 37.85%	16893 / 99.2%

mertalev commented 3 days ago

Thanks for working on this! Out of curiosity, does this handle relaxed monotonicity? Some time ago I experimented with a plpgsql function that iterates through results and stops at the first distance above the threshold, but I noticed the recall was lower because there can be tuples below the threshold after that.

VoVAllen commented 2 days ago

@mertalev Ideally this can be done with relaxed monotonicity, and vbase original paper covered this scenario with high recall at 98%. We're trying to replicate the results now

tensorchord / pgvecto.rs