tensorchord / pgvecto.rs

Scalable, Low-latency and Hybrid-enabled Vector Search in Postgres. Revolutionize Vector Search, not Database.
https://docs.pgvecto.rs/getting-started/overview.html
Apache License 2.0
1.53k stars 60 forks source link

feat: push down where clause into vector index #503

Open VoVAllen opened 1 week ago

VoVAllen commented 1 week ago

Ref: https://github.com/tensorchord/pgvecto.rs/issues/502

Early stop index iterator with where distance<0.99 clause

cutecutecat commented 1 week ago

Definition

As Index cond pushdown could only accept format like: [Indexed variable] [operator] [constant] = val < ROW('[0.5,0.5,0.5]'::vector, '<->', 0.99)

Example:

SELECT * FROM t 
WHERE val < ROW('[0.5,0.5,0.5]'::vector, '<->', 0.99) 
ORDER BY val <-> '[0.5,0.5,0.5]' limit 10;

Semantics: val < ROW('[0.5,0.5,0.5]'::vector, '<->', 0.99) equals val <->'[0.5,0.5,0.5]'::vector < 0.99

val <= ROW('[0.5,0.5,0.5]'::vector, '<=>', 0.99) equals val <=>'[0.5,0.5,0.5]'::vector <= 0.99

Operations

Validate:

Pushdown:

Execution:

graph LR
7[Index query] -->|select| 4[SELECT Result]
1[Where Clause] -->|not pushdown| 3[Filter]
3[Filter] -->|apply| 4[Index SCAN Result]
4[Index SCAN Result] -->|filter| 8[SELECT Result]

1[Where Clause] -->|pushdown| 2[Index Cond]
2[Index Cond] -->|early stop| 5[Index Query]
5[Index query] -->|select| 6[SELECT Result]

Experiment

Conclusion: The source of the original query statement is outside the random() data distribution, and HNSW has very low accuracy for these edge data.

If input vectors: $vectors(3) \in [0, 1]$ Bad: $source = [0, 0, 0]$ or $source = [-1, -1, 0]$ Good: $source = [0.5, 0.5, 0.5]$

The result of original query: $source(640) \in [-1, 1]$

filter inserted filtered vectors searched / recall
cos dis < 0.97 0.45M 202 / 0.045% 3 / 1.4%
cos dis < 0.98 0.45M 957 / 0.2% 12 / 1.3%
cos dis < 0.99 0.45M 3981 / 0.8% 73 / 1.8%
filter inserted filtered vectors searched / recall
dot dis < -0.4 0.45M 223 / 0.049% 5 / 2.2%
dot dis < -0.2 0.45M 1843 / 0.4% 20 / 1.1%
dot dis < 0 0.45M 9744 / 2.1% 114 / 1.2%
filter inserted filtered vectors searched / recall
l2 dis < 190 0.45M 137 / 0.034% 8 / 5.8%
l2 dis < 195 0.45M 1418 / 0.3% 195 / 13.7%
l2 dis < 200 0.45M 8448 / 1.8% 1825 / 21.6%
l2 dis < 205 0.45M 33431 / 7.4% 17712 / 52.9%

The result of improved query: $source(640) = [0.5, 0.5, ..., 0.5]$

filter inserted filtered vectors searched vectors / recall
cos dis < 0.11 0.45M 12 / 0.026% 10 / 83.3%
cos dis < 0.115 0.45M 423 / 0.94% 376 / 88.9%
cos dis < 0.117 0.45M 1261 / 2.80% 1232 / 97.7%
cos dis < 0.12 0.45M 5134 / 11.4% 5104 / 99.4%
filter inserted filtered vectors searched / recall
dot dis < -175 0.45M 16 / 0.035% 16 / 100.0%
dot dis < -173 0.45M 96 / 0.21% 93 / 96.8%
dot dis < -171 0.45M 613 / 1.36% 586 / 95.5%
dot dis < -170 0.45M 1421 / 3.15% 1361 / 95.7%
dot dis < 168 0.45M 6477 / 14.39% 6130 / 94.6%
filter inserted filtered vectors searched vectors
l2 dis < 46 0.45M 12 / 0.026% 7 / 58.3%
l2 dis < 47 0.45M 143 / 0.31% 56 / 39.1%
l2 dis < 48 0.45M 951 / 2.11% 851 / 89.4%
l2 dis < 49 0.45M 4696 / 10.43% 4621 / 98.4%
l2 dis < 50 0.45M 17033 / 37.85% 16893 / 99.2%
mertalev commented 3 days ago

Thanks for working on this! Out of curiosity, does this handle relaxed monotonicity? Some time ago I experimented with a plpgsql function that iterates through results and stops at the first distance above the threshold, but I noticed the recall was lower because there can be tuples below the threshold after that.

VoVAllen commented 2 days ago

@mertalev Ideally this can be done with relaxed monotonicity, and vbase original paper covered this scenario with high recall at 98%. We're trying to replicate the results now