lanterndata / lantern

PostgreSQL vector database extension for building AI applications
https://lantern.dev
GNU Affero General Public License v3.0

infer dimensions #88

Closed ezra-varady closed 1 year ago

ezra-varady commented 1 year ago

This should address #55 by letting the index dimension be inferred automatically for arrays of real and int. It preserves the user's ability to specify a dimension: when dims is left at its default, lantern now checks the column's dimension and uses that instead if it differs. I haven't tested against older versions of postgres, and I'm not familiar with the API, so this may need some additional work to be backwards compatible.
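
The fallback rule described above can be sketched in a few lines. This is only an illustration of the decision logic in Python, not the extension's C code: `resolve_index_dims`, `DEFAULT_DIMS`, and the sentinel value `0` are hypothetical names and values chosen for the example, and the handling of an empty column (keeping the default) is an assumption, since the thread does not say what happens in that case.

```python
from typing import Optional

# Hypothetical sentinel meaning "the user did not pass dims".
# The real default used by the extension is not stated in this thread.
DEFAULT_DIMS = 0


def resolve_index_dims(user_dims: int, column_dims: Optional[int]) -> int:
    """Pick the dimension to build the index with.

    user_dims   -- value of the dims option (DEFAULT_DIMS if unspecified)
    column_dims -- dimension observed in the indexed column, or None if
                   it could not be determined (for example, an empty table)
    """
    # An explicit dims value always wins, so existing usage is preserved.
    if user_dims != DEFAULT_DIMS:
        return user_dims
    # dims was left at the default: use the column's dimension if it differs.
    if column_dims is not None and column_dims != DEFAULT_DIMS:
        return column_dims
    # Assumption: with nothing to infer from, keep the default.
    return DEFAULT_DIMS
```

For example, `resolve_index_dims(DEFAULT_DIMS, 3)` yields `3`, while `resolve_index_dims(128, 3)` keeps the user's `128`.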

ezra-varady commented 1 year ago

The API appears to have been different in postgres 11; will look into it later this evening.

ezra-varady commented 1 year ago

There's actually an issue with GetHnswIndexDimensions. If we infer the dimension, the dims relopt never gets set, and that breaks scans. Unfortunately, afaict the rd_options field is still null while the index is being built, and my various attempts to set it manually haven't worked. Marking as draft for a bit while I add logic to fix this.

EDIT: fixed this, but I'm seeing odd results on the distance function regression tests. These two diffs are representative, but there are several others:

@@ -102,13 +102,13 @@
     FROM small_world_l2
     ORDER BY vector <-> array[0,1,0] LIMIT 7
 ) v ORDER BY v.dist, v.id;
-                                                      QUERY PLAN                                                       
------------------------------------------------------------------------------------------------------------------------
- Sort  (cost=0.70..0.72 rows=7 width=48)
+                                                     QUERY PLAN                                                     
+--------------------------------------------------------------------------------------------------------------------
+ Sort  (cost=14.36..14.38 rows=7 width=48)
    Sort Key: v.dist, v.id
-   ->  Subquery Scan on v  (cost=0.00..0.60 rows=7 width=48)
-         ->  Limit  (cost=0.00..0.53 rows=7 width=52)
-               ->  Index Scan using small_world_l2_vector_idx on small_world_l2  (cost=0.00..81.42 rows=1070 width=52)
+   ->  Subquery Scan on v  (cost=0.00..14.26 rows=7 width=48)
+         ->  Limit  (cost=0.00..14.19 rows=7 width=52)
+               ->  Index Scan using small_world_l2_vector_idx on small_world_l2  (cost=0.00..16.22 rows=8 width=52)
                      Order By: (vector <-> '{0,1,0}'::real[])
 (6 rows)

@@ -161,8 +161,8 @@
 -----+------
  010 | 0.00
  011 | 0.29
- 110 | 0.29
  111 | 0.42
+ 000 | 1.00
  001 | 1.00
  100 | 1.00
  101 | 1.00
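
As a sanity check on the second hunk, here is a minimal Python sketch that recomputes the expected ordering. It rests on assumptions not stated in the thread: that each id spells out its vector's components (`011` is `[0,1,1]`), that the query vector is `[0,1,0]` as in the first hunk, and that the second column is cosine distance, which is consistent with 0.29 = 1 - 1/sqrt(2) and 0.42 = 1 - 1/sqrt(3). Returning 1.0 for the zero vector simply mirrors the `000 | 1.00` row shown above.

```python
import math


def cosine_distance(a, b):
    """Cosine distance 1 - cos(a, b) between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Assumption: mirrors the 1.00 reported for id 000 in the diff.
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


query = [0, 1, 0]

# Assumption: ids are the three components written as a bit string.
vectors = {
    format(i, "03b"): [int(bit) for bit in format(i, "03b")]
    for i in range(8)
}

# Rank by distance rounded to two decimals, then by id.
ranked = sorted(
    vectors,
    key=lambda vid: (round(cosine_distance(vectors[vid], query), 2), vid),
)

for vid in ranked:
    print(vid, format(cosine_distance(vectors[vid], query), ".2f"))
```

Under these assumptions `110` ties with `011` at 0.29 and belongs ahead of `111`, so its absence from the new output looks like a wrong result rather than a harmless reordering.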

Ngalstyan4 commented 1 year ago

Hi @ezra-varady,

This might have been caused by some issues with indexes on real[] columns. @dqii addressed those in #87, which was just merged to main. Could you rebase and see whether the issues you mentioned above persist?