timescale / pgvectorscale

A complement to pgvector for high performance, cost efficient vector search on large workloads.
PostgreSQL License
1.38k stars 58 forks source link

Missing/Incorrect Index Statistics #132

Open abdelfattahradwan opened 2 months ago

abdelfattahradwan commented 2 months ago

I noticed that when using pgvectorscale's indices, they never report an idx_scans number greater than zero.

For example, using the following table:

create table snippets
(
    id          integer                  not null primary key generated always as identity,
    public_id   uuid                     not null unique,
    name        text                     not null,
    description text                     not null,
    content     text                     not null,
    embedding   vector(1536)             not null,
    created_at  timestamp with time zone not null default now(),
    modified_at timestamp with time zone not null default now()
);

create index on snippets (name asc);

create index on snippets using diskann (embedding);

create index on snippets (created_at);

create index on snippets (modified_at);

And running the following query:

explain analyse
select snippets.public_id,
       snippets.name,
       snippets.description,
       snippets.content,
       snippets.created_at,
       snippets.modified_at,
       (snippets.embedding <=> '[-0.03248204,0.00062958454,-0.021850731,...]'::vector(1536)) as distance
from snippets
order by distance
limit 10;

Outputs the following query plan:

Limit  (cost=2.21..4.17 rows=10 width=1133) (actual time=0.248..0.370 rows=10 loops=1)
  ->  Index Scan using snippets_embedding_idx on snippets  (cost=2.21..6.91 rows=24 width=1133) (actual time=0.247..0.368 rows=10 loops=1)
"        Order By: (embedding <=> '[-0.03248204,0.00062958454,-0.021850731,...]'::vector)"
Planning Time: 0.170 ms
Execution Time: 0.390 ms

That tells me that the index is being used. But when I run the following query:

select *
from pg_stat_user_indexes
where indexrelname = 'snippets_embedding_idx'
order by idx_scan desc;
I get the following result: relid indexrelid schemaname relname indexrelname idx_scan last_idx_scan idx_tup_read idx_tup_fetch
21333 21344 public snippets snippets_embedding_idx 0 null 5269 5269

That, contradictory to the previously mentioned query plan, indicates that the index has never been used.

I observe this behaviour in all my database instances. I even spun up a new container running timescale/timescaledb-ha:pg16 and observed the same behaviour.

cevian commented 2 months ago

Thanks we'll look into it. Seems to be a bug