tensorchord / pgvecto.rs

Scalable, Low-latency and Hybrid-enabled Vector Search in Postgres. Revolutionize Vector Search, not Database.
https://docs.pgvecto.rs/getting-started/overview.html
Apache License 2.0

Crash on indexing with 0.2.1 docker container #488

Closed: vade closed this issue 1 month ago

vade commented 1 month ago

Hi friends. Firstly, thanks for this project. The VBASE filtered + semantic search looks really incredible, and we are eager to get PGVecto.rs working for our use case.

We are testing with a faceted/filtered search synthetic dataset, using the public 0.2.1 Docker container.

We create three tables: a collection has documents, a document has chunks, and the chunks contain the embeddings. We build roughly 15M embeddings across close to 2,500 documents in roughly 10 collections. Our embeddings are random for now.
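
For reference, a minimal sketch of that kind of schema (the table and column names and the 768-dim embedding size here are illustrative, not our exact DDL):

-- Hypothetical schema for the collection -> document -> chunk layout.
CREATE EXTENSION IF NOT EXISTS vectors;

CREATE TABLE collection (
    id   bigserial PRIMARY KEY,
    name text NOT NULL
);

CREATE TABLE document (
    id            bigserial PRIMARY KEY,
    collection_id bigint NOT NULL REFERENCES collection (id)
);

CREATE TABLE chunk (
    id          bigserial PRIMARY KEY,
    document_id bigint NOT NULL REFERENCES document (id),
    embedding   vector(768) NOT NULL  -- dimension is illustrative
);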

Without indexing, we are able to search using VBASE and our DB seems to run, but queries are slow as one might imagine.
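
A sketch of the kind of filtered query we run (the filter value and vector literal are placeholders, and the literal must match the column dimension; the search-mode GUC line reflects my reading of the VBASE docs and may differ by version):

-- Hypothetical faceted query: filter first, then order by L2 distance (<->).
SET vectors.search_mode TO vbase;  -- GUC name per my reading of the docs
SELECT id
FROM chunk
WHERE document_id = 42
ORDER BY embedding <-> '[0.1, 0.2, 0.3]'::vector  -- placeholder literal
LIMIT 10;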

Our Environment:

Docker Desktop on an M2 Mac Pro

Running tensorchord/pgvecto-rs:pg16-v0.2.1

Attempting to index using:

CREATE INDEX ON chunk USING vectors (embedding vector_l2_ops);

runs for a bit, but eventually crashes with:

WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and repeat your command.
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
The connection to the server was lost. Attempting reset: Failed.

in the psql client, and in the logs:

2024-05-24 13:23:33 2024-05-24 17:23:33.684 UTC [1] LOG:  background worker "vectors" (PID 164) was terminated by signal 9: Killed
2024-05-24 13:23:33 2024-05-24 17:23:33.689 UTC [1] LOG:  terminating any other active server processes
2024-05-24 13:23:33 2024-05-24 17:23:33.703 UTC [1] LOG:  all server processes terminated; reinitializing
2024-05-24 13:23:33 2024-05-24 17:23:33.769 UTC [195] LOG:  database system was interrupted; last known up at 2024-05-24 17:22:14 UTC
2024-05-24 13:23:33 2024-05-24 17:23:33.771 UTC [198] FATAL:  the database system is in recovery mode
2024-05-24 13:23:33 2024-05-24 17:23:33.973 UTC [195] LOG:  database system was not properly shut down; automatic recovery in progress
2024-05-24 13:23:33 2024-05-24 17:23:33.976 UTC [195] LOG:  redo starts at 11/A040E8E0
2024-05-24 13:23:34 2024-05-24 17:23:34.080 UTC [195] LOG:  invalid record length at 11/A45680E8: expected at least 24, got 0
2024-05-24 13:23:34 2024-05-24 17:23:34.080 UTC [195] LOG:  redo done at 11/A45680B8 system usage: CPU: user: 0.02 s, system: 0.07 s, elapsed: 0.10 s
2024-05-24 13:23:34 2024-05-24 17:23:34.083 UTC [196] LOG:  checkpoint starting: end-of-recovery immediate wait
2024-05-24 13:23:34 2024-05-24 17:23:34.203 UTC [196] LOG:  checkpoint complete: wrote 9208 buffers (56.2%); 0 WAL file(s) added, 4 removed, 0 recycled; write=0.090 s, sync=0.019 s, total=0.121 s; sync files=9, longest=0.015 s, average=0.003 s; distance=66918 kB, estimate=66918 kB; lsn=11/A45680E8, redo lsn=11/A45680E8
2024-05-24 13:23:34 2024-05-24 17:23:34.211 UTC [1] LOG:  database system is ready to accept connections

We looked at #409 and tried the REINDEX command, but it also crashed.

Note that this is a fresh container/volume with no upgrade from any previous version, with fresh tables and fresh data.

Thank you for any ideas

vade commented 1 month ago

Following #409, I also tried manually deleting the pg_data/pg_vector indexes on disk and running REINDEX DATABASE, which seemed to work.

For 15M vectors, prior to indexing, EXPLAIN ANALYZE on a faceted search took roughly 23 seconds.

Post-index, it took 6 seconds.

vade commented 1 month ago

I'm not going to close this yet, only because there might be some interesting data in here to help debug why the first index pass fails.

vade commented 1 month ago

Interesting. It seems like reindexing does actually crash, but it happens in the background.

VoVAllen commented 1 month ago

What's your hardware? How much memory do you have?

VoVAllen commented 1 month ago

Some part may already have crashed. Can you try it with a completely fresh database? Or manually delete all files under pgdata/pgvecto_rs and run REINDEX?
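
In other words, something like this (the on-disk directory name is as mentioned above and may vary by version):

-- Remove the extension's on-disk index files (everything under
-- $PGDATA/pgvecto_rs here; the exact directory may vary by version),
-- then rebuild every index from SQL:
REINDEX DATABASE;  -- on PostgreSQL 15+ this reindexes the current database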

vade commented 1 month ago

Hi there!

I'm running Docker on an M2 Mac Pro with 32 GB of RAM. Docker has 5 CPUs and 20 GB of memory allocated.
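
Given the "signal 9: Killed" in the log above (which looks like the container's OOM killer), one thing we may experiment with is capping build parallelism through the index options string. A sketch, assuming the [optimizing] section and optimizing_threads key from our reading of the 0.2 docs (values are illustrative):

-- Hypothetical: recreate the index with explicit options to limit the
-- number of background build threads; key names per the 0.2 docs.
CREATE INDEX ON chunk USING vectors (embedding vector_l2_ops)
WITH (options = $$
[optimizing]
optimizing_threads = 2
[indexing.hnsw]
m = 12
ef_construction = 300
$$);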

I was able to run tensorchord/pgvecto-rs:pg16-v0.3.0-alpha.1, and while it seems to build the index from the get-go, it's not obvious to me from EXPLAIN ANALYZE whether the index is being used in our queries.
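
A sketch of the check we run (placeholder vector literal; what we look for is an index scan on the vectors index rather than a sequential scan in the plan):

-- Hypothetical check: run the query under EXPLAIN ANALYZE and inspect
-- the plan for an index scan on the vectors index.
EXPLAIN ANALYZE
SELECT id
FROM chunk
ORDER BY embedding <-> '[0.1, 0.2, 0.3]'::vector  -- must match column dimension
LIMIT 10;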

Manually deleting that folder and running REINDEX did work for 0.2.1, but it seems unnecessary for 0.3.0.

Question: after CREATE INDEX returns/completes, should I expect higher-than-idle CPU usage on the Postgres container? It seems like indexing is still running.

Thanks for any insight @VoVAllen

VoVAllen commented 1 month ago

The behavior changed between 0.2.1 and 0.3.0. In 0.2.1, the HNSW index is constructed asynchronously: when CREATE INDEX finishes, queries are initially answered by a brute-force scan while the real HNSW index is built in background threads. Once construction is finished, queries use the HNSW index and you'll see they become much faster.

Question: after CREATE INDEX returns/completes, should I expect higher-than-idle CPU usage on the Postgres container? It seems like indexing is still running.

Yes, it's still running in the background process.

In 0.3.0, we decided to let CREATE INDEX finish only once the real index is constructed. So CREATE INDEX takes much longer, but queries will use the index directly after that.

VoVAllen commented 1 month ago

0.2.1 is a stable version. You can use SELECT * FROM pg_vector_index_stat; to check whether the real index has finished building.
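
For example (what the rows report, such as whether background indexing is still running and how many tuples are sealed into the finished index, is per my reading of the 0.2 docs):

-- Poll until the background build is done; one row per vectors index.
SELECT * FROM pg_vector_index_stat;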

VoVAllen commented 1 month ago

What's the error you hit on 0.3?

vade commented 1 month ago

Thank you @VoVAllen for the information about the differences between 0.3.0 and 0.2.1. I'm happy testing on the alpha, as for now we are able to be flexible.

Right now, with 0.3.0, I'm not sure I'm seeing the performance I expect, but moving from 0.2.1 to 0.3.0 has removed all crashing and disconnections in our psql client, which is awesome!

vade commented 1 month ago

Also, re: performance, I'm not implying PGVecto.rs is slow; rather, we are trying to find settings that work for our expected load. I suspect this can be closed, as the main crashing issue is resolved, and most of my remaining concerns are better suited to a Discord conversation / educating me on expected performance.

Thank you again!

gaocegege commented 1 month ago

Also, re: performance, I'm not implying PGVecto.rs is slow; rather, we are trying to find settings that work for our expected load. I suspect this can be closed, as the main crashing issue is resolved, and most of my remaining concerns are better suited to a Discord conversation / educating me on expected performance.

Thank you again!

Questions about performance are welcome in Discord!