Closed by SebG-js 1 month ago
Hi, counting the rows of a collection/table is certainly an important feature.
However, due to the architecture of PostgreSQL, getting the exact row count of a table is a costly operation: the only way to do it is to scan the whole table. For large tables with 1M to 10M rows, this query would take more than 1 second.
So we turn to pg_class to get an estimated row count. This value may be updated well after the actual number of rows has changed, so choose the approach that suits your use case.
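To make the trade-off concrete, here is a minimal sketch of the two queries involved. It only builds SQL strings and does not talk to a database. The helper names (`quote_ident`, `exact_count_sql`, `ESTIMATE_SQL`) are hypothetical and are not part of the SDK. The estimate is assumed to come from the `reltuples` column of `pg_class`, and the lookup by `relname` ignores schemas for brevity.

```python
def quote_ident(name: str) -> str:
    """Quote a PostgreSQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def exact_count_sql(table: str) -> str:
    """Exact count: PostgreSQL has to scan the table (or an index) to answer."""
    return f"SELECT count(*) FROM {quote_ident(table)}"


# Estimated count: reads the planner statistic stored in pg_class.
# It is refreshed by VACUUM / ANALYZE, so it can be stale, and on
# PostgreSQL 14+ it is -1 for a table that has never been analyzed.
# The table name is passed as a query parameter (%s, DB-API style).
ESTIMATE_SQL = "SELECT reltuples::bigint FROM pg_class WHERE relname = %s"

print(exact_count_sql("docs"))
print(ESTIMATE_SQL)
```

The estimate query reads a single catalog row, which is why it stays fast regardless of table size.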
We will implement a new function, PGVectoRs.row_count(estimate: bool, filter: Filter), to cover three usages:
Fast and acceptable in most cases
client.row_count(estimate=True)
Slow but accurate
client.row_count(estimate=False)
With conditions; can only be used with a precise row count
client.row_count(estimate=False, filter=lambda r: r.meta.contains({"origin":"pgvecto.rs"}))
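The three usages above can be illustrated with a small in-memory stand-in. This is only a sketch of the intended semantics, not the SDK implementation: `InMemoryClient` is hypothetical, the filter is a plain Python predicate on a dict rather than the SDK's `Filter` type, and raising `ValueError` when a filter is combined with `estimate=True` is an assumption about how the restriction might be enforced.

```python
from typing import Callable, Optional


class InMemoryClient:
    """Hypothetical stand-in that mimics the proposed row_count semantics."""

    def __init__(self, records: list, stale_estimate: int) -> None:
        self._records = list(records)
        # Plays the role of the pg_class statistic, which may lag behind.
        self._estimate = stale_estimate

    def row_count(
        self,
        estimate: bool = True,
        filter: Optional[Callable[[dict], bool]] = None,
    ) -> int:
        if estimate:
            if filter is not None:
                # Assumption: a filter cannot be applied to an estimate.
                raise ValueError("filter requires estimate=False")
            return self._estimate
        if filter is None:
            return len(self._records)  # full "scan"
        return sum(1 for r in self._records if filter(r))


records = [
    {"origin": "pgvecto.rs"},
    {"origin": "pgvecto.rs"},
    {"origin": "other"},
]
client = InMemoryClient(records, stale_estimate=2)

fast = client.row_count(estimate=True)
exact = client.row_count(estimate=False)
filtered = client.row_count(
    estimate=False, filter=lambda r: r["origin"] == "pgvecto.rs"
)
print(fast, exact, filtered)
```

Here the estimate (2) deliberately differs from the exact count (3) to show that the fast path can be out of date.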
The feature PR is ready and will be posted for one week. Please let me know if you have any questions or comments.
Thank you for the explanation. For my current use case, the table is small (~1000 docs). It could grow in the future. For information, I used the SDK with this temporary workaround:
I only use the doc count to show an increasing number on a frontend: when an admin adds docs to the database, he sees the doc count increase slightly.
Suppose we have inserted n = 100,000 docs (PDF files), each split into m (= 10) records (m is constant in this situation). We suppose that n and m are not always known. The expected record count is about n × m.
For example, we want to know:
The answer should be n. -> Count the distinct values of "meta.src" across all records.
The record list length is about m in this example. This information can be obtained as follows: search(embedding=None, top_k=100000) + filter on meta.src = "programming in python".
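As a rough illustration of the two counts discussed above, the sketch below works on records already fetched into memory. The record layout (a dict with a `meta` dict holding `src`) and the helper `count_by_source` are assumptions for the example, not the SDK's actual types.

```python
from collections import Counter


def count_by_source(records: list, key: str = "src") -> Counter:
    """Map each distinct meta[key] value to its number of records."""
    return Counter(r["meta"][key] for r in records)


# Two docs: one split into 3 records, the other into 2.
records = [
    {"meta": {"src": "programming in python"}, "text": "part 1"},
    {"meta": {"src": "programming in python"}, "text": "part 2"},
    {"meta": {"src": "programming in python"}, "text": "part 3"},
    {"meta": {"src": "postgres internals"}, "text": "part 1"},
    {"meta": {"src": "postgres internals"}, "text": "part 2"},
]

per_source = count_by_source(records)
n_docs = len(per_source)                           # number of distinct docs (n)
m_python = per_source["programming in python"]     # records for one doc (m)
total = sum(per_source.values())                   # all records (about n x m)
print(n_docs, m_python, total)
```

For large tables it would be cheaper to let the database do this. If the metadata were stored in a JSONB column named `meta` (an assumption), a `count(DISTINCT meta->>'src')` query would return n without transferring the records.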
In conclusion, I think you propose a good solution. Thanks!
In the Python SDK, it would be nice to add the following features in the file client.py:
Count all elements in the pg table (all rows).
Count all docs with a unique metadata field (like filename, src, or origin). A big doc can be split into several parts before computing the embedding vector.