new features [sdk] : count docs, unique docs

SebG-js commented 1 month ago

In the Python SDK, it would be nice to add the following features in the file client.py:

count all elements in the pg table (all rows)
count all docs with a unique metadata field (like filename, src or origin), since a big doc can be split into several parts before computing the embedding vector.

cutecutecat commented 1 month ago

Hi, counting the rows of a collection/table is absolutely an important feature.

However, due to the architecture of PostgreSQL, getting the exact row count of a table is a costly operation: the only way is to scan the whole table. For large tables of 1M to 10M rows, this query would take more than 1 second.

So we turn to pg_class to get an estimated row count. This value may be updated much later than the actual number of rows changes, so choose the approach that suits your use case.
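The trade-off between the two counts can be sketched in plain SQL issued from Python. This is only a minimal sketch, not the SDK's implementation: the function names and the "documents" table name are hypothetical, and cur is assumed to be any DB-API cursor connected to PostgreSQL (for example one from psycopg). The estimate reads the reltuples column of pg_class, which PostgreSQL refreshes on VACUUM and ANALYZE rather than on every insert.

```python
def estimated_row_count(cur, table: str) -> int:
    """Planner's row estimate from pg_class: fast, but possibly stale.

    On PostgreSQL 14+ the value is -1 for a table that has never been
    vacuumed or analyzed.
    """
    cur.execute(
        "SELECT reltuples::bigint FROM pg_class WHERE oid = %s::regclass",
        (table,),
    )
    return int(cur.fetchone()[0])


def exact_row_count(cur, table: str) -> int:
    """Exact row count: accurate, but scans the whole table."""
    # A table name cannot be passed as a bound parameter, so it is quoted
    # as an identifier here. Only pass trusted, unqualified table names.
    quoted = '"' + table.replace('"', '""') + '"'
    cur.execute(f"SELECT count(*) FROM {quoted}")
    return int(cur.fetchone()[0])
```

With a real connection, the calls would look like estimated_row_count(cur, "documents") and exact_row_count(cur, "documents").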

We will implement a new function PGVectoRs.row_count(estimate: bool, filter: Filter) to cover three usages:

Get estimated number of rows

Fast and acceptable in most cases: client.row_count(estimate=True)

Get accurate row count

Slow but accurate: client.row_count(estimate=False)

Conditional row count

Takes a condition and can only be used with a precise row count: client.row_count(estimate=False, filter=lambda r: r.meta.contains({"origin":"pgvecto.rs"}))
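To make the three modes concrete, here is a small in-memory model of the proposed semantics. It is an illustration under stated assumptions, not the code from the PR: records are plain dicts instead of SDK records, the stale_estimate argument stands in for the value read from pg_class, and the filter is an ordinary Python callable rather than the SDK's Filter type.

```python
from typing import Callable, Iterable, Optional

Record = dict  # hypothetical stand-in for an SDK record


def row_count(
    records: Iterable[Record],
    stale_estimate: int,
    estimate: bool = True,
    filter: Optional[Callable[[Record], bool]] = None,
) -> int:
    """Model of the three usages: estimated, exact, and conditional count."""
    if estimate:
        # An estimate is a single table-wide number, so it cannot honor
        # a per-row condition.
        if filter is not None:
            raise ValueError("filter can only be used with estimate=False")
        return stale_estimate
    if filter is None:
        return sum(1 for _ in records)  # full scan
    return sum(1 for r in records if filter(r))  # full scan with condition


records = [
    {"text": "a", "meta": {"origin": "pgvecto.rs"}},
    {"text": "b", "meta": {"origin": "pgvecto.rs"}},
    {"text": "c", "meta": {"origin": "elsewhere"}},
]

# The estimate was taken before the third row was inserted.
approx = row_count(records, stale_estimate=2, estimate=True)
exact = row_count(records, stale_estimate=2, estimate=False)
from_origin = row_count(
    records,
    stale_estimate=2,
    estimate=False,
    filter=lambda r: r["meta"].get("origin") == "pgvecto.rs",
)
```

Here approx lags behind exact, which is the staleness described above, and combining estimate=True with a filter raises an error instead of silently returning a wrong number.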

The feature PR is ready and will stay open for comments for one week. Please let me know if you have any questions or comments.

SebG-js commented 1 month ago

Thank you for your explanations. For my current use case, the table is small (~1000 docs). It could be bigger in the future. For information, I used the SDK with this temporary workaround:

search all records with embedding = None

I only use the doc count to show an increasing doc count on a frontend: when an admin adds docs to the database, he sees the doc count increase slightly.

Suppose we have inserted n = 100 000 docs (pdf files), each split into m (= 10) records (m is constant in this situation). We suppose that n and m are not always known. The expected record count is about n x m.

For example, we want to know:

how many original docs (called src) have been saved?

The answer should be n. -> count records with unique "meta.src"

As regards a particular src document "programming in python", we want to get all records/splits. [It is a new need, not exactly a conditional count.]

The length of the records list is about m in this example. It is possible to get this information in the following way: search (embedding = None, top_k = 100000) + filter on meta.src = "programming in python"
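Both needs (counting the original docs and listing the splits of one doc) can be sketched client-side once the records have been fetched, for example with the workaround above. This is a rough sketch under assumptions: records are modelled as plain dicts with a meta.src key, and group_by_src is a hypothetical helper, not an SDK function. For a large table, a server-side distinct count would avoid transferring all n x m records.

```python
from collections import defaultdict


def group_by_src(records):
    """Group records (splits) by their original document, meta.src."""
    groups = defaultdict(list)
    for record in records:
        groups[record["meta"]["src"]].append(record)
    return dict(groups)


# Two original docs (n = 2), each split into three records (m = 3).
records = [
    {"text": f"{src}, part {i}", "meta": {"src": src}}
    for src in ("programming in python", "learning sql")
    for i in range(3)
]

groups = group_by_src(records)
n_docs = len(groups)                      # number of unique meta.src values
splits = groups["programming in python"]  # all records of one original doc
```

In this toy data, n_docs is n and len(splits) is m, while len(records) is n x m.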

In conclusion, I think that you propose a good solution. Thanks!

tensorchord / pgvecto.rs-py