timescale / python-vector

https://timescale.github.io/python-vector/
Apache License 2.0

columns are hard coded. #27

Open srinivas-gurani opened 3 months ago

srinivas-gurani commented 3 months ago

https://github.com/timescale/python-vector/blob/34e51abff2f401f0e8aea71d5537b193a6fe34cb/timescale_vector/client.py#L768

    query = '''
    SELECT
        id, metadata, contents, embedding, {distance} as distance
    FROM
       {table_name}
    WHERE 
       {where}
    {order_by_clause}
    LIMIT {limit}
    '''.format(distance=distance, order_by_clause=order_by_clause, where=where, table_name=self.table_name, limit=limit)
    return (query, params)

Can we select which columns we need, or pass different column names?

cevian commented 3 months ago

@srinivas-gurani what's the use case here? Are all the same columns present, just under different names, or is there a different schema altogether?

The library creates the tables as well and assumes a certain layout of the tables. It's easy to make the names configurable but I am wondering if you actually need some deeper changes.
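To make the "configurable names" option concrete, here is a minimal sketch of what it could look like. `ColumnNames`, `quote_ident`, and `build_search_query` are hypothetical names for illustration and are not part of `timescale_vector`; the defaults simply mirror the column names hard-coded in the query quoted above, and the sketch assumes the table still has the same four logical columns.

```python
from dataclasses import dataclass


def quote_ident(name: str) -> str:
    """Quote a PostgreSQL identifier: wrap in double quotes, double any inner quotes."""
    return '"' + name.replace('"', '""') + '"'


@dataclass(frozen=True)
class ColumnNames:
    # Hypothetical config object. Defaults match the hard-coded names in the issue.
    id: str = "id"
    metadata: str = "metadata"
    contents: str = "contents"
    embedding: str = "embedding"


def build_search_query(
    table_name,
    distance,
    where="TRUE",
    order_by_clause="",
    limit=10,
    columns=ColumnNames(),
    select=("id", "metadata", "contents", "embedding"),
):
    """Build the search SQL with caller-chosen column names and column subset.

    `select` lists logical fields of ColumnNames; each is mapped to the actual
    column name. `distance`, `where`, and `order_by_clause` are passed through
    unchanged, as in the original query.
    """
    selected = ", ".join(quote_ident(getattr(columns, field)) for field in select)
    parts = [
        f"SELECT {selected}, {distance} AS distance",
        f"FROM {quote_ident(table_name)}",
        f"WHERE {where}",
        order_by_clause,
        f"LIMIT {int(limit)}",
    ]
    # Skip empty fragments (for example, no ORDER BY clause).
    return " ".join(part for part in parts if part)
```

For example, `build_search_query("docs", "embedding <=> $1", columns=ColumnNames(id="doc_id", contents="body"), select=("id", "contents"))` would return only the two renamed columns plus the distance. Note that the distance expression is still supplied by the caller, so it has to reference the actual embedding column name itself.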

lucasgadams commented 1 month ago

@cevian I agree with @srinivas-gurani here; it would be better if these things were a lot more customizable. The library's assumption of a fixed layout for the tables it operates on honestly makes it pretty unusable for many real-world scenarios. It would be great if the different parts were more modular so people could fit them into their existing applications.

lucasgadams commented 1 month ago

I think this library overall takes a much too high-level approach that, while it might make things easy for some people, hurts its usability in a production dev environment. It would be nice if it provided a more isolated, minimal, and clean interface for the things relevant to vectorscale, which is mainly creating the specific indexes and building queries with some tuning knobs for searching. It shouldn't include things like adding where filters, predicates, table schemas, etc. I think a nice library to look to is python pgvector, which provides just the basics plus some nice interfaces for the common Python DB libraries (sqlalchemy, asyncpg, etc.).

I'd like to use this library mainly so I know what the different index parameters are, and so I find out if any of them change or there are enhancements. Instead, I think I will just copy and paste some code, because unfortunately psycopg2 is a pain in the ass to install on a Mac and I would rather not introduce that dependency. Our current setup uses pgvector, asyncpg, and sqlalchemy, and we have had success with that.

I do think the code for automatically adding new embeddings and for general management of embedding tables is nice, but if it is built on the assumption that we structure our tables in a specific way, it is not going to be useful. Just my 2 cents.