Fix inconsistent handling of vector types in from_text method

li-xiu-qi commented 1 month ago

I encountered an error after modifying my code. The original code was:

from sqlalchemy import func, select, and_

def build_vector_search_query(db_model, query_vector, filter_conditions: list, offset: int, limit: int, threshold: float = None):
    columns_to_select = [col for col in db_model.__table__.columns if col.name != 'vector']
    query = select(db_model).add_columns(*columns_to_select)

    if filter_conditions:
        query = query.where(and_(*filter_conditions))

    similarity_score = db_model.vector.op('<=>')(query_vector)
    rank_position = func.row_number().over(order_by=similarity_score.asc()).label('rank_position')
    query = query.add_columns(rank_position).order_by(rank_position).offset(offset).limit(limit)

    if threshold is not None:
        query = query.where(similarity_score >= threshold)

    return query

I modified it to:

from sqlalchemy import func, select, and_

def build_vector_search_query(db_model, query_vector, filter_conditions: list, offset: int, limit: int, threshold: float = None):
    columns_to_select = [col for col in db_model.__table__.columns if col.name != 'vector']
    query = select(db_model).add_columns(*columns_to_select)

    if filter_conditions:
        query = query.where(and_(*filter_conditions))

    similarity_score = db_model.vector.op('<=>')(query_vector).label('similarity_score')  # Changed label to 'similarity_score'
    rank_position = func.row_number().over(order_by=similarity_score.asc()).label('rank_position')  # Changed order_by to similarity_score
    query = query.add_columns(similarity_score, rank_position).order_by(rank_position).offset(offset).limit(limit)  # Added similarity_score to add_columns

    if threshold is not None:
        query = query.where(similarity_score >= threshold)  

    return query

After this change, I encountered an error. The error traceback pointed to the following method:

@classmethod
def from_text(cls, value):
    return cls([float(v) for v in value[1:-1].split(',')])

The error message was:

TypeError: 'float' object is not subscriptable

While debugging, I found that from_text was first called with the vector as a string and later called with a float, which caused the error. I modified the method to handle both cases:

@classmethod
def from_text(cls, value):
    if isinstance(value, float):
        return cls([value])
    elif isinstance(value, str):
        return cls([float(v) for v in value[1:-1].split(',')])

This resolved the error, and the similarity score was returned correctly. However, I am unsure of the underlying cause of this issue.
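The failure can be reproduced without a database. The sketch below uses a hypothetical stand-in class, DemoVector, that copies only the original from_text body; it is not the pgvector implementation, and try_parse is an illustrative helper. Slicing works on the text form of a vector but fails on a float, which matches the traceback above.

```python
class DemoVector:
    """Hypothetical stand-in for the vector type; only from_text is modelled."""

    def __init__(self, values):
        self.values = list(values)

    @classmethod
    def from_text(cls, value):
        # Same body as the original method: assumes value looks like "[1,2,3]".
        return cls([float(v) for v in value[1:-1].split(',')])


def try_parse(value):
    """Return the parsed values, or the error text if parsing fails."""
    try:
        return DemoVector.from_text(value).values
    except TypeError as exc:
        return f"TypeError: {exc}"


parsed_text = try_parse("[1,2,3]")  # text form of a vector column
parsed_float = try_parse(0.25)      # a float, such as a computed distance
```

Here parsed_text holds the three parsed floats, while parsed_float holds the error text, because a float cannot be sliced with value[1:-1].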

Proposed Solution

The issue seems to come from inconsistent types being passed as value to the from_text method. With this change, the method accepts both float and string inputs. This should be sufficient for now, but a more robust solution may require investigating why the type inconsistency occurs.

Please let me know if you need any further details or adjustments.

ankane commented 1 month ago

Hi @li-xiu-qi, thanks for the PR. The error is due to using op('<=>') instead of op('<=>', return_type=Float). I'd recommend using the CosineDistance function instead.

(also, what's labeled as similarity_score is the distance, not the similarity, so the threshold logic should be reversed)
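To illustrate the point about the threshold: cosine distance is 1 minus cosine similarity, so smaller values mean closer vectors and a cutoff keeps rows at or below it. The following is a plain-Python sketch with made-up vectors and hypothetical helper names (cosine_similarity, cosine_distance, within_threshold); it does not use the pgvector or SQLAlchemy APIs.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_distance(a, b):
    """Distance form: 0 for same direction, 1 for orthogonal, 2 for opposite."""
    return 1.0 - cosine_similarity(a, b)


def within_threshold(query, candidates, max_distance):
    """Keep the names of candidates whose distance is at or below the cutoff."""
    return [
        name
        for name, vec in candidates
        if cosine_distance(query, vec) <= max_distance
    ]


query = [1.0, 0.0]
candidates = [
    ("same", [2.0, 0.0]),
    ("orthogonal", [0.0, 3.0]),
    ("opposite", [-1.0, 0.0]),
]
close = within_threshold(query, candidates, max_distance=0.5)
```

With these sample vectors only "same" passes the cutoff. Filtering with >= instead would keep the two least similar candidates, which is the reversal described above.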

li-xiu-qi commented 1 month ago

That makes sense, thank you!

pgvector / pgvector-python

Fix inconsistent handling of vector types in from_text method #93
