redis / redis-py

Redis Python client
MIT License

byte vector is incorrectly decoded as utf-8 string in ft result class #2275

Open AnneYang720 opened 2 years ago

AnneYang720 commented 2 years ago

Version:

$ pip3 show redis
Name: redis
Version: 4.3.4

Platform: Python 3.9.2 on Debian GNU/Linux 11

Description: Byte values in vector search results are converted to strings, and the conversion is lossy. Any value containing bytes that are not valid UTF-8, such as b'\x80', comes back as a wrong string.

Example Code

from redis import Redis
from redis.commands.search.field import VectorField
from redis.commands.search.query import Query

r = Redis(host='localhost', port=6379)
schema = (VectorField("v", "HNSW", {"TYPE": "FLOAT32", "DIM": 1, "DISTANCE_METRIC": "L2"}),)
r.ft().create_index(schema)

r.hset('1', mapping={'v': b'\x80\x00\x00\x00'})

q = Query("*=>[KNN 1 @v $vec AS vector_score]").dialect(2)
results = r.ft().search(q, query_params={"vec": b'\x80\x00\x00\x00'}).docs

for m in results:
    print(m.v)
    print('match emb =', bytes(m.v,'utf-8'))

The original bytes b'\x80\x00\x00\x00' are converted to the string '\x00\x00\x00': the leading byte 0x80 is not valid UTF-8 on its own, so it is silently dropped.
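
The loss can be reproduced without a Redis server. The following is a minimal sketch that applies the same kind of decode call shown under Reason to the stored value; the variable names are illustrative only.

```python
# Reproduce the lossy conversion with plain Python, no Redis needed.
raw = b'\x80\x00\x00\x00'

# errors="ignore" drops every byte that is not valid UTF-8.
decoded = raw.decode("utf-8", "ignore")
print(repr(decoded))  # '\x00\x00\x00'

# Encoding the string again cannot bring the dropped byte back.
round_tripped = decoded.encode("utf-8")
print(round_tripped == raw)  # False
```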

Reason

# /redis/commands/search/result.py
dict(
    dict(
        zip(
            map(to_string, res[i + fields_offset][::2]),
            map(to_string, res[i + fields_offset][1::2]),
        )
    )
)

# /redis/commands/search/_util.py
def to_string(s):
    if isinstance(s, str):
        return s
    elif isinstance(s, bytes):
        return s.decode("utf-8", "ignore")  # here: bytes that are not valid UTF-8 are silently dropped
    else:
        return s

colibrisson commented 1 year ago

@AnneYang720 did you find a workaround?

kamyabzad commented 9 months ago

What about using "backslashreplace" mode instead of "ignore"?
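
For reference, here is a small sketch of what "backslashreplace" would produce for the value from the report. It keeps the information visible, but the result is escape text rather than the original bytes, and it cannot be told apart from a value that really contained that text.

```python
raw = b'\x80\x00\x00\x00'

# The invalid byte becomes the four characters: backslash, x, 8, 0.
escaped = raw.decode("utf-8", "backslashreplace")
print(repr(escaped))
print(len(escaped))  # 7 characters instead of 4 bytes

# Re-encoding yields the escape text, not the original byte.
re_encoded = escaped.encode("utf-8")
print(re_encoded == raw)  # False

# A value that literally contains the text "\x80" decodes to the same string,
# so the caller cannot tell which one was stored.
literal = b'\\x80\x00\x00\x00'
print(literal.decode("utf-8", "backslashreplace") == escaped)  # True
```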

gaoyichuan commented 8 months ago

@kamyabzad I think in this case we should get the original bytes back as the result, rather than attempting any kind of Unicode decoding, since the user may need to convert the value back to a NumPy array or a float array.
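
To illustrate why the raw bytes matter, here is a sketch using only the standard library `struct` module (with NumPy, `numpy.frombuffer` would play the same role). The two-element vector is an arbitrary example, not taken from the report.

```python
import struct

# Pack two float32 values the way a FLOAT32 vector field is stored.
vector = [1.0, -2.5]
raw = struct.pack("<2f", *vector)

# With the raw bytes, the floats are recovered exactly.
assert struct.unpack("<2f", raw) == (1.0, -2.5)

# After a decode with errors="ignore", some bytes are gone for good.
mangled = raw.decode("utf-8", "ignore").encode("utf-8")
print(len(raw), len(mangled))  # 8 6

try:
    struct.unpack("<2f", mangled)
except struct.error as exc:
    print("cannot rebuild vector:", exc)
```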

I don't see a good solution or workaround in the current search result parsing code, though. Maybe we need some ideas from the maintainers.
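
One possible direction, sketched here purely as an illustration: decode field names as before, but let the caller name the fields whose values should stay as bytes. `parse_fields` and `binary_fields` are hypothetical names, not part of redis-py, and the sketch assumes the parser receives the reply as raw bytes in the alternating name/value layout shown under Reason.

```python
def parse_fields(flat, binary_fields=frozenset()):
    """Turn [name1, value1, name2, value2, ...] into a dict.

    Hypothetical helper, not part of redis-py. Values of fields listed in
    binary_fields are returned untouched; everything else is decoded.
    """
    def to_text(s):
        return s.decode("utf-8", "ignore") if isinstance(s, bytes) else s

    names = [to_text(n) for n in flat[::2]]
    values = flat[1::2]
    return {
        name: value if name in binary_fields else to_text(value)
        for name, value in zip(names, values)
    }


# Hand-written stand-in for one document's field list in a search reply.
reply = [b"v", b"\x80\x00\x00\x00", b"vector_score", b"0"]
doc = parse_fields(reply, binary_fields={"v"})
print(doc)  # {'v': b'\x80\x00\x00\x00', 'vector_score': '0'}
```

An explicit list of binary fields avoids guessing from the content: some vectors happen to be valid UTF-8, so a "decode if possible" rule would return a mix of str and bytes for the same field.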