vespa-engine / vespa


Add support for hex representation for mixed tensors in queries #32231

Open · jobergum opened this issue 2 months ago

jobergum commented 2 months ago

We support hex format in document JSON but not for queries.

{
    "put": "id:doc:doc::1",
    "fields": {
        "text": "To transport goods on water, use a boat",
        "embedding": {
            "0": "3DE38E393E638E393EAAAAAB",
            "1": "3EE38E393F0E38E43F2AAAAB",
            "2": "3F471C723F638E393F800000"
        }
    }
}

This is valid, but attempting to send the same format in a query throws a 400 Bad Request:

vespa query 'yql=select * from doc where true' 'ranking=full' 'input.query(qt)={"0":"3DE38E393E638E393EAAAAAB"}'
{ "errors": [
            {
                "code": 3,
                "summary": "Illegal query",
                "message": "Could not set 'ranking.features.query(qt)' to '{\"0\":\"3DE38E393E638E393EAAAAAB\"}': Could not parse '{\"0\":\"3DE38E393E638E393EAAAAAB\"}' as a tensor of type tensor<float>(querytoken{},v[3]): At value position 0: Expected a '[' but got '\"'"
            }
]}
jobergum commented 4 weeks ago

I still experience the same with 8.424.11.

arnej27959 commented 4 weeks ago

use 'input.query(qt)={"0":3DE38E393E638E393EAAAAAB}'

jobergum commented 3 weeks ago

It's IMHO unfortunate that one then needs one format for the JSON feed and a different string format, without quotes around the values, for queries. When I have a dict<string,string>, I now have to write a custom routine to produce the query string instead of just reusing the JSON representation of the dict<string,string>.
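A minimal sketch of such a routine (hypothetical helper name), assuming the unquoted literal form shown in the workaround above, where keys stay quoted but the hex block values do not:

```python
def cells_to_query_literal(cells: dict) -> str:
    # Hypothetical helper: render {"0": "3DE38E39..."} as {"0":3DE38E39...},
    # i.e. the literal query form with unquoted hex values, instead of the
    # JSON representation used when feeding documents.
    return "{" + ",".join(f'"{key}":{block}' for key, block in cells.items()) + "}"

print(cells_to_query_literal({"0": "3DE38E393E638E393EAAAAAB"}))
# {"0":3DE38E393E638E393EAAAAAB}
```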

jobergum commented 3 weeks ago

Snippet from a notebook


import struct
import torch
import numpy as np
from typing import List
from vespa.application import VespaAsync
from vespa.io import VespaQueryResponse
# ScoredDoc is defined elsewhere in the notebook

def binarize_tensor(tensor: torch.Tensor) -> str:
    """
    Binarize a floating-point 1-d tensor by thresholding at zero
    and packing the bits into bytes. Returns the hex string
    representation of the packed bytes.
    """
    if not tensor.is_floating_point():
        raise ValueError("Input tensor must be of floating-point type.")
    bits = np.where(tensor.detach().cpu().numpy() > 0, 1, 0)
    return np.packbits(bits, axis=0).astype(np.int8).tobytes().hex()

def tensor_to_hex_bfloat16(tensor: torch.Tensor) -> str:
    if not tensor.is_floating_point():
        raise ValueError("Input tensor must be of floating-point type.")
    def float_to_bfloat16_hex(f: float) -> str:
        # bfloat16 is the top 16 bits of the float32 encoding; pack
        # big-endian so taking the first two bytes works on any platform.
        packed_float = struct.pack('>f', f)
        bfloat16_bits = struct.unpack('>H', packed_float[:2])[0]
        return format(bfloat16_bits, '04X')
    hex_list = [float_to_bfloat16_hex(float(val)) for val in tensor.flatten()]
    return "".join(hex_list)

async def get_vespa_response(
        embedding: torch.Tensor,
        qid: str,
        session: VespaAsync,
        depth: int = 20,
        profile: str = "float-float") -> List[ScoredDoc]:

    # The query tensor API does not support hex formats yet,
    # so the hex string format below will throw a parse error.
    float_embedding = {index: tensor_to_hex_bfloat16(vector)
                       for index, vector in enumerate(embedding)}
    binary_embedding = {index: binarize_tensor(vector)
                        for index, vector in enumerate(embedding)}
    response: VespaQueryResponse = await session.query(
        yql="select id from pdf_page where true", # brute force search, rank all pages
        ranking=profile,
        hits=5,
        timeout=10,
        body={
            "input.query(qt)" : float_embedding,
            "input.query(qtb)" : binary_embedding,
            "ranking.rerankCount": depth
        }
    )
    assert response.is_successful()
    scored_docs = []
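For reference, the bfloat16 truncation used in the snippet above can be checked stand-alone with just struct (no torch needed), since bfloat16 is the top 16 bits of the float32 encoding:

```python
import struct

def float_to_bfloat16_hex(f: float) -> str:
    # bfloat16 keeps the sign, exponent, and top 7 mantissa bits of
    # float32, i.e. the first two bytes of the big-endian float32 encoding.
    return struct.pack('>f', f)[:2].hex().upper()

print(float_to_bfloat16_hex(1.0))   # 3F80
print(float_to_bfloat16_hex(-2.0))  # C000
```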

This will not work with the custom tensor format in queries, but the same format works for feeding:

vespa_docs = []

for row, embedding in zip(ds, embeddings):
    embedding_full = dict()
    embedding_binary = dict()
    # You can experiment with pooling if you want to reduce the number of embeddings
    #pooled_embedding = pool_embeddings(embedding, pool_factor=2) # reduce the number of embeddings by a factor of 2
    for j, emb in enumerate(embedding):
        embedding_full[j] = tensor_to_hex_bfloat16(emb)
        embedding_binary[j] = binarize_tensor(emb)
    vespa_doc = {
        "id": row['docId'],
        "embedding": embedding_full,
        "binary_embedding": embedding_binary
    }
    vespa_docs.append(vespa_doc)
arnej27959 commented 3 weeks ago

There are many differences between the JSON formats and the "literal form". We can try to smooth over some of these differences, but there's no way to get rid of them all.

bratseth commented 3 weeks ago

Maybe we should support inputting tensors in JSON format somehow?

jobergum commented 3 weeks ago

I understand that not all tensor formats translate to something representable in JSON, but I do think mixed tensors with one mapped dimension and one indexed dimension could be. As it stands, I need two functions: one for feeding and one for queries.
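To make the asymmetry concrete, here is a sketch of the two serializations currently needed for the same mixed-tensor cells (hex blocks shortened for illustration):

```python
import json

cells = {"0": "3DE38E39", "1": "3EE38E39"}  # mapped label -> hex block (shortened)

# Document JSON feed form: plain JSON with quoted hex values.
feed_form = json.dumps(cells)

# Query "literal form" (per the workaround above): keys quoted, hex values unquoted.
query_form = "{" + ",".join(f'"{k}":{v}' for k, v in cells.items()) + "}"

print(feed_form)   # {"0": "3DE38E39", "1": "3EE38E39"}
print(query_form)  # {"0":3DE38E39,"1":3EE38E39}
```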