lior-k / fast-elasticsearch-vector-scoring

Score documents using embedding-vectors dot-product or cosine-similarity with ES Lucene engine
Apache License 2.0

Does not return Documents #8

Closed c-chaitanya closed 6 years ago

c-chaitanya commented 6 years ago

Hi lior,

This is one of the documents I pushed, along with several others, to Elasticsearch version 5.6.0:

PUT /test/test/1
{
  "account_id": 2000000007,
  "question": " i have one number what is the process to add the second number?",
  "answer": " You cannot add any additional numbers ",
  "embedding_vector": "IGkgaGF2ZSBvbmUgbnVtYmVyIHdoYXQgaXMgdGhlIHByb2Nlc3MgdG8gYWRkIHRoZSBzZWNvbmQgbnVtYmVyPw=="
}

The Python (3.6) code I used for the embedding_vector field is as follows:

import base64
import numpy as np

dbig = np.dtype('>f8')  # big-endian float64 dtype, as defined in the README

def stringToBase64(string):  # converts a string's UTF-8 bytes to base64
    return base64.b64encode(bytes(string, 'utf-8'))

def decode_float_list(base64_string):  # converts a base64 string back to a float array
    byte = base64.b64decode(base64_string)
    print("byte is " + str(byte))
    return np.frombuffer(byte, dtype=dbig).tolist()

def encode_array(arr):  # converts a float array to a base64 string
    base64_str = base64.b64encode(np.array(arr).astype(dbig)).decode("utf-8")
    return base64_str

These are the same functions as given in the README, except for the stringToBase64() function.

stringToBase64(" i have one number what is the process to add the second number?")
returns b'IGkgaGF2ZSBvbmUgbnVtYmVyIHdoYXQgaXMgdGhlIHByb2Nlc3MgdG8gYWRkIHRoZSBzZWNvbmQgbnVtYmVyPw=='

decode_float_list("IGkgaGF2ZSBvbmUgbnVtYmVyIHdoYXQgaXMgdGhlIHByb2Nlc3MgdG8gYWRkIHRoZSBzZWNvbmQgbnVtYmVyPw==")
returns [1.4992215195544858e-152, 5.760354975542939e+228, 4.701095635989595e+180, 9.150375480313843e+199, 1.6743793267120413e+243, 1.9402257160408965e+227, 1.3332560325640997e+179, 1.8173709219006215e-152]
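
In other words, decoding that base64 string just reinterprets the question's UTF-8 bytes as big-endian doubles, which is why the values look random; the string was produced from the text itself rather than from a vector of floats. A minimal sketch reproducing those values:

import base64
import numpy as np

dbig = np.dtype('>f8')
raw = base64.b64decode("IGkgaGF2ZSBvbmUgbnVtYmVyIHdoYXQgaXMgdGhlIHByb2Nlc3MgdG8gYWRkIHRoZSBzZWNvbmQgbnVtYmVyPw==")
print(raw)                                      # b' i have one number what is the process to add the second number?'
print(np.frombuffer(raw, dtype=dbig).tolist())  # the same eight huge/tiny doubles as above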

Now when I query using the array returned above, it does not give me back the original document:

POST /test/_search
{
  "query": {
    "function_score": {
      "boost_mode": "replace",
      "script_score": {
        "script": {
          "inline": "binary_vector_score",
          "lang": "knn",
          "params": {
            "cosine": true,
            "field": "embedding_vector",
            "vector": [1.4992215195544858e-152, 5.760354975542939e+228, 4.701095635989595e+180, 9.150375480313843e+199, 1.6743793267120413e+243, 1.9402257160408965e+227, 1.3332560325640997e+179, 1.8173709219006215e-152]
          }
        }
      }
    }
  },
  "size": 100
}

Also, for most of the strings, the decode_float_list() function throws the error below. An example string: " but in case my free trial is over and i have no money to purchase yet can i still receive calls from people?"

Traceback (most recent call last):
  File "<ipython-input-39-619dfb271a22>", line 1, in <module>
    runfile('/home/robot/Desktop/base64.py', wdir='/home/robot/Desktop')
  File "/home/robot/anaconda3/lib/python3.6/site-packages/spyder/utils/site/sitecustomize.py", line 705, in runfile
    execfile(filename, namespace)
  File "/home/robot/anaconda3/lib/python3.6/site-packages/spyder/utils/site/sitecustomize.py", line 102, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)
  File "/home/robot/Desktop/base64.py", line 28, in <module>
    b=decode_float_list(a)
  File "/home/robot/Desktop/base64.py", line 12, in decode_float_list
    return np.frombuffer(byte, dtype=dbig).tolist()
ValueError: buffer size must be a multiple of element size
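
This error happens whenever the decoded bytes are not a multiple of 8 bytes long, so numpy cannot view them as float64 values. A quick check for the example string above:

import base64
import numpy as np

s = " but in case my free trial is over and i have no money to purchase yet can i still receive calls from people?"
decoded = base64.b64decode(base64.b64encode(bytes(s, 'utf-8')))
print(len(decoded) % 8)  # not 0, so np.frombuffer(decoded, dtype=np.dtype('>f8')) raises the ValueError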

My primary guess is that maybe I'm using the wrong encoding to convert the string to base64, and I don't get arrays for all strings (it throws the error above). Can you help me out here?

Regards

c-chaitanya commented 6 years ago

Hey, at last I figured out that using an averaged word2vec vector was the best way to get vectors, and using the functions in the README I could then get the base64 string of that vector.
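
A minimal sketch of that workflow, assuming a pre-trained gensim word2vec model (the model path and the sentence_vector helper below are placeholders, not from the original comments):

import numpy as np
from gensim.models import KeyedVectors

# hypothetical pre-trained word2vec model; the path is a placeholder
w2v = KeyedVectors.load_word2vec_format('word2vec.bin', binary=True)

def sentence_vector(text):
    # average the word2vec vectors of the words that appear in the vocabulary
    words = [w for w in text.split() if w in w2v]
    if not words:
        return np.zeros(w2v.vector_size)
    return np.mean([w2v[w] for w in words], axis=0)

vec = sentence_vector(" i have one number what is the process to add the second number?")
embedding_vector = encode_array(vec)  # encode_array from the README functions above

The key difference from the original attempt is that encode_array is applied to a numeric vector, not to the raw question text.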