This is one of the documents I pushed, along with several others, to Elasticsearch version 5.6.0:
PUT /test/test/1
{
  "account_id": 2000000007,
  "question": " i have one number what is the process to add the second number?",
  "answer": " You cannot add any additional numbers ",
  "embedding_vector": "IGkgaGF2ZSBvbmUgbnVtYmVyIHdoYXQgaXMgdGhlIHByb2Nlc3MgdG8gYWRkIHRoZSBzZWNvbmQgbnVtYmVyPw=="
}
The Python (3.6) code I used for the embedding_vector field is as follows:
import base64
import numpy as np

dbig = np.dtype('>f8')  # big-endian float64, as defined in the readme

def stringToBase64(string):  # converts a string to base64
    return base64.b64encode(bytes(string, 'utf-8'))

def decode_float_list(base64_string):  # converts a base64 string to a list of floats
    byte = base64.b64decode(base64_string)
    print("byte is " + str(byte))
    return np.frombuffer(byte, dtype=dbig).tolist()

def encode_array(arr):  # converts an array back to a base64 string
    base64_str = base64.b64encode(np.array(arr).astype(dbig)).decode("utf-8")
    return base64_str
These are the same functions as given in the readme, except for the stringToBase64() function.
stringToBase64(" i have one number what is the process to add the second number?")
returns b'IGkgaGF2ZSBvbmUgbnVtYmVyIHdoYXQgaXMgdGhlIHByb2Nlc3MgdG8gYWRkIHRoZSBzZWNvbmQgbnVtYmVyPw=='
decode_float_list("IGkgaGF2ZSBvbmUgbnVtYmVyIHdoYXQgaXMgdGhlIHByb2Nlc3MgdG8gYWRkIHRoZSBzZWNvbmQgbnVtYmVyPw==")
returns [1.4992215195544858e-152, 5.760354975542939e+228, 4.701095635989595e+180, 9.150375480313843e+199, 1.6743793267120413e+243, 1.9402257160408965e+227, 1.3332560325640997e+179, 1.8173709219006215e-152]
Now when I query using the returned array, the search does not give me the original document:
POST /test/_search
{
  "query": {
    "function_score": {
      "boost_mode": "replace",
      "script_score": {
        "script": {
          "inline": "binary_vector_score",
          "lang": "knn",
          "params": {
            "cosine": true,
            "field": "embedding_vector",
            "vector":[1.4992215195544858e-152, 5.760354975542939e+228, 4.701095635989595e+180, 9.150375480313843e+199, 1.6743793267120413e+243, 1.9402257160408965e+227, 1.3332560325640997e+179, 1.8173709219006215e-152]
          }
        }
      }
    }
  },
  "size": 100
}
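One thing worth checking here, offered as a guess rather than a confirmed diagnosis: the values in this vector are so large that squaring them overflows a 64-bit float. The NumPy sketch below computes a plain cosine similarity of the vector with itself. It is not the plugin's scoring code (cosine_similarity is a helper written for this illustration), and whether the plugin runs into the same overflow is an assumption.

```python
import numpy as np

# The query vector from the request above.
vector = np.array([
    1.4992215195544858e-152, 5.760354975542939e+228,
    4.701095635989595e+180, 9.150375480313843e+199,
    1.6743793267120413e+243, 1.9402257160408965e+227,
    1.3332560325640997e+179, 1.8173709219006215e-152,
])

def cosine_similarity(a, b):
    """Plain cosine similarity: sum(a*b) / (|a| * |b|), illustration only."""
    # Silence NumPy's overflow/invalid warnings so the result can be inspected.
    with np.errstate(over="ignore", invalid="ignore"):
        dot = np.sum(a * b)
        norms = np.sqrt(np.sum(a * a)) * np.sqrt(np.sum(b * b))
        return dot / norms

score = cosine_similarity(vector, vector)
print(score)
```

A vector compared with itself should score 1.0, but several squared terms here exceed the float64 maximum (about 1.8e308), so the sums become infinite and the ratio is not a number.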
Also, for most strings the decode_float_list() function throws the error below. An example string:
" but in case my free trial is over and i have no money to purchase yet can i still receive calls from people?"
Traceback (most recent call last):
File "<ipython-input-39-619dfb271a22>", line 1, in <module>
runfile('/home/robot/Desktop/base64.py', wdir='/home/robot/Desktop')
File "/home/robot/anaconda3/lib/python3.6/site-packages/spyder/utils/site/sitecustomize.py", line 705, in runfile
execfile(filename, namespace)
File "/home/robot/anaconda3/lib/python3.6/site-packages/spyder/utils/site/sitecustomize.py", line 102, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "/home/robot/Desktop/base64.py", line 28, in <module>
b=decode_float_list(a)
File "/home/robot/Desktop/base64.py", line 12, in decode_float_list
return np.frombuffer(byte, dtype=dbig).tolist()
ValueError: buffer size must be a multiple of element size
My primary guess is that I'm using the wrong encoding to convert the string to base64,
since I don't get arrays for all strings (the error above is thrown).
Can you help me out here?
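A likely cause of the ValueError, sketched under the assumption that dbig is the 8-byte big-endian float dtype from the readme: np.frombuffer can only split the decoded bytes into floats when the byte count is a multiple of 8. Base64-encoding the question text just stores the text's UTF-8 bytes, so decoding only "works" when the text length happens to be a multiple of 8, and the resulting numbers are not an embedding. The helper can_decode_as_floats below is hypothetical and exists only for this check.

```python
import base64
import numpy as np

# Assumption: dbig matches the readme's big-endian float64 dtype.
dbig = np.dtype('>f8')

def can_decode_as_floats(text):
    """Return True if the UTF-8 bytes of `text` split evenly into 8-byte floats."""
    raw = base64.b64decode(base64.b64encode(text.encode('utf-8')))
    return len(raw) % dbig.itemsize == 0

ok_text = " i have one number what is the process to add the second number?"
bad_text = (" but in case my free trial is over and i have no money to "
            "purchase yet can i still receive calls from people?")

for text in (ok_text, bad_text):
    raw = text.encode('utf-8')
    print(len(raw), can_decode_as_floats(text))
    try:
        np.frombuffer(raw, dtype=dbig)
    except ValueError as exc:
        # Raised when len(raw) is not a multiple of 8.
        print("ValueError:", exc)
```

The first question is 64 bytes long, which is why it decoded into exactly 8 floats. The second is not a multiple of 8 bytes, so np.frombuffer rejects it.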
Hey, at last I figured out that averaged word2vec was the best way to get vectors, and using the functions in the readme I could get the base64 string of that vector.
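For anyone landing here later, a minimal sketch of that approach. The toy_vectors table and the average_vector helper are hypothetical stand-ins for a trained word2vec model and its lookup; dbig is assumed to be the readme's big-endian float64 dtype, and encode_array / decode_float_list follow the readme functions quoted above.

```python
import base64
import numpy as np

dbig = np.dtype('>f8')  # assumption: same dtype as in the readme

# Hypothetical 4-dimensional word vectors; a real model would supply these.
toy_vectors = {
    "add": np.array([0.1, 0.3, -0.2, 0.5]),
    "second": np.array([0.0, -0.1, 0.4, 0.2]),
    "number": np.array([0.3, 0.2, 0.1, -0.4]),
}

def average_vector(sentence, vectors, dim=4):
    """Average the vectors of the known words; zeros if none are known."""
    words = [w for w in sentence.lower().split() if w in vectors]
    if not words:
        return np.zeros(dim)
    return np.mean([vectors[w] for w in words], axis=0)

def encode_array(arr):
    """Convert a float array to a base64 string."""
    return base64.b64encode(np.array(arr).astype(dbig).tobytes()).decode("utf-8")

def decode_float_list(base64_string):
    """Convert a base64 string back to a list of floats."""
    return np.frombuffer(base64.b64decode(base64_string), dtype=dbig).tolist()

vec = average_vector("add the second number", toy_vectors)
encoded = encode_array(vec)          # value for the embedding_vector field
decoded = decode_float_list(encoded) # value for the query's "vector" param
print(encoded, decoded)
```

Because the vector is a real float array, its byte length is always a multiple of 8, so decode_float_list no longer fails.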