Closed: xuwei6 closed this issue 6 months ago
Hey @xuwei6 ,
Important
The computations are done in fp16, so a difference of np.abs(emb_1 - emb_2) < 0.001
is expected. This will not impact search quality.
Note:
Don't use threading - please use async / await
as in the docs. You might do so anyway, but your threading example confuses me.
Thanks for your reply. I use multithreading to post requests to the infinity embedding server:
import concurrent.futures
import requests

def embedding_post(query):
    response = requests.post("http://xxxx:xxxx/embeddings", json={"input": query, "model": EMBEDDING_MODEL_PATH})
    return [d['embedding'] for d in response.json()['data']]

res = []
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    tasks = [executor.submit(embedding_post, q) for q in queries]
    for future in concurrent.futures.as_completed(tasks):
        res.append(future.result())
The embeddings in res vary for the same query,
but when I use a plain loop instead of multithreading, the embeddings are identical for the same query.
What do you mean by random? Do you mean the same vector, but non-deterministic?
They should have an l1-distance < 0.001?
Yes, you are right: the l1-distance is < 0.001 when I use FP16, and < 0.0000001 when I export INFINITY_DISABLE_HALF='True'.
But if I don't use multithreading, all distance deviations are 0.0.
Yeah, if you use a different GPU / CUDA / AMD setup you will see larger differences. On the same system, a deterministic input is deterministic. As you send 5 requests in parallel, they are batched in random order, which introduces a small randomized factor.
This is COMMON across all inference libraries, but thanks for raising the concern.
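The effect can be simulated with NumPy alone: computing the same fp16 matrix-vector product in one pass versus two partial accumulations changes where intermediate results are rounded to fp16, which produces a tiny deviation. The weight matrix, sizes, and split below are illustrative, not Infinity's actual code:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((768, 768)).astype(np.float16)  # hypothetical weight matrix
x = rng.standard_normal(768).astype(np.float16)         # hypothetical input

# One pass: a single fp16 matmul.
e1 = (W @ x).astype(np.float32)

# "Batched" pass: mathematically the same product, but the accumulation
# is split in two, so intermediates are rounded to fp16 in a different order.
e2 = (W[:, :384] @ x[:384] + W[:, 384:] @ x[384:]).astype(np.float32)

# Normalize, as embedding models typically do.
e1 /= np.linalg.norm(e1)
e2 /= np.linalg.norm(e2)

print(np.abs(e1 - e2).max())  # small but typically nonzero
```

The max deviation stays far below the 0.001 threshold discussed above, which is why it does not affect search quality.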
When I use
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
to get embeddings, I find the output is occasionally slightly different given the same input.