michaelfeil / infinity

Infinity is a high-throughput, low-latency REST API for serving text embeddings, reranking models, and CLIP.
https://michaelfeil.github.io/infinity/
MIT License

The embeddings are random when using multithreaded requests #163

Closed xuwei6 closed 6 months ago

xuwei6 commented 6 months ago

When I use with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor: to get embeddings, I found that the output embeddings are occasionally slightly different for the same input.

michaelfeil commented 6 months ago

Hey @xuwei6 ,

Important: the computations are done in fp16, so a difference of np.abs(emb_1 - emb_2) < 0.001 is expected. This will not impact search quality.

Note: don't use threading; please use async / await as in the docs. You might have reasons to do so anyway, but I am confused about your threading example.
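
For illustration, a minimal async sketch of that pattern, assuming the httpx library (a substitution for whatever client the docs show); the server URL and model name are placeholders:

    import asyncio
    import httpx

    async def embed(client, query):
        # one POST to the /embeddings endpoint; URL and model are placeholders
        response = await client.post(
            "http://xxxx:xxxx/embeddings",
            json={"input": query, "model": "your-model-id"},
        )
        return [d["embedding"] for d in response.json()["data"]]

    async def main(queries):
        async with httpx.AsyncClient() as client:
            # all requests are issued concurrently; the server batches them internally
            return await asyncio.gather(*(embed(client, q) for q in queries))

    # embeddings = asyncio.run(main(["same query", "same query"]))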

xuwei6 commented 6 months ago

Thanks for your reply. I use multithreading to post requests to the Infinity embedding server:

    import concurrent.futures
    import requests

    # EMBEDDING_MODEL_PATH and queries are defined elsewhere
    def embedding_post(query):
        response = requests.post('http://xxxx:xxxx/embeddings',
                                 json={"input": query, "model": EMBEDDING_MODEL_PATH})
        return [d['embedding'] for d in response.json()['data']]

    res = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        tasks = []
        for q in queries:
            tasks.append(executor.submit(embedding_post, q))
        for future in concurrent.futures.as_completed(tasks):
            res.append(future.result())

The embeddings in res are random for the same query, but when I use a loop instead of multithreading, the embeddings are unchanged for the same query.

michaelfeil commented 6 months ago

What do you mean by random? Do you mean the same vector, but non-deterministic?

They should have an L1 distance < 0.001?

xuwei6 commented 6 months ago

Yes, you are right: the L1 distance is < 0.001 when I use fp16, and < 0.0000001 when I export INFINITY_DISABLE_HALF='True'. But if I do not use multithreading, the distance deviation is always 0.0.
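
For reference, a small numpy sketch of the comparison being described (emb_1 and emb_2 stand for two embeddings of the same query, e.g. one from the loop run and one from the threaded run):

    import numpy as np

    def max_deviation(emb_1, emb_2):
        # maximum per-dimension absolute difference between two embeddings
        return np.abs(np.asarray(emb_1) - np.asarray(emb_2)).max()

    # observed in this thread: < 0.001 in fp16 under concurrent requests,
    # < 0.0000001 with INFINITY_DISABLE_HALF='True', and 0.0 for sequential requests
    # print(max_deviation(emb_1, emb_2))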

michaelfeil commented 6 months ago

Yeah, if you use a different GPU / CUDA / AMD system, you will see larger differences. On the same system, a deterministic input yields a deterministic output. As you send 5 requests in parallel, they are batched in random order, which leads to a small randomized factor.
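
For intuition, a tiny sketch of why accumulation order matters at reduced precision (assuming numpy; the values are arbitrary):

    import numpy as np

    a = np.float16(0.1)
    b = np.float16(1000.0)
    c = np.float16(-1000.0)

    # floating-point addition is not associative, so a different batching /
    # reduction order can change the low bits of the result
    print((a + b) + c)  # 0.0  -- a is swallowed when added to 1000 at fp16 precision
    print(a + (b + c))  # ~0.1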

This is COMMON across all inference libraries, but thanks for raising the concern.