jkablan opened this issue 2 months ago
Yes, I faced a similar situation, since Ollama does not support concurrent requests. To work around this, I started multiple Ollama containers and distributed the embedding requests across them in a round-robin manner.
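For illustration, here is a minimal sketch of that kind of round-robin dispatch. This is not the commenter's actual code; it assumes three containers listening on localhost ports 11434-11436 (placeholders) and Ollama's standard `/api/embeddings` endpoint:

```python
import itertools
import requests

# Hypothetical setup: one Ollama container per port,
# e.g. started with `docker run -p 11434:11434 ...`, `-p 11435:11434 ...`, etc.
OLLAMA_HOSTS = itertools.cycle([
    "http://localhost:11434",
    "http://localhost:11435",
    "http://localhost:11436",
])

def embed(text: str, model: str = "nomic-embed-text") -> list[float]:
    # Pick the next container in round-robin order and send one request to it.
    host = next(OLLAMA_HOSTS)
    resp = requests.post(
        f"{host}/api/embeddings",
        json={"model": model, "prompt": text},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["embedding"]
```

Note that calling `embed()` from several threads (roughly one per container) is what actually buys concurrency; a single-threaded round-robin only spreads load across containers, it doesn't overlap requests.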
I'm having the same issue: Ollama took more than 20 hours to generate embeddings for 190K texts using 'nomic-embed-text'. Now I want to generate embeddings for the same texts using llama3, but I'm worried it will take forever. Can we run it on a GPU, run it in batches/parallel, or speed it up some other way? Did you resolve this issue or come up with a workaround?
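On the batching/parallel idea: one option is to fan the texts out over a thread pool. This is only a sketch, not a tested fix; the host, model name, and worker count are assumptions:

```python
from concurrent.futures import ThreadPoolExecutor

import requests

def embed_one(text: str, host: str = "http://localhost:11434") -> list[float]:
    # One embedding request against Ollama's standard REST endpoint.
    resp = requests.post(
        f"{host}/api/embeddings",
        json={"model": "nomic-embed-text", "prompt": text},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["embedding"]

def embed_parallel(texts: list[str], max_workers: int = 4) -> list[list[float]]:
    # pool.map preserves input order, so embeddings line up with texts.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(embed_one, texts))
```

Keep in mind that a single Ollama 0.1.x instance processes requests serially, so the thread pool pays off mainly when each worker targets a different container, as in the comment above.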
Checked other resources
Example Code
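The original snippet is not preserved here; based on the Description below, it defined `main()` and `testOllamaSpeed()` around langchain's `OllamaEmbeddings`. A sketch along those lines (the sample texts, loop counts, and timing output are illustrative, not the author's exact code):

```python
import time

from langchain_community.embeddings import OllamaEmbeddings

embeddings = OllamaEmbeddings(model="nomic-embed-text")

def testOllamaSpeed():
    # Time individual embed_query calls to expose per-call latency.
    for i in range(10):
        start = time.perf_counter()
        embeddings.embed_query(f"some sample text number {i}")
        print(f"call {i}: {(time.perf_counter() - start) * 1000:.0f} ms")

def main():
    # Embed a batch of documents and report total wall-clock time.
    texts = [f"some sample text number {i}" for i in range(100)]
    start = time.perf_counter()
    embeddings.embed_documents(texts)
    print(f"{len(texts)} docs in {time.perf_counter() - start:.1f} s")

if __name__ == "__main__":
    main()
```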
Error Message and Stack Trace (if applicable)
n/a
Description
Calls to the Ollama embeddings API are very slow (1000 to 2000 ms), and GPU utilization is very low, spiking to 30-100% only once every second or two. This happens whether I run main() or testOllamaSpeed() in the example code, which would suggest the problem is with Ollama. But if I run the following code, which does not use any langchain imports, each call completes in 200-300 ms and GPU utilization hovers at a consistent 70-80%.

The problem is even more pronounced with mxbai-embed-large: the example code takes 1000 to 2000 ms per call, while the code below takes ~50 ms per call. VRAM usage never rises above ~4 GB (about 25% of my total VRAM).
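The exact langchain-free snippet isn't preserved in this copy of the report; a minimal equivalent timing loop, assuming Ollama's standard `/api/embeddings` endpoint on the default port (the loop count and sample text are placeholders), would look like this:

```python
import time

import requests

URL = "http://localhost:11434/api/embeddings"

for i in range(10):
    start = time.perf_counter()
    resp = requests.post(
        URL,
        json={"model": "nomic-embed-text", "prompt": f"some sample text number {i}"},
        timeout=60,
    )
    resp.raise_for_status()
    # Print per-call latency to compare against the langchain path.
    print(f"call {i}: {(time.perf_counter() - start) * 1000:.0f} ms")
```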
For reference, my environment is:

- Windows 11
- 12th Gen Intel i9-1250HX
- 128 GB RAM
- NVIDIA RTX A4500 Laptop GPU, 16 GB VRAM
- Ollama 0.1.38
System Info
langchain==0.2.0
langchain-chroma==0.1.1
langchain-community==0.2.0
langchain-core==0.2.0
langchain-text-splitters==0.2.0