Open qinst64 opened 3 months ago
I have a very long prompt, and Ollama runs on a remote server. When sending requests over HTTP with ollama-js, is compression (e.g. gzip) already applied so that transfer speed is optimal?

HTTP has no standard negotiation for request compression: a client can send a gzip-compressed body with a `Content-Encoding: gzip` header, but it has no way of knowing in advance whether the server can decompress it, so support has to be arranged out of band. With an API, compression support can simply be assumed by convention, but it would still have to be implemented in the API server first. The Ollama server is written in Go, which I'm not familiar with, so I can't confirm for certain that it doesn't already accept compressed requests, but I doubt it.
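For concreteness, here is roughly what client-side request compression would look like from Node, using the built-in `zlib` module and the global `fetch`. This is a sketch only: it assumes a server willing to decompress a gzip request body, which, as noted above, the Ollama server most likely is not (the model name and prompt are placeholders).

```ts
// Sketch: manually gzip the request body and declare it via Content-Encoding.
// Assumes a hypothetical server that decompresses gzip request bodies --
// the Ollama server almost certainly does not do this today.
import { gzipSync } from "node:zlib";

const body = JSON.stringify({
  model: "llama3.1",               // placeholder model
  prompt: "a very long prompt...", // placeholder prompt
  stream: false,
});

const compressed = gzipSync(Buffer.from(body));

const res = await fetch("http://localhost:11434/api/generate", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    "Content-Encoding": "gzip", // the request body itself is compressed
  },
  body: compressed,
});

console.log(res.status); // expect an error unless the server understands gzip bodies
```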
Compressing responses would be much easier to implement (at least when the response isn't being streamed), but again, this would require changes to the Ollama server rather than to ollama-js. I tested this, and the Ollama API does not currently appear to compress its responses, though there may be specific circumstances where it does that I didn't hit.
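For anyone who wants to repeat that check, here is a minimal sketch using Node's built-in `http` module. It explicitly offers gzip via `Accept-Encoding` and prints whether the reply carries a `Content-Encoding` header; if none comes back, the server answered uncompressed. The host and port are Ollama's defaults, and the payload is a placeholder.

```ts
// Probe whether the server compresses responses when the client offers gzip.
import http from "node:http";

const payload = JSON.stringify({ model: "llama3.1", prompt: "Say hi", stream: false });

const req = http.request(
  {
    host: "localhost",
    port: 11434, // Ollama's default port
    path: "/api/generate",
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "Accept-Encoding": "gzip", // invite the server to compress the response
      "Content-Length": Buffer.byteLength(payload),
    },
  },
  (res) => {
    // Absent header => the response body came back uncompressed.
    console.log("content-encoding:", res.headers["content-encoding"] ?? "(none)");
    res.resume(); // drain the body; only the header matters here
  }
);

req.end(payload);
```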
As for whether implementing it would be worthwhile: generation time is going to dwarf transfer time. Even with an upload speed of just 1 Mb/s, sending a prompt that fills Llama 3.1's 128K-token context window takes only a few seconds. Even on hardware dedicated to AI inference (e.g. Groq), generating a response to a request of that size takes far longer than that, and Ollama is aimed at consumer hardware. So while compression could be beneficial, I don't think it's really a concern at the moment.
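Back-of-envelope, under the assumptions of roughly 4 bytes per token of English text and Llama 3.1's 128K-token context window:

```ts
// Rough transfer-time estimate (all figures are assumptions).
const tokens = 128_000;             // Llama 3.1 context window
const bytesPerToken = 4;            // rough average for English text
const uploadBitsPerSec = 1_000_000; // a 1 Mbit/s upload link

const promptBytes = tokens * bytesPerToken;           // ~512 KB
const seconds = (promptBytes * 8) / uploadBitsPerSec; // ~4.1 s

console.log(`~${seconds.toFixed(1)} s to upload a full-context prompt`);
```

Even quadrupling the bytes-per-token estimate only pushes this to roughly 16 seconds, which is still small next to the time it takes to process and answer a 128K-token prompt.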