kevinthedang closed this issue 2 months ago.
Looks like, as of May, the developers made it possible to build this feature (see below).
Another approach, mentioned back on Jan. 31st, is still possible and I believe was discussed above: the use of proxies can stay on the table if needed.
With Ollama v0.2.0, concurrency and parallel generation are possible for the bot.
Likely no implementation is needed on our end, but that will need to be tested (a quick test sketch is below). We can likely close this after #82 is resolved.
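For reference, a minimal test sketch (not the bot's actual code): it fires two generation requests at a local Ollama server at the same time, assuming the default port and Node 18+; the model name and prompts are placeholders.

```ts
// Quick concurrency check against a local Ollama v0.2.0+ server.
// Assumes the default port 11434; model name and prompts are placeholders.
async function generate(prompt: string): Promise<string> {
  const res = await fetch('http://localhost:11434/api/generate', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ model: 'llama3', prompt, stream: false }),
  })
  const data = (await res.json()) as { response: string }
  return data.response
}

// Fire both requests at once; with parallel generation enabled they should
// be processed simultaneously instead of queuing one behind the other.
const [a, b] = await Promise.all([
  generate('Explain concurrency in one sentence.'),
  generate('Explain parallelism in one sentence.'),
])
console.log(a, b)
```

If both responses stream back at the same time (rather than the second starting only after the first finishes), parallel generation is working.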
@JT2M0L3Y
Looks like concurrency works as intended out of the box.
Discord: (screenshot omitted)
Logging of the two conversations generating simultaneously: (log output omitted)
This can be closed with #82 now
Something I did not read about initially when 0.2.0 was released: we might need some kind of implementation that allows users to select:

- `OLLAMA_MAX_LOADED_MODELS` - how many models are allowed to be loaded at a given time.
- `OLLAMA_NUM_PARALLEL` - how many concurrent requests each model can handle.

Might be an issue we should create as a new feature. Possibly done through Slash Commands? A rough sketch of that idea is below.
Thread Reference: Parallel Requests
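If we go the Slash Command route, a rough sketch of what the command definition might look like, using discord.js builders. The command and option names here are hypothetical, and note that both settings are environment variables on the Ollama server, so the bot would still need a way to apply them to the Ollama process:

```ts
import { SlashCommandBuilder } from 'discord.js'

// Hypothetical command for tuning Ollama's concurrency settings.
// The options mirror OLLAMA_MAX_LOADED_MODELS and OLLAMA_NUM_PARALLEL;
// since those are server-side environment variables, applying new values
// would require reconfiguring/restarting the Ollama server process.
export const concurrencyCommand = new SlashCommandBuilder()
  .setName('concurrency')
  .setDescription('Configure Ollama parallelism settings')
  .addIntegerOption(option =>
    option
      .setName('max-loaded-models')
      .setDescription('OLLAMA_MAX_LOADED_MODELS: models loaded at once')
      .setMinValue(1))
  .addIntegerOption(option =>
    option
      .setName('num-parallel')
      .setDescription('OLLAMA_NUM_PARALLEL: concurrent requests per model')
      .setMinValue(1))
```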
@JT2M0L3Y