hibobmaster / matrix_chatgpt_bot

A simple matrix bot that supports image generation and chatting using ChatGPT, Langchain
https://matrix.to/#/#public:matrix.qqs.tw
MIT License

Some Feature Requests, STT, TTS, custom commands #49

Open mwnu opened 2 months ago

mwnu commented 2 months ago

Thank you, @hibobmaster, for developing this incredible project. Would it be possible to add some features? Text-to-speech, speech-to-text, and custom commands that use different prompts and agents would be perfect.

hibobmaster commented 2 months ago

STT: https://github.com/hibobmaster/matrix-stt-bot

It's not perfect, but it may meet your needs.

Different prompts and agents: https://github.com/hibobmaster/matrix_chatgpt_bot/wiki/Langchain-(flowise)

Related: https://github.com/hibobmaster/matrix_chatgpt_bot/issues/36

mwnu commented 2 months ago

matrix-stt-bot

The matrix-stt-bot is great, but it can only transcribe and does not support voice dialogue. Flowise is a bit complex, and unfortunately, it also does not support voice features.

hibobmaster commented 2 months ago

Voice dialogue: do you mean the TTS function?

For custom commands, this is the entrypoint:

https://github.com/hibobmaster/matrix_chatgpt_bot/blob/81543d561b46df4158892324172b5145e44f0e32/src/bot.py#L241

It's hard to maintain new commands at runtime.

So which custom commands do you need?

mwnu commented 2 months ago

Voice dialogue: do you mean the TTS function?

For custom commands, this is the entrypoint:

https://github.com/hibobmaster/matrix_chatgpt_bot/blob/81543d561b46df4158892324172b5145e44f0e32/src/bot.py#L241

It's hard to maintain new commands at runtime. So which custom commands do you need?

Voice dialogue combines speech-to-text (STT) and text-to-speech (TTS). The user speaks a message, and the bot answers in voice: it first generates a text reply, which TTS then converts to speech. Another common approach is to show a widget on each message that, when clicked, reads the message text aloud. However, Matrix does not have this feature (Matrix spec 1.4 includes MSCs for widgets, but it seems no service has implemented them yet). Asking the bot to generate voice the way it handles images, by quoting a message and tagging the bot, would be inconvenient, because voice dialogue is usually used when typing is not feasible. Outputting voice directly in the conversation is therefore more appropriate, and it might be clearer to post two or three messages together: one with the user's speech converted to text, one with the AI-generated text, and one with the TTS audio.
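As a rough illustration of that flow, here is a minimal Python sketch. It is not code from matrix_chatgpt_bot: `handle_voice_message`, `OutgoingMessage`, and the three backend callables (`transcribe`, `chat`, `synthesize`) are hypothetical placeholders for whatever STT, LLM, and TTS services would actually be used, and sending the results to a Matrix room is left out.

```python
from dataclasses import dataclass
from typing import Callable, List, Union


@dataclass
class OutgoingMessage:
    kind: str                 # "text" or "audio"
    body: Union[str, bytes]   # str for text messages, bytes for audio


def handle_voice_message(
    audio: bytes,
    transcribe: Callable[[bytes], str],   # hypothetical STT backend
    chat: Callable[[str], str],           # hypothetical LLM backend
    synthesize: Callable[[str], bytes],   # hypothetical TTS backend
) -> List[OutgoingMessage]:
    """Turn one incoming voice message into the three replies described above."""
    user_text = transcribe(audio)         # 1. user's speech -> text
    reply_text = chat(user_text)          # 2. AI-generated text reply
    reply_audio = synthesize(reply_text)  # 3. text reply -> speech
    return [
        OutgoingMessage("text", f"You said: {user_text}"),
        OutgoingMessage("text", reply_text),
        OutgoingMessage("audio", reply_audio),
    ]


if __name__ == "__main__":
    # Stub backends so the sketch runs without any external service.
    messages = handle_voice_message(
        b"fake-audio",
        transcribe=lambda a: "hello",
        chat=lambda t: f"echo: {t}",
        synthesize=lambda t: t.encode("utf-8"),
    )
    for message in messages:
        print(message.kind, message.body)
```

Passing the backends in as callables keeps the ordering logic separate from any particular STT or TTS provider, so the same function could sit behind a different service without changes.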

One can also envision a voice-call scenario, similar to how the ChatGPT and Copilot mobile apps work, with no text interaction at all. The program automatically detects pauses in the user's speech (some third-party clients, such as LobeChat, have implemented this) and then responds with voice.
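To make the pause-detection idea concrete, here is a deliberately simple Python sketch based on an energy threshold. This is an assumption about one possible approach, not how LobeChat or any other client actually does it; the function names, frame size, and threshold values are all hypothetical, and a real implementation would more likely use a dedicated voice activity detector.

```python
from typing import Sequence


def frame_rms(frame: Sequence[float]) -> float:
    """Root-mean-square amplitude of one frame of samples in [-1.0, 1.0]."""
    return (sum(s * s for s in frame) / len(frame)) ** 0.5


def ends_with_pause(
    samples: Sequence[float],
    frame_size: int = 160,            # assumed: 10 ms at 16 kHz
    silence_threshold: float = 0.01,  # assumed RMS level treated as silence
    min_silent_frames: int = 5,       # assumed: 50 ms of trailing silence
) -> bool:
    """Return True if the last `min_silent_frames` full frames are all quiet."""
    n_frames = len(samples) // frame_size
    if n_frames < min_silent_frames:
        return False  # not enough audio yet to decide
    for i in range(n_frames - min_silent_frames, n_frames):
        frame = samples[i * frame_size:(i + 1) * frame_size]
        if frame_rms(frame) >= silence_threshold:
            return False  # still speaking in one of the trailing frames
    return True
```

A caller would append incoming audio to a buffer and, once `ends_with_pause` returns True, hand the buffered speech to the STT step.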

Of course, this would involve extensive coding work. I am eager to participate in this project, but unfortunately, I am not familiar with Python, which makes it difficult for me to understand the entire project. Maybe when I have time, I will study it more thoroughly.

mwnu commented 2 months ago

which custom commands do you need?

Here's another idea: I envision a default dialogue model, with custom commands that temporarily switch to other models, such as !g35 (gpt-3.5) or !c3g (claude-3-opus-20240229). Does this project support models from providers other than OpenAI? With a base URL proxy (e.g., one-api), it should be possible to support models from multiple vendors on a single platform, though I haven't tested this yet. The same idea extends to temporarily switching to other agents, such as !ss for web searches or !rag to consult one's own knowledge base.
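A minimal Python sketch of that kind of per-message switch might look like the following. It is only an illustration, not the project's command handling: `route_message`, `MODEL_ALIASES`, and `DEFAULT_MODEL` are hypothetical names, the default model string is a placeholder, and mapping !g35 to `gpt-3.5-turbo` is an assumption about which model "gpt-3.5" refers to.

```python
from typing import Tuple

# Placeholder: whatever model the bot is configured to use by default.
DEFAULT_MODEL = "default-model"

# Hypothetical alias table; the same pattern could map !ss or !rag to agents.
MODEL_ALIASES = {
    "!g35": "gpt-3.5-turbo",           # assumed meaning of "gpt-3.5"
    "!c3g": "claude-3-opus-20240229",
}


def route_message(text: str) -> Tuple[str, str]:
    """Return (model, prompt) for a single incoming message.

    A known alias followed by a prompt selects that model for this message
    only; anything else goes to the default model unchanged.
    """
    stripped = text.strip()
    command, _, rest = stripped.partition(" ")
    prompt = rest.strip()
    if command in MODEL_ALIASES and prompt:
        return MODEL_ALIASES[command], prompt
    return DEFAULT_MODEL, stripped


if __name__ == "__main__":
    print(route_message("!g35 summarize this thread"))
    print(route_message("just a normal question"))
```

Because the alias table is plain data, it could be loaded from a config file at startup, which avoids having to add new commands to the bot's code while it is running.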