voxos-ai / bolna

End-to-end platform for building voice first multimodal agents
MIT License
396 stars 113 forks

Error receiving Twilio response from the synthesizer #52

Closed: shibasish closed this issue 10 months ago

shibasish commented 10 months ago

This is our configuration:

{"assistant_name":"sample_agent","assistant_type":"other","tasks":[{"tools_config":{"llm_agent":{"streaming_model":"gpt-3.5-turbo-16k","classification_model":"gpt-4","max_tokens":100,"agent_flow_type":"streaming","use_fallback":false,"family":"openai","temperature":0.1,"request_json":false,"langchain_agent":false,"extraction_details":null,"extraction_json":null},"synthesizer":{"provider":"elevenlabs","provider_config":{"voice":"Myra","voice_id":"hNJixpY16PhIcW2MXauQ","model":""},"stream":true,"buffer_size":40,"audio_format":"mp3"},"transcriber":{"model":"deepgram","language":"en","stream":true,"sampling_rate":16000,"encoding":"linear16","endpointing":400},"input":{"provider":"twilio","format":"pcm"},"output":{"provider":"twilio","format":"pcm"},"api_tools":null},"toolchain":{"execution":"parallel","pipelines":[["transcriber","llm","synthesizer"]]},"task_type":"conversation"}]}

I am able to receive the response from the synthesizer; however, the audio that reaches Twilio is just noise. We tried both stream:true and stream:false for the synthesizer. We stored the synthesizer output as an mp3 file and its content is OK, but Twilio does not play the response correctly.

marmikcfc commented 10 months ago

Hey @shibasish, Twilio needs the sampling rate to be 8000 Hz (8 kHz). I made a change in the recent commit to explicitly override the sample rate to 8000 Hz when the provider is Twilio. Can you pull the latest code and confirm whether it works?
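A minimal sketch of the 16000 Hz to 8000 Hz conversion described here, assuming the input is 16-bit little-endian mono PCM. The function name `downsample_16k_to_8k` is hypothetical and not part of bolna, and averaging adjacent sample pairs is only a crude stand-in for a proper resampler such as `audioop.ratecv` (note that `audioop` was removed from the standard library in Python 3.13).

```python
import struct


def downsample_16k_to_8k(pcm16: bytes) -> bytes:
    """Halve the sample rate of 16-bit little-endian mono PCM.

    Hypothetical helper: each output sample is the average of two
    adjacent input samples, which halves the rate (16000 Hz -> 8000 Hz).
    """
    sample_count = len(pcm16) // 2           # whole 2-byte samples only
    samples = struct.unpack(f"<{sample_count}h", pcm16[: sample_count * 2])
    pair_count = sample_count // 2           # drop a trailing unpaired sample
    halved = [
        (samples[2 * i] + samples[2 * i + 1]) // 2
        for i in range(pair_count)
    ]
    return struct.pack(f"<{pair_count}h", *halved)


# One second of 16 kHz silence becomes one second of 8 kHz silence.
one_second_16k = b"\x00\x00" * 16000
one_second_8k = downsample_16k_to_8k(one_second_16k)
```

The output holds half as many samples as the input, so playing it at 8000 Hz takes the same time as playing the input at 16000 Hz.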

shibasish commented 10 months ago

Hi @marmikcfc, yeah, I tried both the previous code and the latest pull. Unfortunately, I am still getting the same error.

marmikcfc commented 10 months ago

Noise would indicate some issue with the sampling rate. Can you confirm that the audio bytes are getting converted to 8000 Hz in this line?

Also, can you try converting synthesizer format to pcm?

agokrani commented 10 months ago

Hi @marmikcfc, we tried converting the synthesizer format to pcm, but that didn't help. To narrow down the issue, we saved the output audio (mp3) to a file inside the handle function of TwilioOutputHandler, and it was fine up to that point. Something goes wrong in this part of the code:

    try:
        audio_chunk = ws_data_packet.get('data')
        meta_info = ws_data_packet.get('meta_info')
        self.stream_sid = meta_info.get('stream_sid', None)

        try:
            if self.current_request_id == meta_info['request_id']:
                if len(audio_chunk) == 1:
                    audio_chunk += b'\x00'

            if audio_chunk and self.stream_sid and len(audio_chunk) != 1:

                audio = audioop.lin2ulaw(audio_chunk, 2)
                base64_audio = base64.b64encode(audio).decode("utf-8")
                message = {
                    'event': 'media',
                    'streamSid': self.stream_sid,
                    'media': {
                        'payload': base64_audio
                    }
                }

                await self.websocket.send_text(json.dumps(message))

As soon as the data goes through this part of the code, it's just noise. Also, when streaming with ElevenLabs, the chunks are not getting processed.
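For illustration, here is a sketch of the encoding step in plain Python, assuming the chunks are already 8000 Hz, 16-bit little-endian mono PCM. It is not the fix that was merged. `TwilioMediaEncoder` and `linear_to_ulaw` are hypothetical names, and the carry-over buffer reflects an unconfirmed guess that streamed chunks can end in the middle of a 2-byte sample, which a width-2 μ-law conversion cannot handle.

```python
import base64
import json

ULAW_BIAS = 0x84    # standard G.711 mu-law bias
ULAW_CLIP = 32635   # largest magnitude before the bias is added


def linear_to_ulaw(sample: int) -> int:
    """Encode one signed 16-bit PCM sample as one G.711 mu-law byte."""
    sign = 0x80 if sample < 0 else 0x00
    magnitude = min(abs(sample), ULAW_CLIP) + ULAW_BIAS
    exponent, mask = 7, 0x4000
    while exponent > 0 and not (magnitude & mask):
        exponent -= 1
        mask >>= 1
    mantissa = (magnitude >> (exponent + 3)) & 0x0F
    return ~(sign | (exponent << 4) | mantissa) & 0xFF


class TwilioMediaEncoder:
    """Hypothetical helper that keeps samples aligned across chunks."""

    def __init__(self, stream_sid: str):
        self.stream_sid = stream_sid
        self._carry = b""   # leftover byte from an odd-length chunk

    def encode(self, chunk: bytes):
        """Return a JSON 'media' message, or None if no whole sample yet."""
        data = self._carry + chunk
        usable = len(data) - (len(data) % 2)
        self._carry = data[usable:]          # keep the dangling byte, if any
        if usable == 0:
            return None
        ulaw = bytes(
            linear_to_ulaw(int.from_bytes(data[i:i + 2], "little", signed=True))
            for i in range(0, usable, 2)
        )
        return json.dumps({
            "event": "media",
            "streamSid": self.stream_sid,
            "media": {"payload": base64.b64encode(ulaw).decode("utf-8")},
        })


encoder = TwilioMediaEncoder("example-stream-sid")   # placeholder stream SID
first = encoder.encode(b"\x00")             # half a sample: nothing to send
second = encoder.encode(b"\x00\x00\x00")    # completes two silent samples
```

Unlike padding a lone byte with `b'\x00'`, carrying the byte into the next chunk keeps every later sample on its original 2-byte boundary.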

prateeksachan commented 10 months ago

Hi @agokrani, @shibasish, I've just pushed a fix for this (https://github.com/bolna-ai/bolna/pull/53). Please let us know if it's working fine now.

In case it doesn't, please feel free to book a time at https://calendly.com/bolna/30min so we can go over it on a call. We would also love to know how you're using bolna. Thanks.

shibasish commented 10 months ago

Hi @prateeksachan, it seems to work now. Thank you for the quick fix.

shibasish commented 10 months ago

Hi @prateeksachan, the issue is that the speech data received from ElevenLabs is PCM at a 16000 Hz sample rate, whereas the audio sent to Twilio has to be downsampled to 8000 Hz. Not doing this led to the distorted response in Twilio. Also, the Accept header is not required in the ElevenLabs request, since you are already setting output_format; the Accept header doesn't seem to do anything. Please try out the TTS API in ElevenLabs. I have fixed this locally and will open a PR for you to review. Would that be OK?
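To make the point about the Accept header concrete, here is a rough sketch of a request that selects the audio format through `output_format` alone. It only builds the request and does not send it. The helper name `build_tts_request` is hypothetical, and the endpoint path, the `xi-api-key` header, and the `pcm_16000` value reflect my understanding of the ElevenLabs API, so please check them against the current ElevenLabs docs.

```python
import json
import urllib.parse
import urllib.request


def build_tts_request(voice_id: str, text: str, api_key: str,
                      output_format: str = "pcm_16000") -> urllib.request.Request:
    """Build (but do not send) a streaming text-to-speech request.

    Hypothetical helper: the audio format is chosen only via the
    output_format query parameter, so no Accept header is set.
    """
    query = urllib.parse.urlencode({"output_format": output_format})
    url = (
        "https://api.elevenlabs.io/v1/text-to-speech/"
        f"{voice_id}/stream?{query}"
    )
    body = json.dumps({"text": text}).encode("utf-8")
    headers = {
        "xi-api-key": api_key,               # assumed auth header name
        "Content-Type": "application/json",
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


# voice_id taken from the configuration at the top of this issue;
# the API key is a placeholder.
request = build_tts_request("hNJixpY16PhIcW2MXauQ", "Hello there", "YOUR_API_KEY")
```

With `pcm_16000` the response would still need the 16000 Hz to 8000 Hz conversion before it goes to Twilio.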

prateeksachan commented 10 months ago

@shibasish that would be awesome. We'll definitely review your PR and merge the changes. Thanks!