Hi, thank you for your excellent work. As we know, text-to-text models can be used for Retrieval-Augmented Generation (RAG). To clarify my use case: my personal data is in text format, but the assistant I want to build should accept either audio or text as input. I’d like to avoid converting audio to text for contextual retrieval. I have a couple of questions:
Is it possible to search documents using voice embeddings directly by passing in voice data?
Is it possible to provide contextual text as input to the model alongside an audio file?