run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Question]: RAG apps in languages other than English #9274

Closed: stephanedebove closed this issue 6 months ago

stephanedebove commented 11 months ago

Question

Hi everyone,

This is not a question related to LlamaIndex specifically, but I think you're the most qualified to answer given your experience with RAG apps.

Although I have now read a lot of documentation and watched a lot of videos about building RAG apps, I haven't seen much about building them in languages other than English (specifically French in my case).

I have seen that some embedding models are "multilingual", and the best one on the HuggingFace leaderboard is currently "multilingual-e5-large". Does anyone have experience using it? Does it make a big difference compared to non-multilingual models?

And do you know if some LLMs used for response synthesis are better than others at dealing with non-English languages? This is never mentioned in their docs, but if we have multilingual models for embeddings, I don't see why we wouldn't have multilingual models for response synthesis too.

Do you know of any good resources on this subject, or do you have experience with it?

dosubot[bot] commented 11 months ago

🤖

Hello @stephanedebove,

Nice to meet you! I'm Dosu, a bot here to help you while we wait for a human maintainer. I can assist with solving bugs, answering questions, and even guiding you toward becoming a contributor. Don't hesitate to ask anything you need!

Based on the information available in the LlamaIndex repository, LlamaIndex does support integration with multilingual embedding models. This is evident from the import statement that pulls in HuggingFaceBgeEmbeddings and HuggingFaceEmbeddings from langchain.embeddings. Hugging Face is known for providing multilingual models, which suggests that LlamaIndex can work with them.

Here is the supporting evidence from the context:

```python
from langchain.embeddings import HuggingFaceBgeEmbeddings, HuggingFaceEmbeddings
```
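To make this concrete, here is a minimal sketch of wiring a multilingual Hugging Face embedding model into LlamaIndex. It assumes the 0.9.x-era ServiceContext / HuggingFaceEmbedding API (newer releases configure this through Settings), and the data directory and French query are only placeholders:

```python
# Minimal sketch, assuming the llama_index 0.9.x API (ServiceContext /
# HuggingFaceEmbedding); newer releases use Settings instead.
from llama_index import ServiceContext, SimpleDirectoryReader, VectorStoreIndex
from llama_index.embeddings import HuggingFaceEmbedding

# Multilingual embedding model from Hugging Face (the one discussed in this thread).
embed_model = HuggingFaceEmbedding(model_name="intfloat/multilingual-e5-large")

# Use the multilingual model for both indexing and querying.
service_context = ServiceContext.from_defaults(embed_model=embed_model)

# "./data" is a placeholder folder of (for example) French documents.
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents, service_context=service_context)

# Query in French against French documents.
response = index.as_query_engine().query("Quelle est la politique de remboursement ?")
print(response)
```

Note that the embedding model only affects retrieval; the language of the answer still depends on the LLM used for synthesis and its prompts, which is the second part of the question.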


As for your question about the performance of multilingual models compared to non-multilingual models, it can vary depending on the specific use case and the languages involved. Generally, multilingual models are trained on data from multiple languages, which can help them understand the semantic meaning of text across different languages. However, they might not perform as well as monolingual models for specific languages, especially for languages that are underrepresented in the training data.

Regarding your question about LLMs for response synthesis in non-English languages, it's a valid point. While I don't have specific information about this in the context of LlamaIndex, it's generally true that some models might perform better than others for different languages. This could be due to the training data used, the architecture of the model, or other factors.

Unfortunately, I wasn't able to find specific resources or documentation about this topic in the context of LlamaIndex. However, I would recommend looking into the documentation of the specific models you're interested in, as they might provide more information about their performance in different languages.

I hope this helps! If you have any other questions, feel free to ask.

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

dinonovak commented 10 months ago

Following the same topic, I am having the same struggles, so it would be nice to get some feedback from real cases. I am also wondering about experience with different models. I found "intfloat/multilingual-e5-large" to be the most promising multilingual model, but has anyone managed to get it working with LlamaIndex?
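For anyone experimenting with this model: the E5 family was trained with "query: " and "passage: " prefixes, so retrieval quality can suffer if they are omitted. A minimal sketch, assuming the 0.9.x HuggingFaceEmbedding, which in that version exposes query_instruction / text_instruction parameters (check your installed version if this errors):

```python
# Sketch only: assumes llama_index 0.9.x, where HuggingFaceEmbedding accepts
# query_instruction / text_instruction (prefixes prepended to queries and
# indexed text, respectively).
from llama_index.embeddings import HuggingFaceEmbedding

embed_model = HuggingFaceEmbedding(
    model_name="intfloat/multilingual-e5-large",
    query_instruction="query: ",   # E5 models were trained with this query prefix
    text_instruction="passage: ",  # ...and this prefix on indexed passages
)

# Quick sanity check: French and English text should embed to the same dimension.
fr = embed_model.get_text_embedding("Bonjour, ceci est un document en français.")
en = embed_model.get_text_embedding("Hello, this is an English document.")
print(len(fr), len(en))
```

Once that works, the same embed_model can be dropped into a ServiceContext as shown earlier in the thread.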

dosubot[bot] commented 7 months ago

Hi, @stephanedebove,

I'm helping the LlamaIndex team manage their backlog and am marking this issue as stale. The issue you opened discusses building RAG apps in languages other than English, particularly French, and inquires about the performance of certain LLMs for non-English languages. I have provided information on the support for multilingual embedding models in the LlamaIndex repository and the potential performance variations of multilingual models. Additionally, another user, dinonovak, has expressed similar struggles and is seeking feedback on experiences with different models, specifically mentioning the "intfloat/multilingual-e5-large" model.

Could you please confirm if this issue is still relevant to the latest version of the LlamaIndex repository? If it is, please let the LlamaIndex team know by commenting on the issue. Otherwise, feel free to close the issue yourself, or the issue will be automatically closed in 7 days.

Thank you!

last-Programmer commented 5 months ago

I am also facing this issue. I am using OllamaEmbedding with nomic-embed-text; with a LangChain RAG pipeline I get responses in Portuguese, but when I try a LlamaIndex RAG pipeline it always responds in English. I tried changing the prompt templates to include something like "you always answer in Portuguese", but that did not help either.
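One thing that may help here is overriding the text QA prompt used by the response synthesizer, rather than only tweaking a system prompt. A minimal sketch, assuming an existing VectorStoreIndex named index and the PromptTemplate / text_qa_template API from llama_index 0.9.x:

```python
# Sketch: push the "answer in Portuguese" instruction into the text QA prompt
# that the response synthesizer actually uses. Assumes an existing `index`
# (VectorStoreIndex) and the llama_index 0.9.x PromptTemplate API.
from llama_index.prompts import PromptTemplate

qa_template = PromptTemplate(
    "A informação de contexto está abaixo.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Usando apenas o contexto acima, responda SEMPRE em português.\n"
    "Pergunta: {query_str}\n"
    "Resposta: "
)

query_engine = index.as_query_engine(text_qa_template=qa_template)
print(query_engine.query("Qual é o prazo de devolução?"))
```

If answers still come back in English, the refine prompt (refine_template) can be overridden in the same way, and the LLM itself may be the limiting factor: a model with stronger Portuguese coverage can matter more than the prompt wording.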