[Question]: Can I create a RAG pipeline for an Urdu dataset using the same pattern as for an English dataset? I am using OpenAI's latest multilingual model for embeddings.

USTAADCOM commented 1 month ago

Question Validation

[X] I have searched both the documentation and discord for an answer.

Question

I want to build a RAG pipeline on an Urdu dataset. I already use LlamaIndex for RAG over English documents, via build_sentence_window_index.

def create_db_for_sentence_window_retrieval():
    """ doc string """
    path = "data"
    global query_engine
    sentence_index = build_sentence_window_index(
        path,
        llm,
        embed_model=embed_model,  # text-embedding-3-large
        save_dir=PERSIST_DIR_WINDOW_RETRIEVAL,
    )
    query_engine = get_sentence_window_query_engine(
        qa_template, sentence_index, rerank_top_n=3
    )

dosubot[bot] commented 1 month ago

Hey @USTAADCOM, I'm here to assist you with any questions or issues you may have. Feel free to ask for help!

Yes, you can use the same pattern to create the Retrieval-Augmented Generation (RAG) system for the Urdu dataset using the build_sentence_window_index function and OpenAI's latest multilingual model for embeddings. The provided code demonstrates how to build a RAG system using the SentenceSplitter for sentence-level parsing and OpenAI's embedding model. You can adapt this pattern for your Urdu dataset by ensuring that the OpenAI model you use supports Urdu.

Here is the adapted code:

from llama_index.core import VectorStoreIndex, Document, SimpleDirectoryReader
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.ingestion import IngestionPipeline

# create the pipeline with transformations
pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(),
    ]
)

# build the index
loader = SimpleDirectoryReader(input_dir="./data")
documents = loader.load_data()
nodes = pipeline.run(documents=documents, show_progress=True)

# models
llm = OpenAI()
embed_model = OpenAIEmbedding()

# build RAG
index = VectorStoreIndex(nodes=nodes, embed_model=embed_model)
retriever = index.as_retriever()

if __name__ == "__main__":
    from llama_index.networks.contributor.retriever import ContributorRetrieverService
    import uvicorn

    retriever_service = ContributorRetrieverService.from_config_file(
        ".env.contributor.service", retriever=retriever
    )
    retriever_app = retriever_service.app

    # can add own endpoints or security to app
    # @app.get("...")
    # async def custom_endpoint_logic():
    #   ...

    uvicorn.run(retriever_app, host="0.0.0.0", port=8000, log_level="debug")

Ensure that the OpenAI model you select is capable of handling Urdu text for both embeddings and language generation. Additionally, you can refer to the LlamaIndex documentation for more advanced use cases, such as multi-modal RAG systems, which might be useful if you plan to incorporate other data types like images along with text.
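One way to check whether an embedding model handles Urdu well enough is to rank a few Urdu passages against an Urdu query and see whether the expected passage comes out on top. The sketch below is a minimal, offline illustration of that check. The names `cosine_similarity`, `rank_passages`, and `toy_embed` are hypothetical helpers introduced here, and `toy_embed` is only a stand-in so the example runs without an API key; it is not a real embedding model.

```python
import math
from typing import Callable, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_passages(
    query: str,
    passages: List[str],
    embed: Callable[[str], Sequence[float]],
) -> List[Tuple[float, str]]:
    """Score every passage against the query and return them best-first."""
    query_vector = embed(query)
    scored = [(cosine_similarity(query_vector, embed(p)), p) for p in passages]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


def toy_embed(text: str) -> List[float]:
    """Hypothetical stand-in embedder: counts of a few Urdu letters. NOT a real model."""
    return [float(text.count(ch)) for ch in "اپکم"]


if __name__ == "__main__":
    passages = ["پنجاب میں نون لیگ آگے آگے ہے۔", "بلوچستان میں ملا جلا رجحان ہے۔"]
    for score, passage in rank_passages("پنجاب", passages, toy_embed):
        print(round(score, 3), passage)
```

To run the same check against a real model, `embed` could be replaced with the `get_text_embedding` method of the `OpenAIEmbedding` instance already used in this thread; that substitution is an assumption about your setup and requires an API key.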

For a deeper understanding of the concepts and stages involved in building a RAG system, you can refer to the LlamaIndex documentation. It covers key stages such as loading, indexing, storing, querying, and evaluation, which are essential for creating efficient and accurate data-backed LLM applications.

Additionally, you can leverage the SentenceWindowRetrieverPack class from LlamaIndex for a more structured approach:

from llama_index.core import Document
from llama_index.packs.sentence_window_retriever.base import SentenceWindowRetrieverPack

# Load your documents
documents = [Document(text="Your Urdu text here")]

# Initialize the SentenceWindowRetrieverPack
retriever_pack = SentenceWindowRetrieverPack(docs=documents)

# Run the retriever (the query is passed positionally and forwarded to the pack's query engine)
results = retriever_pack.run("Your query here")
print(results)

This class provides a comprehensive setup for building input nodes, creating a vector index, and retrieving relevant text, making it easier to implement a RAG system for your Urdu dataset ^[1]^[2]^[3]^[4]^[5].
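One point worth checking when reusing a sentence-window setup for Urdu is sentence splitting: Urdu ends sentences with its own full stop, ۔ (U+06D4), and a splitter tuned for English punctuation may not break on it. The sketch below uses only the standard library to show the idea; it is not LlamaIndex's implementation, and `split_urdu_sentences` and `build_sentence_windows` are hypothetical names. The `window` and `original_text` keys mirror the metadata fields that appear in the stored index later in this thread.

```python
import re
from typing import Dict, List

# A run of non-terminator characters followed by any sentence-ending marks:
# Urdu full stop (U+06D4), Arabic question mark (U+061F), and ASCII . ! ?
# Naive by design: it will also split on the dot in a decimal number.
_SENTENCE = re.compile(r"[^\u06D4\u061F.!?]+[\u06D4\u061F.!?]*")


def split_urdu_sentences(text: str) -> List[str]:
    """Split text into sentences on Urdu and ASCII sentence-ending marks."""
    return [m.strip() for m in _SENTENCE.findall(text) if m.strip()]


def build_sentence_windows(text: str, window_size: int = 1) -> List[Dict[str, str]]:
    """For each sentence, record it together with its neighbouring sentences."""
    sentences = split_urdu_sentences(text)
    nodes = []
    for i, sentence in enumerate(sentences):
        start = max(0, i - window_size)
        end = min(len(sentences), i + window_size + 1)
        nodes.append(
            {
                "original_text": sentence,
                "window": " ".join(sentences[start:end]),
            }
        )
    return nodes


if __name__ == "__main__":
    sample = "پنجاب میں نون لیگ آگے آگے ہے۔ بلوچستان میں ملا جلا رجحان ہے۔"
    for node in build_sentence_windows(sample):
        print(node)
```

If your LlamaIndex version lets the sentence-window node parser accept a custom sentence-splitting callable, a function like `split_urdu_sentences` could be supplied there; treat that as something to verify against the documentation for your installed version.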

USTAADCOM commented 1 month ago

@dosu It's not working for urdu data.

USTAADCOM commented 1 month ago

My problem is that when I create the document embeddings, they are created without errors, but the Urdu text is stored in the embedding store as Unicode escape sequences, as in the example below.

"text_id_to_ref_doc_id": {"82193051-86f8-4b43-9212-df14e37b9897": "c8549dbf-25d4-4645-8bc9-7420d20160e7"}, "metadata_dict": {"82193051-86f8-4b43-9212-df14e37b9897": {"window": "\u063a\u06cc\u0631 \u0633\u0631\u06a9\u0627\u0631\u06cc \u0646\u062a\u0627\u0626\u062c \u06a9\u06d2 \u0645\u0637\u0627\u0628\u0642 \u0645\u0631\u06a9 \u0632 \u0627\u0648\u0631 \u0635\u0648\u0628\u06c1 \u062e\u06cc\u0628\u0631 \u067e\u062e\u062a\u0648\u0646 \u062e\u0648\u0627\u06c1 \u0645\u06cc\u06ba \u062a\u062d\u0631\u06cc\u06a9 \u0627\u0646\u0635\u0627\u0641 \u0646\u06d2 \u0645\u06cc\u062f\u0627\u0646 \u0645\u0627\u0631 \u0644\u06cc\u0627\u06d4 \u067e\u0646\u062c\u0627\u0628 \u0645\u06cc\u06ba \u0646\u0648\u0646 \u0644\u06cc\u06af \u0622\u06af\u06d2 \u0622\u06af\u06d2 \u06c1\u06d2\u06d4\u0633\u0646\u062f \u06be \u0645 \u0645\u06cc\u06ba \u067e\u0644\u0632 \u067e \u0631 \u0679\u0631 \u0646 \u0646\u06d2 \u06a9\u0627\u0627\u0628\u06cc \u062d \u0635\u0644 \u06a9 \u06d4 \u06d4 \u0627\u0686\u0627 \n\u0645\u06cc\u06ba \u0645\u0644\u06a9 \u062f\u0634\u0645\u0646 \u0627\u06cc\u0645 \u06a9\u06cc\u0648 \u0627\u06cc\u0645 \u06a9\u0627 \u0635\u0641\u0627\u0631\u06a9\u0679 \u06c1\u0648 \u06af\u06cc\u0627\u06d4 \u0628\u0644\u0648\u0686\u0633\u062a\u0627\u0646 \u0645\u06cc\u06ba \u0645\u0644\u0627 \u062c\u0644\u0627 \u0631\u062c\u062d\u0627\u0646 \u06c1\u06d2\u06d4 \u06cc\u06c1 \u0646\u062a\u0627\u0626\u062c \u0648\u06c1\u06cc \u06cc\u06ba \u0633 \u06d4 \u06cc\u0646 \u0627\u0627\u0642\u0648\u0627\u0627\u06cc \u06cc\u0688\u06cc\u0631\u06a9\u0679 \u0646\u06d2 \u06be\u06cc \u06cc\u0634\u06af\u0648\u0626\u06cc \u06d4 \u06be\u06cc\u06d4\u06d4 \u0631 \u0679\u0633\u062a\u0627 \u0646 \u0645\u06cc\u06ba \u06be\u06cc \u06cc\u06c1 \u06cc\u06c1 \u06a9\u0627\u0631 \u06a9\u06c1 \u0631\u06c1 \u0631\u06c1\u06d2 \u06a9\u06c1 \u0627\u0635 \u0627 \u06a9 \n\u0645\u0642\u0627\u0628\u0644\u06c1 \u062a\u062d\u0631\u06cc\u06a9 \u0627\u0646\u0635\u0627\u0641 \u0627\u0648\u0631 \u0646\u0648\u0646 \u0644\u06cc\u06af 
\u0645\u06cc\u06ba \u06c1\u0648 \u06af\u0627 \u0627\u0648\u0631 \u062a\u062d\u0631\u06cc\u06a9 \u0627\u0646\u0635\u0627\u0641 \u06a9\u0627 \u067e\u0644\u0691\u0627 \u0628\u06be\u0627\u0631\u06cc \u0631\u06c1\u06d2 \u06af\u0627\u06d4 ", "original_text": "\u063a\u06cc\u0631 \u0633\u0631\u06a9\u0627\u0631\u06cc \u0646\u062a\u0627\u0626\u062c \u06a9\u06d2 \u0645\u0637\u0627\u0628\u0642 \u0645\u0631\u06a9 \u0632 \u0627\u0648\u0631 \u0635\u0648\u0628\u06c1 \u062e\u06cc\u0628\u0631 \u067e\u062e\u062a\u0648\u0646 \u062e\u0648\u0627\u06c1 \u0645\u06cc\u06ba \u062a\u062d\u0631\u06cc\u06a9 \u0627\u0646\u0635\u0627\u0641 \u0646\u06d2 \u0645\u06cc\u062f\u0627\u0646 \u0645\u0627\u0631 \u0644\u06cc\u0627\u06d4 \u067e\u0646\u062c\u0627\u0628 \u0645\u06cc\u06ba \u0646\u0648\u0646 \u0644\u06cc\u06af \u0622\u06af\u06d2 \u0622\u06af\u06d2 \u06c1\u06d2\u06d4\u0633\u0646\u062f \u06be \u0645 \u0645\u06cc\u06ba \u067e\u0644\u0632 \u067e \u0631 \u0679\u0631 \u0646 \u0646\u06d2 \u06a9\u0627\u0627\u0628\u06cc \u062d \u0635\u0644 \u06a9 \u06d4 \u06d4 \u0627\u0686\u0627 \n\u0645\u06cc\u06ba \u0645\u0644\u06a9 \u062f\u0634\u0645\u0646 \u0627\u06cc\u0645 \u06a9\u06cc\u0648 \u0627\u06cc\u0645 \u06a9\u0627 \u0635\u0641\u0627\u0631\u06a9\u0679 \u06c1\u0648 \u06af\u06cc\u0627\u06d4 \u0628\u0644\u0648\u0686\u0633\u062a\u0627\u0646 \u0645\u06cc\u06ba \u0645\u0644\u0627 \u062c\u0644\u0627 \u0631\u062c\u062d\u0627\u0646 \u06c1\u06d2\u06d4 \u06cc\u06c1 \u0646\u062a\u0627\u0626\u062c \u0648\u06c1\u06cc \u06cc\u06ba \u0633 \u06d4 \u06cc\u0646 \u0627\u0627\u0642\u0648\u0627\u0627\u06cc \u06cc\u0688\u06cc\u0631\u06a9\u0679 \u0646\u06d2 \u06be\u06cc \u06cc\u0634\u06af\u0648\u0626\u06cc \u06d4 \u06be\u06cc\u06d4\u06d4 \u0631 \u0679\u0633\u062a\u0627 \u0646 \u0645\u06cc\u06ba \u06be\u06cc \u06cc\u06c1 \u06cc\u06c1 \u06a9\u0627\u0631 \u06a9\u06c1 \u0631\u06c1 \u0631\u06c1\u06d2 \u06a9\u06c1 \u0627\u0635 \u0627 \u06a9 \n\u0645\u0642\u0627\u0628\u0644\u06c1 \u062a\u062d\u0631\u06cc\u06a9 
\u0627\u0646\u0635\u0627\u0641 \u0627\u0648\u0631 \u0646\u0648\u0646 \u0644\u06cc\u06af \u0645\u06cc\u06ba \u06c1\u0648 \u06af\u0627 \u0627\u0648\u0631 \u062a\u062d\u0631\u06cc\u06a9 \u0627\u0646\u0635\u0627\u0641 \u06a9\u0627 \u067e\u0644\u0691\u0627 \u0628\u06be\u0627\u0631\u06cc \u0631\u06c1\u06d2 \u06af\u0627\u06d4 ", "_node_type": "TextNode", "document_id": "c8549dbf-25d4-4645-8bc9-7420d20160e7", "doc_id": "c8549dbf-25d4-4645-8bc9-7420d20160e7", "ref_doc_id": "c8549dbf-25d4-4645-8bc9-7420d20160e7"}}}
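For context, `\uXXXX` sequences like these are how Python's `json` module writes non-ASCII characters by default. Assuming the store is a JSON file written with those defaults, the escapes are a lossless on-disk representation and loading the file restores the Urdu characters. A minimal standard-library sketch (the dictionary key simply reuses the `original_text` field name from the dump above):

```python
import json

urdu_text = "پنجاب میں نون لیگ آگے آگے ہے۔"

# Default json.dumps (ensure_ascii=True) escapes every non-ASCII character as \uXXXX.
escaped = json.dumps({"original_text": urdu_text})

# Parsing the JSON restores the original characters exactly.
restored = json.loads(escaped)["original_text"]

# ensure_ascii=False writes the characters as-is, which is easier to read in a file.
readable = json.dumps({"original_text": urdu_text}, ensure_ascii=False)

if __name__ == "__main__":
    print(escaped)
    print(restored == urdu_text)
    print(readable)
```

In other words, seeing escapes in the persisted file does not by itself mean the text or the embeddings are wrong; what matters is the text obtained after the file is loaded.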

USTAADCOM commented 1 month ago

And the original text is: غیر سرکاری نتائج کے مطابق مرکز اور صوبہ خیبر پختون خواہ میں تحریک انصاف نے میدان مار لیا۔ پنجاب میں نون لیگ آگے آگے ہے۔ صوبہ سندھ میں پیپلز پارٹی نے کامیابی حاصل کی۔ کراچی میں ایم کیو ایم کا صفایا ہو گیا۔ بلوچستان میں ملا جلا رجحان ہے۔ یہ نتائج وہی ہیں جس کی بین الاقوامی میڈیا نے بھی پیشگوئی کی تھی۔ پاکستان میں بھی تجزیہ کار یہی کہہ رہے تھے کہ اصل مقابلہ تحریک انصاف اور نون لیگ میں ہو گا اور تحریک انصاف کا پلڑا بھاری رہے گا۔

hamza233 commented 1 month ago

You probably need an embedding model trained on Urdu text. Try this

USTAADCOM commented 1 month ago

@hamza233 I am using OpenAI's latest embedding model, which also supports the Urdu language.

hamza233 commented 1 month ago

You are getting Unicode escape sequences; all you need to do is decode them.

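def decode_unicode_string(unicode_string):
    # Remove a leading u'...' / u"..." prefix and the surrounding quotes, if present.
    # (str.strip("u'\"") would also eat legitimate leading or trailing 'u' characters.)
    cleaned_string = unicode_string
    if cleaned_string[:2] in ("u'", 'u"'):
        cleaned_string = cleaned_string[1:]
    cleaned_string = cleaned_string.strip("'\"")

    # Replace Unicode escape sequences with their actual characters.
    # 'backslashreplace' first turns any characters outside Latin-1 into escapes,
    # so text that is already decoded is not corrupted.
    decoded_string = cleaned_string.encode('latin-1', 'backslashreplace').decode('unicode_escape')

    return decoded_string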

I tested it with the value of original_text and it returns this: غیر سرکاری نتائج کے مطابق مرک ز اور صوبہ خیبر پختون خواہ میں تحریک انصاف نے میدان مار لیا۔ پنجاب میں نون لیگ آگے آگے ہے۔سند ھ م میں پلز پ ر ٹر ن نے کاابی ح صل ک ۔ ۔ اچا میں ملک دشمن ایم کیو ایم کا صفارکٹ ہو گیا۔ بلوچستان میں ملا جلا رجحان ہے۔ یہ نتائج وہی یں س ۔ ین ااقواای یڈیرکٹ نے ھی یشگوئی ۔ ھی۔۔ ر ٹستا ن میں ھی یہ یہ کار کہ رہ رہے کہ اص ا ک مقابلہ تحریک انصاف اور نون لیگ میں ہو گا اور تحریک انصاف کا پلڑا بھاری رہے گا۔
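The decoded output is not identical to the original passage posted earlier: several words come back with letters missing. Since escape sequences decode losslessly, this suggests the characters may already have been lost before the text was stored, for example while the source file was loaded or parsed. That is an inference from the two samples in this thread, not a confirmed diagnosis. A small standard-library sketch for quantifying the difference (`text_similarity` and `missing_fragments` are hypothetical helper names):

```python
import difflib
import unicodedata
from typing import List


def _normalize(text: str) -> str:
    """Apply NFC normalization so equivalent code point sequences compare equal."""
    return unicodedata.normalize("NFC", text)


def text_similarity(expected: str, actual: str) -> float:
    """Return a 0..1 similarity ratio between the expected and the actual text."""
    matcher = difflib.SequenceMatcher(
        None, _normalize(expected), _normalize(actual), autojunk=False
    )
    return matcher.ratio()


def missing_fragments(expected: str, actual: str) -> List[str]:
    """Return the pieces of `expected` that were deleted or replaced in `actual`."""
    a, b = _normalize(expected), _normalize(actual)
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return [
        a[i1:i2]
        for tag, i1, i2, _j1, _j2 in matcher.get_opcodes()
        if tag in ("delete", "replace")
    ]


if __name__ == "__main__":
    source_text = "صوبہ سندھ میں"  # what the source document says
    loaded_text = "سند ھ م میں"  # what came back after loading and decoding
    print(text_similarity(source_text, loaded_text))
    print(missing_fragments(source_text, loaded_text))
```

Running this on the text returned by the document loader, before any embedding or persistence step, would show whether the loss happens at load time.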

run-llama / llama_index #14927