Open USTAADCOM opened 1 month ago
Hey @USTAADCOM, I'm here to assist you with any questions or issues you may have. Feel free to ask for help!
Yes, you can use the same pattern to create the Retrieval-Augmented Generation (RAG) system for the Urdu dataset using the build_sentence_window_index function and OpenAI's latest multilingual model for embeddings. The code below builds a RAG system using SentenceSplitter for sentence-level parsing and OpenAI's embedding model. You can adapt this pattern for your Urdu dataset, provided that the OpenAI model you use supports Urdu.
Here is the adapted code:
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.ingestion import IngestionPipeline

# create the pipeline with transformations
pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(),
    ]
)

# load the documents and split them into nodes
loader = SimpleDirectoryReader(input_dir="./data")
documents = loader.load_data()
nodes = pipeline.run(documents=documents, show_progress=True)

# models
llm = OpenAI()
embed_model = OpenAIEmbedding()

# build RAG
index = VectorStoreIndex(nodes=nodes, embed_model=embed_model)
retriever = index.as_retriever()

if __name__ == "__main__":
    from llama_index.networks.contributor.retriever import ContributorRetrieverService
    import uvicorn

    retriever_service = ContributorRetrieverService.from_config_file(
        ".env.contributor.service", retriever=retriever
    )
    retriever_app = retriever_service.app

    # can add own endpoints or security to app
    # @app.get("...")
    # async def custom_endpoint_logic():
    #     ...

    uvicorn.run(retriever_app, host="0.0.0.0", port=8000, log_level="debug")
Ensure that the OpenAI model you select is capable of handling Urdu text for both embeddings and language generation. Additionally, you can refer to the LlamaIndex documentation for more advanced use cases, such as multi-modal RAG systems, which might be useful if you plan to incorporate other data types like images along with text.
For a deeper understanding of the concepts and stages involved in building a RAG system, you can refer to the LlamaIndex documentation. It covers key stages such as loading, indexing, storing, querying, and evaluation, which are essential for creating efficient and accurate data-backed LLM applications.
Additionally, you can leverage the SentenceWindowRetrieverPack class from LlamaIndex for a more structured approach:
from llama_index.core import Document
from llama_index.packs.sentence_window_retriever.base import SentenceWindowRetrieverPack
# Load your documents
documents = [Document(text="Your Urdu text here")]
# Initialize the SentenceWindowRetrieverPack
retriever_pack = SentenceWindowRetrieverPack(docs=documents)
# Run the retriever (the query string is passed positionally)
results = retriever_pack.run("Your query here")
print(results)
This class provides a comprehensive setup for building input nodes, creating a vector index, and retrieving relevant text, making it easier to implement a RAG system for your Urdu dataset [1][2][3][4][5].
@dosu It's not working for Urdu data.
My problem is that the document embeddings are created correctly, but the Urdu text is stored in the embedding store as Unicode escape sequences, as in the example below.
"text_id_to_ref_doc_id": {"82193051-86f8-4b43-9212-df14e37b9897": "c8549dbf-25d4-4645-8bc9-7420d20160e7"}, "metadata_dict": {"82193051-86f8-4b43-9212-df14e37b9897": {"window": "\u063a\u06cc\u0631 \u0633\u0631\u06a9\u0627\u0631\u06cc \u0646\u062a\u0627\u0626\u062c \u06a9\u06d2 \u0645\u0637\u0627\u0628\u0642 \u0645\u0631\u06a9 \u0632 \u0627\u0648\u0631 \u0635\u0648\u0628\u06c1 \u062e\u06cc\u0628\u0631 \u067e\u062e\u062a\u0648\u0646 \u062e\u0648\u0627\u06c1 \u0645\u06cc\u06ba \u062a\u062d\u0631\u06cc\u06a9 \u0627\u0646\u0635\u0627\u0641 \u0646\u06d2 \u0645\u06cc\u062f\u0627\u0646 \u0645\u0627\u0631 \u0644\u06cc\u0627\u06d4 \u067e\u0646\u062c\u0627\u0628 \u0645\u06cc\u06ba \u0646\u0648\u0646 \u0644\u06cc\u06af \u0622\u06af\u06d2 \u0622\u06af\u06d2 \u06c1\u06d2\u06d4\u0633\u0646\u062f \u06be \u0645 \u0645\u06cc\u06ba \u067e\u0644\u0632 \u067e \u0631 \u0679\u0631 \u0646 \u0646\u06d2 \u06a9\u0627\u0627\u0628\u06cc \u062d \u0635\u0644 \u06a9 \u06d4 \u06d4 \u0627\u0686\u0627 \n\u0645\u06cc\u06ba \u0645\u0644\u06a9 \u062f\u0634\u0645\u0646 \u0627\u06cc\u0645 \u06a9\u06cc\u0648 \u0627\u06cc\u0645 \u06a9\u0627 \u0635\u0641\u0627\u0631\u06a9\u0679 \u06c1\u0648 \u06af\u06cc\u0627\u06d4 \u0628\u0644\u0648\u0686\u0633\u062a\u0627\u0646 \u0645\u06cc\u06ba \u0645\u0644\u0627 \u062c\u0644\u0627 \u0631\u062c\u062d\u0627\u0646 \u06c1\u06d2\u06d4 \u06cc\u06c1 \u0646\u062a\u0627\u0626\u062c \u0648\u06c1\u06cc \u06cc\u06ba \u0633 \u06d4 \u06cc\u0646 \u0627\u0627\u0642\u0648\u0627\u0627\u06cc \u06cc\u0688\u06cc\u0631\u06a9\u0679 \u0646\u06d2 \u06be\u06cc \u06cc\u0634\u06af\u0648\u0626\u06cc \u06d4 \u06be\u06cc\u06d4\u06d4 \u0631 \u0679\u0633\u062a\u0627 \u0646 \u0645\u06cc\u06ba \u06be\u06cc \u06cc\u06c1 \u06cc\u06c1 \u06a9\u0627\u0631 \u06a9\u06c1 \u0631\u06c1 \u0631\u06c1\u06d2 \u06a9\u06c1 \u0627\u0635 \u0627 \u06a9 \n\u0645\u0642\u0627\u0628\u0644\u06c1 \u062a\u062d\u0631\u06cc\u06a9 \u0627\u0646\u0635\u0627\u0641 \u0627\u0648\u0631 \u0646\u0648\u0646 \u0644\u06cc\u06af 
\u0645\u06cc\u06ba \u06c1\u0648 \u06af\u0627 \u0627\u0648\u0631 \u062a\u062d\u0631\u06cc\u06a9 \u0627\u0646\u0635\u0627\u0641 \u06a9\u0627 \u067e\u0644\u0691\u0627 \u0628\u06be\u0627\u0631\u06cc \u0631\u06c1\u06d2 \u06af\u0627\u06d4 ", "original_text": "\u063a\u06cc\u0631 \u0633\u0631\u06a9\u0627\u0631\u06cc \u0646\u062a\u0627\u0626\u062c \u06a9\u06d2 \u0645\u0637\u0627\u0628\u0642 \u0645\u0631\u06a9 \u0632 \u0627\u0648\u0631 \u0635\u0648\u0628\u06c1 \u062e\u06cc\u0628\u0631 \u067e\u062e\u062a\u0648\u0646 \u062e\u0648\u0627\u06c1 \u0645\u06cc\u06ba \u062a\u062d\u0631\u06cc\u06a9 \u0627\u0646\u0635\u0627\u0641 \u0646\u06d2 \u0645\u06cc\u062f\u0627\u0646 \u0645\u0627\u0631 \u0644\u06cc\u0627\u06d4 \u067e\u0646\u062c\u0627\u0628 \u0645\u06cc\u06ba \u0646\u0648\u0646 \u0644\u06cc\u06af \u0622\u06af\u06d2 \u0622\u06af\u06d2 \u06c1\u06d2\u06d4\u0633\u0646\u062f \u06be \u0645 \u0645\u06cc\u06ba \u067e\u0644\u0632 \u067e \u0631 \u0679\u0631 \u0646 \u0646\u06d2 \u06a9\u0627\u0627\u0628\u06cc \u062d \u0635\u0644 \u06a9 \u06d4 \u06d4 \u0627\u0686\u0627 \n\u0645\u06cc\u06ba \u0645\u0644\u06a9 \u062f\u0634\u0645\u0646 \u0627\u06cc\u0645 \u06a9\u06cc\u0648 \u0627\u06cc\u0645 \u06a9\u0627 \u0635\u0641\u0627\u0631\u06a9\u0679 \u06c1\u0648 \u06af\u06cc\u0627\u06d4 \u0628\u0644\u0648\u0686\u0633\u062a\u0627\u0646 \u0645\u06cc\u06ba \u0645\u0644\u0627 \u062c\u0644\u0627 \u0631\u062c\u062d\u0627\u0646 \u06c1\u06d2\u06d4 \u06cc\u06c1 \u0646\u062a\u0627\u0626\u062c \u0648\u06c1\u06cc \u06cc\u06ba \u0633 \u06d4 \u06cc\u0646 \u0627\u0627\u0642\u0648\u0627\u0627\u06cc \u06cc\u0688\u06cc\u0631\u06a9\u0679 \u0646\u06d2 \u06be\u06cc \u06cc\u0634\u06af\u0648\u0626\u06cc \u06d4 \u06be\u06cc\u06d4\u06d4 \u0631 \u0679\u0633\u062a\u0627 \u0646 \u0645\u06cc\u06ba \u06be\u06cc \u06cc\u06c1 \u06cc\u06c1 \u06a9\u0627\u0631 \u06a9\u06c1 \u0631\u06c1 \u0631\u06c1\u06d2 \u06a9\u06c1 \u0627\u0635 \u0627 \u06a9 \n\u0645\u0642\u0627\u0628\u0644\u06c1 \u062a\u062d\u0631\u06cc\u06a9 
\u0627\u0646\u0635\u0627\u0641 \u0627\u0648\u0631 \u0646\u0648\u0646 \u0644\u06cc\u06af \u0645\u06cc\u06ba \u06c1\u0648 \u06af\u0627 \u0627\u0648\u0631 \u062a\u062d\u0631\u06cc\u06a9 \u0627\u0646\u0635\u0627\u0641 \u06a9\u0627 \u067e\u0644\u0691\u0627 \u0628\u06be\u0627\u0631\u06cc \u0631\u06c1\u06d2 \u06af\u0627\u06d4 ", "_node_type": "TextNode", "document_id": "c8549dbf-25d4-4645-8bc9-7420d20160e7", "doc_id": "c8549dbf-25d4-4645-8bc9-7420d20160e7", "ref_doc_id": "c8549dbf-25d4-4645-8bc9-7420d20160e7"}}}
The original text is: غیر سرکاری نتائج کے مطابق مرکز اور صوبہ خیبر پختون خواہ میں تحریک انصاف نے میدان مار لیا۔ پنجاب میں نون لیگ آگے آگے ہے۔ صوبہ سندھ میں پیپلز پارٹی نے کامیابی حاصل کی۔ کراچی میں ایم کیو ایم کا صفایا ہو گیا۔ بلوچستان میں ملا جلا رجحان ہے۔ یہ نتائج وہی ہیں جس کی بین الاقوامی میڈیا نے بھی پیشگوئی کی تھی۔ پاکستان میں بھی تجزیہ کار یہی کہہ رہے تھے کہ اصل مقابلہ تحریک انصاف اور نون لیگ میں ہو گا اور تحریک انصاف کا پلڑا بھاری رہے گا۔
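The \uXXXX sequences in the persisted file look like JSON escapes rather than corrupted data. Python's json.dumps escapes non-ASCII characters by default (ensure_ascii=True), and json.loads restores them. A minimal sketch using only the standard library, assuming the store is written with the json module's defaults (the variable names and the sample key are illustrative):

```python
import json

text = "غیر سرکاری نتائج"

# Default serialization escapes every non-ASCII character as \uXXXX.
escaped = json.dumps({"original_text": text})

# ensure_ascii=False keeps the characters human-readable in the output.
readable = json.dumps({"original_text": text}, ensure_ascii=False)

print(escaped)
print(readable)

# Both forms load back to the identical Python string.
assert json.loads(escaped)["original_text"] == text
assert json.loads(readable)["original_text"] == text
```

If this assumption holds, the escaped form only affects how the file looks on disk; the text loaded back into memory is unchanged.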
@hamza233 I am using OpenAI's latest embedding model, which supports Urdu as well.
You are getting Unicode escape sequences; all you need to do is decode them.
def decode_unicode_string(unicode_string):
    # Remove the 'u' prefix and quotes if present
    cleaned_string = unicode_string.strip("u'\"")
    # Replace Unicode escape sequences with their actual characters
    decoded_string = cleaned_string.encode('utf-8').decode('unicode_escape')
    return decoded_string
I tested it with the value of original_text, and it returns this:
غیر سرکاری نتائج کے مطابق مرک ز اور صوبہ خیبر پختون خواہ میں تحریک انصاف نے میدان مار لیا۔ پنجاب میں نون لیگ آگے آگے ہے۔سند ھ م میں پلز پ ر ٹر ن نے کاابی ح صل ک ۔ ۔ اچا میں ملک دشمن ایم کیو ایم کا صفارکٹ ہو گیا۔ بلوچستان میں ملا جلا رجحان ہے۔ یہ نتائج وہی یں س ۔ ین ااقواای یڈیرکٹ نے ھی یشگوئی ۔ ھی۔۔ ر ٹستا ن میں ھی یہ یہ کار کہ رہ رہے کہ اص ا ک مقابلہ تحریک انصاف اور نون لیگ میں ہو گا اور تحریک انصاف کا پلڑا بھاری رہے گا۔
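The decoded string above still differs from the source paragraph (for example, a space appears inside the word for "centre"), which suggests the characters were already altered before the text was stored, possibly during document loading, rather than by the Unicode escaping itself. A minimal sketch for checking this, assuming stored is an escaped value copied from the store and source is the reference text. The helpers decode_json_escapes and first_mismatch are hypothetical names, and the JSON-based decoding assumes the value contains no unescaped double quotes:

```python
import json
from difflib import SequenceMatcher


def decode_json_escapes(escaped: str) -> str:
    """Decode \\uXXXX escapes by parsing the value as a JSON string literal."""
    return json.loads(f'"{escaped}"')


def first_mismatch(stored: str, source: str) -> int:
    """Return the index of the first differing character, or -1 if equal."""
    for i, (a, b) in enumerate(zip(stored, source)):
        if a != b:
            return i
    if len(stored) == len(source):
        return -1
    return min(len(stored), len(source))


# Fragment taken from the stored "original_text" value.
stored = decode_json_escapes(r"\u0645\u0631\u06a9 \u0632 \u0627\u0648\u0631")
# The same fragment as it appears in the source document.
source = "مرکز اور"

print(stored)
print(first_mismatch(stored, source))
print(round(SequenceMatcher(None, stored, source).ratio(), 2))
```

A mismatch index other than -1 on decoded text points at the loading or parsing step, so the raw output of the document loader is worth inspecting before the embedding stage.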
Question Validation
Question
I want to create a RAG system on an Urdu dataset. I am already using LlamaIndex for RAG over English documents with build_sentence_window_index.
def create_db_for_sentence_window_retrieval():
    """ doc string """
    path = "data"
    global query_engine
    sentence_index = build_sentence_window_index(
        path,
        llm,
        embed_model=embed_model,  # text-embedding-3-large
        save_dir=PERSIST_DIR_WINDOW_RETRIEVAL,
    )
    query_engine = get_sentence_window_query_engine(
        qa_template, sentence_index, rerank_top_n=3
    )