Documents Q&A Assistant - (Team Believers)

Sobhan-202318040 commented 2 days ago

Title

Documents Q&A Assistant: Integrating Chatbot Technology for Efficient Information Retrieval using RAG

Team Name

Team Believers

Email

202318040@daiict.ac.in

Team Member 1 Name

Kushal Barot

Team Member 1 Id

202318006

Team Member 2 Name

Aayush Vithalani

Team Member 2 Id

202318023

Team Member 3 Name

Harshil Shah

Team Member 3 Id

202318033

Team Member 4 Name

Sobhan Behuria

Team Member 4 Id

202318040

Problem Statement

This project aims to improve retrieval of relevant information from large PDF documents. It addresses the limitations of traditional keyword search by using vector embeddings and retrieval-augmented generation (RAG) to produce precise, context-aware answers.

Evaluation Strategy

The evaluation covers five areas: retrieval accuracy, measured by precision and recall; response relevance and user satisfaction; retrieval speed and scalability; handling of complex queries; and usability and error rates.
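
As an illustration of the precision and recall measurement, here is a minimal sketch for a single query. The function name, the page identifiers, and the cutoff `k` are assumptions made for the example; the proposal does not specify how relevance judgments will be collected.

```python
def precision_recall_at_k(retrieved, relevant, k):
    """Return (precision@k, recall@k) for one query.

    retrieved: ranked list of chunk/page ids returned by the system
    relevant:  set of ids judged relevant for the query (hypothetical labels)
    """
    top_k = retrieved[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical ranking and relevance labels for one query.
retrieved = ["p3", "p7", "p1", "p9"]
relevant = {"p1", "p3", "p5"}
p, r = precision_recall_at_k(retrieved, relevant, k=3)
```

Averaging these values over a set of labelled queries would give one system-level figure per metric.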

Dataset

The project involves handling diverse and user-specific PDFs, which means evaluation is based on actual system performance with real documents rather than a static dataset. The system’s effectiveness is assessed through its ability to process and respond to various user queries across different types of PDFs, making a fixed dataset insufficient for comprehensive evaluation.

Resources

[1] Petr Baudiš and Jan Šedivý. Modeling of the question answering task in the YodaQA system. In International Conference of the Cross-Language Evaluation Forum for European Languages, pages 222–228. Springer, 2015. URL https://link.springer.com/chapter/10.1007%2F978-3-319-24027-5_20.
[2] Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. Semantic Parsing on Freebase from Question-Answer Pairs. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1533–1544, Seattle, Washington, USA, October 2013. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/D13-1160.
[3] X. Cheng, D. Luo, X. Chen, L. Liu, D. Zhao, and R. Yan. Lift yourself up: Retrieval-augmented text generation with self memory. arXiv preprint arXiv:2305.02437, 2023.
[4] S. Wang, Y. Xu, Y. Fang, Y. Liu, S. Sun, R. Xu, C. Zhu, and M. Zeng. Training data is more valuable than you think: A simple and effective method by retrieving from training data. arXiv preprint arXiv:2203.08773, 2022.

parth126 commented 1 day ago

Issues in the proposal:

Dataset: No project is valid without a valid dataset
Too broad a problem statement. How is it different from retrieving text documents? Are there any PDF-specific problems? Will the PDFs be scanned or digital?
What approach will be used for matching? Will it be done within Terrier, Elasticsearch, etc.? Or implemented from scratch?

Try to answer these questions by tomorrow. In the current state, this problem definition is not acceptable.

Sobhan-202318040 commented 21 hours ago

1. Dataset: No project is valid without a valid dataset: We understand the need for a valid dataset, and we plan to work with a collection of digital PDF documents. This will allow us to evaluate how well our system extracts and retrieves information from PDFs with varied structures and contents. We may also supplement this collection with publicly available datasets from sources such as Kaggle or research repositories.

2. (i) Too broad a problem statement: We agree that the problem statement could be more focused. Our project will concentrate on extracting and retrieving contextually accurate information from large digital PDF documents, particularly where traditional search methods struggle with complex structures such as tables and charts.

(ii) How is it different from retrieving text documents?: This project differs from standard text retrieval because PDFs often present challenges such as:

Complex layouts: PDFs include tables, charts, and multi-column text, which complicate extraction.
Embedded elements: PDFs can contain embedded fonts and non-linear text structures that require special handling.

We plan to build a system that tackles these PDF-specific challenges, using vector embeddings to improve the accuracy and relevance of content retrieval.

(iii) Are there any PDF-specific problems?: Yes, PDFs have unique issues, such as:

Complex layouts: Extracting text from tables, charts, and multi-column documents requires specific handling.
Formatting and structure: PDFs often have non-linear or multi-layered text, making it harder to extract information in a meaningful order.
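
To make the reading-order problem concrete, here is a rough sketch that reorders text blocks from a two-column page. The `Block` type, the coordinates, and the equal-width column assumption are all hypothetical; a real extractor would supply the bounding boxes, and real layouts would need more careful column detection.

```python
from dataclasses import dataclass


@dataclass
class Block:
    x0: float  # left edge of the text block
    y0: float  # top edge (smaller value = higher on the page)
    text: str


def reading_order(blocks, page_width, n_columns=2):
    """Sort blocks column by column, top to bottom within each column.

    Assumes equal-width columns, which is a simplification.
    """
    col_width = page_width / n_columns

    def key(block):
        column = min(int(block.x0 // col_width), n_columns - 1)
        return (column, block.y0)

    return [b.text for b in sorted(blocks, key=key)]


# Blocks in the jumbled order a naive extractor might return them.
blocks = [
    Block(310, 50, "right-top"),
    Block(40, 50, "left-top"),
    Block(40, 400, "left-bottom"),
    Block(310, 400, "right-bottom"),
]
ordered = reading_order(blocks, page_width=600)
```

Sorting by vertical position alone would interleave the two columns, which is the failure mode described above.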

(iv) Will the PDFs be scanned or digital?: Our primary focus will be on digital PDFs, which allow efficient extraction and retrieval. If time permits, we plan to add OCR to handle scanned PDFs as well, broadening the scope to both types of documents.

3. What approach will be used for matching? Will it be done within Terrier, Elasticsearch, etc.? Or implemented from scratch?:

We will implement matching with vector embeddings: text is converted into vectors and searched semantically, most probably with FAISS. This enables context-aware retrieval from PDFs. We may integrate Elasticsearch or Pinecone for scalable indexing and faster querying, but the core functionality will rely on LangChain and Google Gemini for semantic understanding.
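
The matching step can be sketched as a nearest-neighbour search by cosine similarity over vectors. This is only an illustration: the `embed` function below is a toy bag-of-words stand-in for a real embedding model, the brute-force loop stands in for an index such as FAISS, and the chunk texts are invented.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query, chunks, k=2):
    """Return the k chunks most similar to the query as (score, chunk) pairs."""
    q = embed(query)
    scored = [(cosine(q, embed(c)), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Invented PDF chunks for illustration.
chunks = [
    "the invoice total is listed in table 3",
    "scanned pages require ocr before indexing",
    "multi-column layouts change the reading order",
]
results = top_k("where is the invoice total", chunks, k=1)
```

In the full pipeline the retrieved chunks would then be passed to the language model as context for answer generation.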

parth126 commented 20 hours ago

@Sobhan-202318040 Please upload a block diagram of the entire end to end system for better explanation. You can attach it here. There are several issues with your explanation, and putting the entire pipeline on paper will help you identify those.
