parth126 / IT550

Project Proposals for the IT-550 Course (Autumn 2024)

Autoregressive Search Engines: Generating Substrings as Document Identifiers #11

Open Nisarg13 opened 2 months ago

Nisarg13 commented 2 months ago

Title

Autoregressive Search Engines: Generating Substrings as Document Identifiers

Team Name

Autoregressive Seekers

Email

nisargganatra13@gmail.com

Team Member 1 Name

Nisarg Ganatra

Team Member 1 Id

202311018

Team Member 2 Name

Harsh Raval

Team Member 2 Id

202311028

Team Member 3 Name

Shubham Shah

Team Member 3 Id

202311049

Team Member 4 Name

Dhyey Bhimani

Team Member 4 Id

202311004

Category

Reproducibility

Problem Statement

Traditional document retrieval systems rely either on predefined identifiers (e.g., titles or hierarchical structures) or on dense vector embeddings, and both limit how precisely relevant passages can be retrieved for a user query. Neither approach fully exploits modern autoregressive language models, which capture word order and can generate contextually relevant substrings (n-grams). In addition, many current approaches have a high memory footprint, especially on large document corpora.

Evaluation Strategy

Retrieval accuracy will be measured with Precision@k and R-Precision against ground-truth data. Memory efficiency will be assessed by comparing the footprint of the FM-index used by SEAL (the system proposed in the paper) with that of traditional vector-based systems, with the aim of showing lower memory usage. Performance will be benchmarked on datasets such as KILT and Natural Questions, and the quality of the retrieved passages will be tested on downstream tasks such as question answering to check for improvements over existing methods.
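The two retrieval metrics named above can be sketched as follows. This is a minimal illustration using the generic textbook definitions, not the official KILT evaluation script; the function names and the passage ids are hypothetical.

```python
from typing import Sequence, Set


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved ids that are in the relevant set."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    return hits / k


def r_precision(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Precision at R, where R is the number of relevant ids for the query."""
    r = len(relevant)
    if r == 0:
        return 0.0  # assumption: queries without ground truth score zero
    return precision_at_k(ranked, relevant, r)


# Hypothetical ranking for one query, with two ground-truth passages.
ranked_ids = ["p3", "p1", "p7", "p2"]
gold_ids = {"p1", "p2"}
print(precision_at_k(ranked_ids, gold_ids, k=4))
print(r_precision(ranked_ids, gold_ids))
```

In a full evaluation these per-query scores would be averaged over all queries in the benchmark split.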

Dataset

https://github.com/facebookresearch/seal

Resources

Paper Title: Autoregressive Search Engines: Generating Substrings as Document Identifiers

Paper Link: https://arxiv.org/abs/2204.10628

parth126 commented 2 months ago
shubham-591 commented 2 months ago

1) KILT (Knowledge Intensive Language Tasks) is a benchmark that brings together several NLP tasks that require retrieving knowledge from a large corpus. It covers knowledge-intensive tasks such as question answering, fact-checking, dialogue, slot filling, and entity linking, all of which rely on extracting information from a unified corpus (a Wikipedia dump). These tasks closely resemble real-world applications where retrieving precise and relevant information is crucial.

2) The paper addresses the challenge of improving retrieval precision by focusing on passage-level retrieval. Instead of retrieving entire documents, the aim is to retrieve specific passages or n-grams (substrings) from documents that directly answer a query or provide the needed information.
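The idea of using n-grams as passage identifiers can be illustrated with a toy sketch. It assumes a plain in-memory dictionary in place of the FM-index, a hand-written list of n-grams in place of language-model output, and a simple match count in place of the paper's scoring; all names and passages below are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

NgramIndex = Dict[Tuple[str, ...], Set[str]]


def build_ngram_index(passages: Dict[str, str], max_n: int = 3) -> NgramIndex:
    """Map every word n-gram (up to max_n words) to the passages containing it."""
    index: NgramIndex = defaultdict(set)
    for pid, text in passages.items():
        tokens = text.lower().split()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                index[tuple(tokens[i:i + n])].add(pid)
    return index


def rank_passages(index: NgramIndex, ngrams: List[str]) -> List[Tuple[str, int]]:
    """Score each passage by how many of the given n-grams it contains."""
    scores: Dict[str, int] = defaultdict(int)
    for ngram in ngrams:
        key = tuple(ngram.lower().split())
        for pid in index.get(key, ()):
            scores[pid] += 1
    # Highest score first; ties broken by passage id for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Toy corpus; in the real system the n-grams would be generated by the model.
corpus = {
    "p1": "The Eiffel Tower is in Paris",
    "p2": "Paris is the capital of France",
    "p3": "Berlin is the capital of Germany",
}
index = build_ngram_index(corpus)
print(rank_passages(index, ["capital of", "paris"]))
```

A dictionary of all n-grams grows quickly with corpus size, which is the kind of memory cost a compressed structure such as the FM-index is meant to avoid.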