Open Nisarg13 opened 2 months ago
1) KILT (Knowledge Intensive Language Tasks) is a benchmark that brings together several NLP tasks that require retrieving knowledge from a large corpus. It covers question answering, fact-checking, dialogue, slot filling, and entity linking, all of which draw their evidence from a single shared corpus (a Wikipedia dump). These tasks closely resemble real-world applications where retrieving precise and relevant information is crucial.
2) The paper addresses the challenge of improving retrieval precision by working at the passage level. Instead of retrieving entire documents, the model generates n-grams (substrings) that occur in the corpus and uses them as identifiers to locate the specific passages that answer a query or provide the needed information.
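To make point 2 concrete, here is a minimal sketch of using n-grams as passage identifiers. It is a naive stand-in, not SEAL's implementation: a plain Python dictionary replaces the FM-Index, the n-grams are hand-written rather than generated by a language model, and scoring by n-gram length is an assumption made only for this example. The names `build_ngram_index` and `retrieve` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def build_ngram_index(passages: Dict[str, str], max_n: int = 3) -> Dict[Tuple[str, ...], Set[str]]:
    """Map every word n-gram (up to max_n words) to the ids of passages containing it."""
    index: Dict[Tuple[str, ...], Set[str]] = defaultdict(set)
    for pid, text in passages.items():
        tokens = text.lower().split()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                index[tuple(tokens[i:i + n])].add(pid)
    return index


def retrieve(index: Dict[Tuple[str, ...], Set[str]], ngrams: List[str]) -> List[Tuple[str, int]]:
    """Rank passages by the total length of the given n-grams they contain.

    Length-based scoring is an illustrative assumption, not the paper's scoring.
    """
    scores: Dict[str, int] = defaultdict(int)
    for ngram in ngrams:
        key = tuple(ngram.lower().split())
        # .get avoids creating empty entries for n-grams absent from the corpus
        for pid in index.get(key, ()):
            scores[pid] += len(key)
    # Highest score first; ties broken by passage id for determinism
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


passages = {
    "p1": "the eiffel tower is in paris",
    "p2": "paris is the capital of france",
    "p3": "the tower of london is in england",
}
index = build_ngram_index(passages)
# Hand-written n-grams standing in for what a language model might generate
ranking = retrieve(index, ["eiffel tower", "in paris", "tower"])
print(ranking)
```

In this toy corpus only `p1` contains all three n-grams, so it ranks first; `p3` matches just the unigram "tower".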
Title
Autoregressive Search Engines: Generating Substrings as Document Identifiers
Team Name
Autoregressive Seekers
Email
nisargganatra13@gmail.com
Team Member 1 Name
Nisarg Ganatra
Team Member 1 Id
202311018
Team Member 2 Name
Harsh Raval
Team Member 2 Id
202311028
Team Member 3 Name
Shubham Shah
Team Member 3 Id
202311049
Team Member 4 Name
Dhyey Bhimani
Team Member 4 Id
202311004
Category
Reproducibility
Problem Statement
Traditional document retrieval systems, whether they rely on predefined identifiers (e.g., titles or hierarchical structures) or on dense vector embeddings, are limited in how precisely they can retrieve the passages relevant to a user query. They also do not fully exploit modern autoregressive language models, which are sensitive to word order and can generate contextually relevant substrings (n-grams). In addition, many current approaches have high memory usage, especially on large document corpora.
Evaluation Strategy
Retrieval accuracy will be measured with metrics such as Precision@k and R-Precision, comparing the passages the system retrieves against ground-truth data. Memory efficiency will be assessed by comparing the footprint of SEAL’s FM-Index with that of traditional vector-based systems, with the expectation of lower memory usage. Performance will be benchmarked on datasets such as KILT and Natural Questions, and the quality of the retrieved passages will be tested on downstream tasks such as question answering to check whether the improvements over existing methods are reproduced.
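As a reference for the two retrieval metrics named above, the following sketch uses their standard definitions: Precision@k is the fraction of the top k results that are relevant, and R-Precision is Precision@R where R is the number of relevant items for the query. The function names and the toy data are hypothetical, and the ranked list is assumed to contain no duplicate ids. The official KILT evaluation scripts may differ in detail.

```python
from typing import Sequence, Set


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k ranked ids that appear in the relevant set."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    # Divide by k even if fewer than k results were returned
    return hits / k


def r_precision(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Precision at R, where R is the number of relevant ids for the query."""
    r = len(relevant)
    if r == 0:
        return 0.0  # assumption: queries with no ground truth score zero
    return precision_at_k(ranked, relevant, r)


# Toy example: system ranking vs. ground-truth passage ids
ranked = ["p3", "p1", "p7", "p2"]
relevant = {"p1", "p2"}
print(precision_at_k(ranked, relevant, 1))
print(precision_at_k(ranked, relevant, 2))
print(r_precision(ranked, relevant))
```

Averaging these per-query values over a dataset gives the corpus-level score.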
Dataset
https://github.com/facebookresearch/seal
Resources
Paper title: Autoregressive Search Engines: Generating Substrings as Document Identifiers
Paper link: https://arxiv.org/abs/2204.10628