Open adi3025 opened 2 months ago
The problem statement needs to be better defined.
Suggestion: find a good paper and implement it as a reproducibility project.
RAKE and YAKE will be used as baselines; new approaches should aim to beat those baselines. Please mention the dataset here.
Title
Document Tagging System for Information Retrieval and Relevance
Team Name
DataNavigatorsss
Email
202103025@daiict.ac.in
Team Member 1 Name
Aditya Bhatt
Team Member 1 Id
202103025
Team Member 2 Name
Vansh Shah
Team Member 2 Id
202318051
Team Member 3 Name
Dhwani Gandhi
Team Member 3 Id
202311071
Team Member 4 Name
Swara Desai
Team Member 4 Id
202311005
Category
Reproducibility
Problem Statement
We will replace RAKE with transformer models such as BERT to capture deeper contextual meaning during keyword extraction. We propose using supervised learning models, such as SVMs or neural networks, trained on labeled datasets to improve keyword precision, and graph-based methods for domain classification. To improve tag diversity, we will experiment with clustering word embeddings to generate more semantically rich tags. We will also compare all of the models we use.
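For reference, the RAKE scoring that the proposed models are meant to improve on can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation used in the referenced papers: the stopword list is a tiny placeholder, the tokenization is simplified, and the function names `candidate_phrases` and `rake_scores` are hypothetical.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (an assumption; real RAKE setups use a much longer one).
STOPWORDS = {"a", "an", "and", "are", "as", "by", "for", "from",
             "in", "is", "of", "on", "the", "to", "we", "with"}


def candidate_phrases(text):
    """Split text into candidate phrases at punctuation and stopwords."""
    phrases = []
    for fragment in re.split(r"[.,;:!?()\[\]\n]+", text.lower()):
        current = []
        for word in re.findall(r"[a-z0-9][a-z0-9\-]*", fragment):
            if word in STOPWORDS:
                # A stopword closes the phrase collected so far.
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))
    return phrases


def rake_scores(text):
    """Score each candidate phrase as the sum of deg(w) / freq(w) over its words."""
    phrases = candidate_phrases(text)
    freq = defaultdict(int)    # how often each word occurs in candidates
    degree = defaultdict(int)  # co-occurrence degree, counting the word itself
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)
    word_score = {w: degree[w] / freq[w] for w in freq}
    scores = {" ".join(p): sum(word_score[w] for w in p) for p in set(phrases)}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    sample = ("Keyword extraction from research papers. "
              "Keyword extraction is useful for document tagging.")
    for phrase, score in rake_scores(sample):
        print(f"{score:.1f}  {phrase}")
```

Because the score depends only on co-occurrence counts within the document, it carries no notion of context or meaning, which is the gap the transformer-based approach is intended to address.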
Evaluation Strategy
1. Accuracy of Tagging: The system will be evaluated using the precision and recall of the generated tags. We compare the system-generated tags against a gold-standard dataset of documents with pre-annotated tags. Precision is the percentage of system-assigned tags that are relevant, whereas recall is the percentage of relevant tags that the system captures.
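The precision and recall computation described above can be sketched as follows. This is a minimal example that assumes exact matching after lowercasing and whitespace trimming; the function name `precision_recall` and the sample tags are hypothetical, and a real evaluation might match on stemmed or partially overlapping phrases instead.

```python
def precision_recall(predicted, gold):
    """Return (precision, recall) for one document's tags.

    Tags are compared as exact strings after lowercasing and trimming,
    which is an assumption of this sketch.
    """
    pred = {t.strip().lower() for t in predicted}
    ref = {t.strip().lower() for t in gold}
    hits = len(pred & ref)
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(ref) if ref else 0.0
    return precision, recall


if __name__ == "__main__":
    system_tags = ["Information Retrieval", "BERT", "graph theory", "tagging"]
    gold_tags = ["information retrieval", "keyword extraction", "tagging"]
    p, r = precision_recall(system_tags, gold_tags)
    print(f"precision={p:.2f} recall={r:.2f}")
```

Here two of the four system tags appear in the gold set, and two of the three gold tags are recovered.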
Dataset
The dataset we will use is PubTag. It consists of key phrases extracted from research papers by 12 Computer Science professors, sourced from DBLP (Digital Bibliography and Library Project). It includes 1,126 key phrases manually scored for relevance on a scale from 1 (not relevant) to 5 (very relevant), an evaluation of four learning-to-rank frameworks, and the raw text data used for evaluation. We will use this labeled dataset to train supervised learning models such as SVMs and neural networks, and to support experiments with methods such as BERT for keyword extraction and graph-based domain classification.
https://github.com/paulariosaraya/cstagclouds
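Since the relevance scores are graded from 1 to 5 while precision and recall need a binary notion of "relevant", one possible way to derive gold tags is to apply a cutoff. The sketch below is only an illustration: the `(phrase, score)` record layout, the function name `gold_phrases`, the sample phrases, and the threshold of 4 are all assumptions, not the actual PubTag file format or a choice made in the referenced papers.

```python
def gold_phrases(records, threshold=4):
    """Keep phrases whose manual relevance score meets the threshold.

    `records` is an iterable of (phrase, score) pairs with scores in 1..5.
    Both the layout and the default threshold are assumptions of this sketch.
    """
    return {phrase.strip().lower() for phrase, score in records if score >= threshold}


if __name__ == "__main__":
    # Hypothetical scored phrases, not taken from the dataset.
    scored = [("semantic web", 5), ("ontology", 3), ("query processing", 4), ("paper", 1)]
    print(sorted(gold_phrases(scored)))
```

The threshold directly affects the reported recall, so it would need to be fixed before comparing models.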
Resources
Paper title - A model for auto-tagging of research papers based on key phrase extraction methods
Paper link - https://ieeexplore.ieee.org/document/8126087
Paper title - PubTag research tag clouds
Paper link - https://ieeexplore.ieee.org/document/8609671