parth126 / IT550

Project Proposals for the IT-550 Course (Autumn 2024)

Document Tagging System for Information Retrieval and Relevance (Tagging and Text Summarization) #22

Open adi3025 opened 2 months ago

adi3025 commented 2 months ago

Title

Document Tagging System for Information Retrieval and Relevance

Team Name

DataNavigatorsss

Email

202103025@daiict.ac.in

Team Member 1 Name

Aditya Bhatt

Team Member 1 Id

202103025

Team Member 2 Name

Vansh Shah

Team Member 2 Id

202318051

Team Member 3 Name

Dhwani Gandhi

Team Member 3 Id

202311071

Team Member 4 Name

Swara Desai

Team Member 4 Id

202311005

Category

Reproducibility

Problem Statement

We will replace RAKE with transformer models such as BERT to capture deeper contextual meaning during keyword extraction. We propose using supervised learning models, such as SVMs or neural networks, trained on labeled datasets to improve keyword precision, together with graph-based methods for domain classification. To improve tag diversity, we will experiment with clustering word embeddings to generate more semantically rich tags. We will also compare the models we use.
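For context, the RAKE baseline that this proposal aims to improve on can be sketched in a few lines. This is a simplified illustration, not the reference implementation: the stopword list is a tiny hypothetical one, and the function names `candidate_phrases` and `rake_scores` are made up for this example. It splits text into candidate phrases at stopwords and punctuation, scores each word as degree divided by frequency, and scores each phrase as the sum of its word scores.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (an assumption; real RAKE runs use a much larger one).
STOPWORDS = {"a", "an", "and", "are", "as", "by", "for", "from", "in",
             "is", "of", "on", "the", "to", "we", "will", "with"}

def candidate_phrases(text):
    """Split text into candidate phrases at punctuation and stopwords."""
    phrases = []
    for fragment in re.split(r"[^\w\s-]+", text.lower()):
        current = []
        for word in fragment.split():
            if word in STOPWORDS:
                if current:
                    phrases.append(tuple(current))
                    current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))
    return phrases

def rake_scores(text):
    """Return (phrase, score) pairs sorted by descending RAKE-style score."""
    phrases = candidate_phrases(text)
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            # Degree counts co-occurring words in the phrase, including the word itself.
            degree[word] += len(phrase)
    word_score = {w: degree[w] / freq[w] for w in freq}
    scores = {" ".join(p): sum(word_score[w] for w in p) for p in phrases}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

ranked = rake_scores("deep keyword extraction models and extraction of tags")
for phrase, score in ranked:
    print(f"{score:5.1f}  {phrase}")
```

Because the score depends only on co-occurrence counts inside the document, it has no notion of context or meaning, which is the gap the transformer-based approach is meant to address.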

Evaluation Strategy

1. Accuracy of Tagging: The precision and recall of the generated tags will be used to evaluate the system. We compare the system-generated tags against a gold-standard dataset of documents with pre-annotated tags. Precision is the percentage of system-assigned tags that are relevant, whereas recall is the percentage of all relevant tags that the system captures.
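A minimal sketch of this evaluation for a single document is shown below. It assumes exact matching of tags after lowercasing and trimming whitespace; the function name `tag_precision_recall` and the example tags are hypothetical, and the actual project may use a looser matching rule.

```python
def tag_precision_recall(predicted, gold):
    """Return (precision, recall) for one document's tags, using exact set matching."""
    pred = {t.strip().lower() for t in predicted}
    ref = {t.strip().lower() for t in gold}
    hits = len(pred & ref)
    # Guard against empty sets to avoid division by zero.
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(ref) if ref else 0.0
    return precision, recall

# Hypothetical tags for illustration only.
predicted_tags = ["information retrieval", "BERT", "keyword extraction", "databases"]
gold_tags = ["information retrieval", "keyword extraction", "tag clouds"]

p, r = tag_precision_recall(predicted_tags, gold_tags)
print(f"precision={p:.2f} recall={r:.2f}")
```

Here two of the four predicted tags are correct and two of the three gold tags are recovered. Averaging these per-document values over the whole gold-standard set would give system-level figures.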

Dataset

The dataset we are going to use is PubTag. It consists of key phrases extracted from research papers by 12 Computer Science professors, sourced from DBLP (Digital Bibliography and Library Project). It includes 1,126 key phrases manually scored for relevance on a scale from 1 (not relevant) to 5 (very relevant), an evaluation of four learning-to-rank frameworks, and the raw text data used for evaluation. We will use this labeled dataset to train supervised learning models such as SVMs and neural networks, and to support experiments with advanced methods such as BERT for keyword extraction and graph-based domain classification.

https://github.com/paulariosaraya/cstagclouds

Resources

Paper title - A model for auto-tagging of research papers based on key phrase extraction methods
Paper link - https://ieeexplore.ieee.org/document/8126087

Paper title - PubTag research tag clouds
Paper link - https://ieeexplore.ieee.org/document/8609671

parth126 commented 2 months ago

Problem statement needs to be better defined:

parth126 commented 2 months ago

Suggested to find a good paper and implement this as a reproducibility project.

parth126 commented 2 months ago

RAKE and YAKE will be used as baselines. New approaches should beat those baselines. Please mention the dataset here.