Document Tagging System for Information Retrieval and Relevance (Tagging and Text Summarization)

adi3025 commented 1 week ago

Title

Document Tagging System for Information Retrieval and Relevance

Team Name

DataNavigatorsss

Email

202103025@daiict.ac.in

Team Member 1 Name

Aditya Bhatt

Team Member 1 Id

202103025

Team Member 2 Name

Vansh Shah

Team Member 2 Id

202318051

Team Member 3 Name

Dhwani Gandhi

Team Member 3 Id

202311071

Team Member 4 Name

Swara Desai

Team Member 4 Id

202311005

Problem Statement

We will replace RAKE with transformer models such as BERT to capture deeper contextual meaning for keyword extraction. To improve keyword precision, we propose supervised learning models, such as SVMs or neural networks, trained on labeled datasets, along with graph-based methods for domain classification. To improve tag diversity, we will experiment with clustering word embeddings to generate more semantically rich tags. We will also compare the models we use.
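
As a rough illustration of the tag-diversity idea, the sketch below groups candidate tags whose embeddings are close and keeps one representative per group. It is only one simple reading of "word embedding clustering": the greedy grouping rule, the `threshold` value, the function names, and the toy 3-dimensional vectors are all assumptions for illustration. Real embeddings would come from a model such as BERT.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def group_tags(candidates, threshold=0.8):
    """Greedily group candidate tags by embedding similarity.

    candidates: list of (tag, score, vector) tuples (hypothetical format).
    Tags are visited from highest to lowest score. A tag joins the most
    similar existing group if that similarity reaches `threshold`;
    otherwise it starts a new group and becomes its representative.
    Returns a dict mapping each representative to its group members.
    """
    groups = {}    # representative tag -> list of member tags
    rep_vecs = {}  # representative tag -> embedding vector
    for tag, _score, vec in sorted(candidates, key=lambda c: c[1], reverse=True):
        best_rep, best_sim = None, threshold
        for rep, rep_vec in rep_vecs.items():
            sim = cosine(vec, rep_vec)
            if sim >= best_sim:
                best_rep, best_sim = rep, sim
        if best_rep is None:
            groups[tag] = [tag]
            rep_vecs[tag] = vec
        else:
            groups[best_rep].append(tag)
    return groups

# Toy vectors, invented for this example only.
candidates = [
    ("neural network", 0.9, [0.90, 0.10, 0.00]),
    ("neural networks", 0.7, [0.88, 0.12, 0.00]),
    ("information retrieval", 0.8, [0.10, 0.90, 0.10]),
    ("document retrieval", 0.5, [0.15, 0.85, 0.20]),
    ("graph theory", 0.6, [0.00, 0.10, 0.95]),
]

groups = group_tags(candidates)
diverse_tags = list(groups)  # one representative per group
print(diverse_tags)
```

Keeping only the representatives avoids emitting near-duplicate tags such as "neural network" and "neural networks". A standard clustering algorithm such as k-means could replace the greedy rule.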

Evaluation Strategy

1. Accuracy of Tagging: The system will be evaluated using the precision and recall of the generated tags. We compare the system-generated tags against a gold-standard dataset of documents with pre-annotated tags. Precision is the percentage of system-assigned tags that are relevant, whereas recall measures how many of the relevant tags the system captured.

ROUGE-Precision: Assesses the proportion of system-generated tags that are relevant compared to the reference tags.
ROUGE-Recall: Measures the proportion of reference tags successfully captured by the system-generated tags.
ROUGE-F1: Provides a balanced evaluation between ROUGE-Precision and ROUGE-Recall.

2. Jaccard Similarity: This is a more flexible metric that measures the overlap between the set of system-generated tags and the reference tags. It calculates the ratio of the intersection to the union of the two tag sets. Jaccard Similarity is useful when partial matches are acceptable, and it provides a more lenient measure of tag accuracy.

3. Exact Match Ratio: This metric checks whether the system-generated tags exactly match the reference set of tags for a document. It is a strict measure that only rewards a complete and perfect match. While it can be useful for evaluating the overall correctness of tags, it may be overly stringent, as even minor differences will result in failure.
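
The metrics above can be sketched for a single document as follows. This is a minimal illustration, not the final evaluation code: it assumes tags are compared as whole strings after lowercasing and trimming (ROUGE proper works on n-gram overlap, so this is a set-level simplification), and the function name `tag_metrics` and the example tags are hypothetical.

```python
def tag_metrics(predicted, reference):
    """Compare system tags with reference tags for one document.

    Assumption: two tags match only if they are identical after
    lowercasing and stripping surrounding whitespace.
    """
    pred = {t.strip().lower() for t in predicted}
    ref = {t.strip().lower() for t in reference}
    overlap = pred & ref
    union = pred | ref

    precision = len(overlap) / len(pred) if pred else 0.0
    recall = len(overlap) / len(ref) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    # Two empty tag sets are treated as identical (an assumption).
    jaccard = len(overlap) / len(union) if union else 1.0

    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "jaccard": jaccard,
        "exact_match": pred == ref,
    }

# Invented example tags, for illustration only.
predicted = ["information retrieval", "BERT", "keyword extraction", "graphs"]
reference = ["Information Retrieval", "keyword extraction", "document tagging"]

metrics = tag_metrics(predicted, reference)
for name, value in metrics.items():
    print(name, value)
```

Here two of the four predicted tags appear among the three reference tags, so precision is 0.5, recall is 2/3, and Jaccard similarity is 2/5. The Exact Match Ratio over a corpus would be the fraction of documents for which `exact_match` is true.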

Dataset

The dataset we will use is PubTag. It consists of key phrases extracted from research papers by 12 Computer Science professors, sourced from DBLP (Digital Bibliography and Library Project). It includes 1,126 key phrases manually scored for relevance on a scale from 1 (not relevant) to 5 (very relevant), an evaluation of four learning-to-rank frameworks, and the raw text data used for evaluation. This labeled dataset will be used to train supervised learning models such as SVMs and neural networks, and to support experiments with advanced methods such as BERT for keyword extraction and graph-based domain classification.

https://github.com/paulariosaraya/cstagclouds

Resources

Paper title: A model for auto-tagging of research papers based on key phrase extraction methods
Paper link: https://ieeexplore.ieee.org/document/8126087

Paper title: PubTag research tag clouds
Paper link: https://ieeexplore.ieee.org/document/8609671

parth126 commented 1 week ago

Problem statement needs to be better defined:

Is this a KP identification problem or KP based matching problem?
What dataset will be used?
Possible good read: https://aclanthology.org/2023.findings-eacl.161.pdf

parth126 commented 4 days ago

Suggested to find a good paper and implement this as a reproducibility project.

parth126 commented 4 days ago

RAKE and YAKE will be used as baselines. New approaches should beat those baselines. Please mention the dataset here.

parth126 / IT550