[An Online Document Retrieval and Classification System based on Information Retrieval and Machine Learning Techniques]

yrm14 commented 1 month ago

Title

An Online Document Retrieval and Classification System based on Information Retrieval and Machine Learning Techniques

Team Name

IRFighters

Email

202103045@daiict.ac.in

Team Member 1 Name

Priyesh Tandel

Team Member 1 Id

202101222

Team Member 2 Name

Keertivardhan Goyal

Team Member 2 Id

202103007

Team Member 3 Name

Yash Mashru

Team Member 3 Id

202103045

Team Member 4 Name

Sanchit Satija

Team Member 4 Id

202103054

Problem Statement

The project aims to develop a web crawler and document classification system using Information Retrieval (IR) and Machine Learning (ML) techniques. The system focuses on retrieving and classifying specific types of documents, such as syllabi, from the web, using seed URLs to guide the crawler and a trained classifier to identify relevant pages.
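As a rough illustration of the seed-driven crawl described above, here is a minimal Python sketch using only the standard library. It is not the team's implementation: `crawl`, `fetch`, and `is_relevant` are hypothetical names, `fetch` stands in for whatever HTTP client is used, and `is_relevant` stands in for the trained classifier. Politeness delays and robots.txt handling are deliberately left out.

```python
from collections import deque
from html.parser import HTMLParser
from typing import Callable, Iterable, List, Optional
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collects the href values of all <a> tags in a page."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(
    seed_urls: Iterable[str],
    fetch: Callable[[str], Optional[str]],  # hypothetical: returns HTML or None
    is_relevant: Callable[[str], bool],     # hypothetical: stand-in for the classifier
    max_pages: int = 100,
) -> List[str]:
    """Breadth-first crawl from the seeds; returns URLs judged relevant."""
    seeds = list(seed_urls)
    frontier = deque(seeds)
    seen = set(seeds)
    relevant: List[str] = []
    fetched = 0

    while frontier and fetched < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        fetched += 1
        if html is None:
            continue  # page could not be retrieved
        if is_relevant(html):
            relevant.append(url)

        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop #fragments before deduplicating.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute.startswith(("http://", "https://")) and absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)

    return relevant
```

Because `fetch` is passed in, the loop can be exercised offline with a dictionary of fake pages (for example `crawl(seeds, pages.get, my_predicate)`) before it is pointed at real sites.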

Evaluation Strategy

The classifier will be evaluated using precision, recall, and F1-score, measured by cross-validation on the training data. For overall system evaluation, a random sample of pages classified as syllabi will be manually checked to estimate system precision. Recall will be estimated from the percentage of relevant documents in a random sample of all crawled pages.
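The metrics named above can be sketched in a few lines of Python. The function names are hypothetical, and the recall estimate in particular is only one possible reading of the sampling idea: it assumes the sampled share of relevant pages is scaled up to the whole crawl and compared with the estimated number of correctly retrieved syllabi.

```python
from typing import Sequence, Tuple


def precision_recall_f1(
    y_true: Sequence[int], y_pred: Sequence[int]
) -> Tuple[float, float, float]:
    """Precision, recall and F1 for the positive class (1 = syllabus)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return precision, recall, f1


def estimate_system_precision(manual_labels: Sequence[int]) -> float:
    """Share of manually checked 'syllabus' predictions that are correct."""
    return sum(manual_labels) / len(manual_labels) if manual_labels else 0.0


def estimate_system_recall(
    est_precision: float,
    n_predicted_positive: int,
    sample_relevant_fraction: float,
    n_crawled: int,
) -> float:
    """Assumed estimator: (correct retrievals) / (relevant pages in the crawl).

    sample_relevant_fraction is the share of relevant pages found in a
    random sample of all crawled pages.
    """
    est_true_positives = est_precision * n_predicted_positive
    est_relevant_total = sample_relevant_fraction * n_crawled
    if not est_relevant_total:
        return 0.0
    return min(1.0, est_true_positives / est_relevant_total)
```

For cross-validation, `precision_recall_f1` would be applied to each held-out fold and the per-fold scores averaged. Both sample-based estimates carry sampling error, so reporting the sample sizes alongside them helps readers judge how reliable they are.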

Dataset

https://github.com/yrm14/IR-Datasets/blob/main/dataset_IR.csv

Resources

[1] Andrew Luxton-Reilly, Brett A. Becker, Yingjun Cao, Roger McDermott, Claudio Mirolo, Andreas Mühling, Andrew Petersen, Kate Sanders, Simon, and Jacqueline Whalley. Developing Assessments to Determine Mastery of Programming Fundamentals. In Proceedings of the 2017 ACM Conference on Innovation and Technology in Computer Science Education, ITiCSE '17, page 388, New York, NY, USA, 2017. ACM.

[2] D. Eichmann. The RBSE spider: Balancing effective search against Web load. Computer Networks and ISDN Systems, 27(2):308, 1994.

[3] Allan Heydon and Marc Najork. Mercator: A scalable, extensible Web crawler. World Wide Web, 2:219–229, 1999.

[4] Leonard Richardson. Beautiful Soup Documentation. https://www.crummy.com/software/BeautifulSoup/bs4/doc/, pages 1–72, 2016.

[5] Junghoo Cho. Efficient crawling through URL ordering. Computer Networks and ISDN Systems, 30(1-7):161–172, 1998.

parth126 commented 1 month ago

[Possibly GPT-Generated Problem Statement]

The problem is vague. How do you plan to evaluate a dataset that is crawled? Will it have relevance judgements? What will be the contribution apart from crawling the dataset?

Try to answer these questions by tomorrow. In the current state, this problem definition is not acceptable.

yrm14 commented 1 month ago

Respected Prof. Parth Mehta, my name is Yash Mashru from the MnC batch of 2021, student ID 202103045. I am writing on behalf of my group (the group for Issue 5). We are extremely sorry about the problem statement. We would like to discuss it with you in detail over Google Meet, if you would allow it. Hoping to hear from you soon.

Regards, Yash Mashru

(In reply to Parth Mehta's comment of Tue, Sep 17, 2024, 19:26.)

parth126 commented 1 month ago

Kindly email me for making appointments. This is not the right forum to do that.

parth126 / IT550